useSpeechToText
Speech to text (STT) converts spoken audio into written text. This hook allows you to implement features like voice assistants, real-time transcription, and audio file processing directly on-device.
We recommend using our optimized models available on Hugging Face. You can also use pre-defined constants included in the library.
API Reference
Basic Usage (File Transcription)
Use transcribe for processing pre-recorded audio or short clips. The input should be a Float32Array of audio samples at 16 kHz.
Transcribe Options
The transcribe() function accepts an optional configuration object:
language: The language code (e.g.,'es','fr'). Required for multilingual models.verbose: Iftrue, the method returns a detailedTranscriptionResultobject following the OpenAI Whisperverbose_jsonformat (including segments and word-level timestamps).
In this example, we use react-native-audio-api to decode an audio file into the required format.
Example
import { models, useSpeechToText } from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';
import * as FileSystem from 'expo-file-system';
const model = useSpeechToText({
model: models.speech_to_text.whisper_tiny_en(), // Use whisper_tiny_en for English or whisper_tiny for multilingual support
});
// 1. Get audio file
const { uri } = await FileSystem.downloadAsync(
'https://some-audio-url.com/file.mp3',
`${FileSystem.cacheDirectory}audio_file`
);
// 2. Decode to 16kHz PCM Float32Array
const audioContext = new AudioContext({ sampleRate: 16000 });
const decodedAudioData = await audioContext.decodeAudioData(uri);
const audioBuffer = decodedAudioData.getChannelData(0);
// 3. Transcribe
try {
const result = await model.transcribe(audioBuffer);
console.log('Transcription:', result.text);
} catch (error) {
console.error('Transcription failed:', error);
}
Live Streaming Transcription
For real-time applications or audio streams of arbitrary length, use the Streaming API. This is optimized for live input, handling the 30-second window limitation of Whisper models automatically to ensure context isn't lost between chunks.
How it works:
- Feed audio: Use
streamInsertto push small chunks of audio (e.g., 100ms) as they arrive from the microphone. - Get results: The
streamgenerator yields two types of text:committed: Finalized text that won't change.nonCommitted: Temporary text that might update as the model gets more context from the audio.
Streaming Options
The stream() function accepts several optional parameters:
language: The language code (e.g.,'es','fr'). Required for multilingual models.verbose: Iftrue, includes word-level timestamps and segment metadata in the result objects.useVAD: Enable the Voice Activity Detection submodule (if configured inuseSpeechToTextprops) to optimize performance by filtering silence. Defaults tofalse.timeout: (Advanced) The interval (in milliseconds) between processing consecutive audio chunks in streaming mode. Lower values provide more frequent updates and lower latency, while higher values reduce CPU consumption. Defaults to100.vadDetectionMargin: (Advanced) The duration of silence (in milliseconds) required after speech is detected before "committing" a segment. Defaults to500. Only active when VAD module is used.
Voice Activity Detection (VAD)
Integrating a VAD submodule is highly recommended for streaming. It improves performance by automatically removing silence, which reduces CPU usage, saves battery, and prevents the model from "hallucinating" text during silent periods.
Example
import React, { useEffect, useState, useRef } from 'react';
import { Text, Button, View, SafeAreaView } from 'react-native';
import { models, useSpeechToText } from 'react-native-executorch';
import { AudioManager, AudioRecorder } from 'react-native-audio-api';
export default function LiveTranscriber() {
const model = useSpeechToText({
model: models.speech_to_text.whisper_tiny_en(),
vad: models.vad.fsmn_vad(),
});
const [text, setText] = useState('');
const isRecordingRef = useRef(false);
const [recorder] = useState(() => new AudioRecorder());
const startLiveStreaming = async () => {
isRecordingRef.current = true;
setText('');
// 2. Capture microphone input
recorder.onAudioReady(
{ sampleRate: 16000, bufferLength: 1600, channelCount: 1 },
(chunk) => model.streamInsert(chunk.buffer.getChannelData(0))
);
await recorder.start();
// 3. Process the stream with VAD enabled
try {
let finalizedText = '';
const streamIter = model.stream({
verbose: false,
useVAD: true, // Enable VAD filter
vadDetectionMargin: 500, // Wait for 500ms of silence before committing
});
for await (const { committed, nonCommitted } of streamIter) {
if (!isRecordingRef.current) break;
if (committed.text) finalizedText += committed.text;
setText(finalizedText + nonCommitted.text);
}
} catch (error) {
console.error('Streaming error:', error);
}
};
const stopLiveStreaming = () => {
isRecordingRef.current = false;
recorder.stop();
model.streamStop();
};
return (
<SafeAreaView>
<Text>{text || 'Press start and speak...'}</Text>
<Button
onPress={startLiveStreaming}
title="Start Live"
disabled={model.isGenerating}
/>
<Button onPress={stopLiveStreaming} title="Stop" color="red" />
</SafeAreaView>
);
}
Advanced Features
Multilingual Transcription
To transcribe languages other than English, use a multilingual model (e.g., models.speech_to_text.whisper_tiny()) and specify the corresponding language code:
// Transcribe in Spanish
const model = useSpeechToText({
model: models.speech_to_text.whisper_tiny(),
});
const result = await model.transcribe(spanishAudio, { language: 'es' });
Timestamps & Metadata
Set verbose: true to receive word-level timestamps and confidence scores. The output follows the OpenAI Whisper verbose_json format.
const result = await model.transcribe(audioBuffer, { verbose: true });
// result.segments[0].words -> [{ word: "Hello", start: 0.5, end: 1.0 }, ...]
Configuration
Arguments
useSpeechToText accepts a configuration object:
model: Model source and tokenizer settings (see ModelConfig).preventLoad: (Optional) Iftrue, the model won't load until you callload().
Returns
The hook returns an object with:
transcribe(audio, options): One-shot transcription.stream(options): Async generator for streaming results.streamInsert(audio): Push audio to the stream buffer.streamStop(): Finish the current stream.isGenerating: Boolean indicating if the model is busy.loading: Boolean indicating if the model is being loaded.
Supported models
| Model | Language |
|---|---|
| whisper-tiny.en | English |
| whisper-tiny | Multilingual |
| whisper-base.en | English |
| whisper-base | Multilingual |
| whisper-small.en | English |
| whisper-small | Multilingual |