
useTextToSpeech

Text to speech is a task that transforms written text into spoken language. It is commonly used to implement features such as voice assistants, accessibility tools, and audiobooks.

warning

We recommend using the models provided by us, which are available in our Hugging Face repository. You can also use the constants shipped with our library.

API Reference

High Level Overview

You can play the generated waveform in whatever way suits you best; in the snippet below we use the react-native-audio-api library to play the synthesized speech.

import {
  useTextToSpeech,
  KOKORO_MEDIUM,
  KOKORO_VOICE_AF_HEART,
} from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';

const model = useTextToSpeech({
  model: KOKORO_MEDIUM,
  voice: KOKORO_VOICE_AF_HEART,
});

const audioContext = new AudioContext({ sampleRate: 24000 });

const handleSpeech = async (text: string) => {
  const speed = 1.0;
  const waveform = await model.forward({ text, speed });

  const audioBuffer = audioContext.createBuffer(1, waveform.length, 24000);
  audioBuffer.getChannelData(0).set(waveform);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start();
};
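
Note that the AudioContext and the audio buffer are both created at 24000 Hz, matching the 24 kHz sample rate of the waveform the model generates.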

Arguments

useTextToSpeech takes a TextToSpeechProps object as its argument.
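
Only model and voice appear in the examples on this page, so the following is a minimal, non-exhaustive sketch; see the API Reference for the full list of props:

const tts = useTextToSpeech({
  model: KOKORO_MEDIUM, // which TTS model to load
  voice: KOKORO_VOICE_AF_HEART, // which voice preset to synthesize with
});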

Need more details? Check the following resources:

Returns

useTextToSpeech returns an object of type TextToSpeechType containing a set of functions for interacting with the TTS module. For more details, see the TextToSpeechType API Reference.
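
As a rough sketch, here is the shape of the members used on this page; the parameter and return types below are assumptions, so treat the API Reference as authoritative:

// Partial, assumed shape of TextToSpeechType, limited to the members
// that appear on this page.
interface TextToSpeechTypeSketch {
  isReady: boolean;
  forward(input: { text: string; speed?: number }): Promise<Float32Array>;
  forwardFromPhonemes(input: {
    phonemes: string;
    speed?: number;
  }): Promise<Float32Array>;
  stream(input: {
    text: string;
    speed?: number;
    stopAutomatically?: boolean;
    onNext: (chunk: Float32Array) => void | Promise<void>;
  }): Promise<void>;
  streamFromPhonemes(input: {
    phonemes: string;
    speed?: number;
    onNext: (chunk: Float32Array) => void | Promise<void>;
  }): Promise<void>;
  streamInsert(text: string): void;
  streamStop(instant?: boolean): void;
}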

Running the model

The module provides two ways to generate speech: from raw text or from pre-generated phonemes.

Using Text

  1. forward({ text, speed }): Generates the complete audio waveform at once. Returns a promise resolving to a Float32Array.
  2. stream({ text, speed, stopAutomatically, onNext, ... }): An async generator-like interface (managed via callbacks such as onNext) that yields chunks of audio as they are computed. This is ideal for reducing the time to first audio for long inputs. You can also insert text dynamically while generation is running using streamInsert(text) and stop it with streamStop(instant), as sketched below.
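
Below is a minimal sketch of dynamic insertion during streaming. It assumes a tts object returned by the hook; playback of the chunks is omitted, and the parameter names follow the descriptions above.

// Start the stream without awaiting it, so more text can be fed in
// while generation is running. stopAutomatically: false is assumed to
// keep the stream open until streamStop is called.
tts.stream({
  text: 'First sentence.',
  speed: 1.0,
  stopAutomatically: false,
  onNext: (chunk) => {
    // Queue each Float32Array chunk for playback here.
  },
});

// Later, e.g. from another event handler, append more text to the
// running stream:
tts.streamInsert('A second sentence, added on the fly.');

// Finally, stop the stream; per the description above, the `instant`
// flag controls whether generation halts immediately.
tts.streamStop(false);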

Using Phonemes

If you have pre-computed phonemes (e.g., from an external dictionary or a custom G2P model), you can skip the internal phoneme generation step:

  1. forwardFromPhonemes({ phonemes, speed }): Generates the complete audio waveform from a phoneme string.
  2. streamFromPhonemes({ phonemes, speed, onNext, ... }): Streams audio chunks generated from a phoneme string.

note

Since forward and forwardFromPhonemes process the entire input at once, they might take a significant amount of time to produce audio for long inputs.

Example

Speech Synthesis

import React from 'react';
import { Button, View } from 'react-native';
import {
  useTextToSpeech,
  KOKORO_MEDIUM,
  KOKORO_VOICE_AF_HEART,
} from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';

export default function App() {
  const tts = useTextToSpeech({
    model: KOKORO_MEDIUM,
    voice: KOKORO_VOICE_AF_HEART,
  });

  const generateAudio = async () => {
    const audioData = await tts.forward({
      text: 'Hello world! This is a sample text.',
    });

    // Playback example
    const ctx = new AudioContext({ sampleRate: 24000 });
    const buffer = ctx.createBuffer(1, audioData.length, 24000);
    buffer.getChannelData(0).set(audioData);

    const source = ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(ctx.destination);
    source.start();
  };

  return (
    <View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
      <Button title="Speak" onPress={generateAudio} disabled={!tts.isReady} />
    </View>
  );
}

Streaming Synthesis

import React, { useRef } from 'react';
import { Button, View } from 'react-native';
import {
  useTextToSpeech,
  KOKORO_MEDIUM,
  KOKORO_VOICE_AF_HEART,
} from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';

export default function App() {
  const tts = useTextToSpeech({
    model: KOKORO_MEDIUM,
    voice: KOKORO_VOICE_AF_HEART,
  });

  const contextRef = useRef(new AudioContext({ sampleRate: 24000 }));

  const generateStream = async () => {
    const ctx = contextRef.current;

    await tts.stream({
      text: "This is a longer text, which is being streamed chunk by chunk. Let's see how it works!",
      onNext: async (chunk) => {
        return new Promise<void>((resolve) => {
          const buffer = ctx.createBuffer(1, chunk.length, 24000);
          buffer.getChannelData(0).set(chunk);

          const source = ctx.createBufferSource();
          source.buffer = buffer;
          source.connect(ctx.destination);
          source.onEnded = () => resolve();
          source.start();
        });
      },
    });
  };

  return (
    <View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
      <Button title="Stream" onPress={generateStream} disabled={!tts.isReady} />
    </View>
  );
}
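
In this example, onNext returns a promise that resolves only once the chunk has finished playing (via onEnded), which keeps consecutive chunks from playing over one another.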

Synthesis from Phonemes

If you already have a phoneme string obtained from an external source (e.g. the Python phonemizer library, espeak-ng, or any custom phonemizer), you can use forwardFromPhonemes or streamFromPhonemes to synthesize audio directly, skipping the phoneme generation stage.

import React from 'react';
import { Button, View } from 'react-native';
import {
  useTextToSpeech,
  KOKORO_MEDIUM,
  KOKORO_VOICE_AF_HEART,
} from 'react-native-executorch';

export default function App() {
  const tts = useTextToSpeech({
    model: KOKORO_MEDIUM,
    voice: KOKORO_VOICE_AF_HEART,
  });

  const synthesizePhonemes = async () => {
    // Example IPA phonemes for: "A man who doesn't trust himself can
    // never really trust anyone else."
    const audioData = await tts.forwardFromPhonemes({
      phonemes:
        'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
    });

    // ... process or play audioData ...
  };

  return (
    <View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
      <Button
        title="Synthesize Phonemes"
        onPress={synthesizePhonemes}
        disabled={!tts.isReady}
      />
    </View>
  );
}

Supported models

Model  | Language
------ | --------
Kokoro | English