
useVAD

Voice Activity Detection (VAD) is the task of analyzing an audio signal to identify time segments containing human speech, separating them from non-speech sections like silence and background noise.

caution

It is recommended to use models provided by us, which are available at our Hugging Face repository. You can also use constants shipped with our library.

Reference

You can obtain the waveform from audio in whatever way suits you best; in the snippet below we use the react-native-audio-api library to process an .mp3 file.

import { useVAD, FSMN_VAD } from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';
import * as FileSystem from 'expo-file-system';

const model = useVAD({
  model: FSMN_VAD,
});

const { uri } = await FileSystem.downloadAsync(
  'https://some-audio-url.com/file.mp3',
  FileSystem.cacheDirectory + 'audio_file'
);

const audioContext = new AudioContext({ sampleRate: 16000 });
const decodedAudioData = await audioContext.decodeAudioDataSource(uri);
const audioBuffer = decodedAudioData.getChannelData(0);

try {
  // NOTE: to obtain segment boundaries in seconds, divide the
  // start / end of each segment by the sampling rate (16 kHz)
  const speechSegments = await model.forward(audioBuffer);
  console.log(speechSegments);
} catch (error) {
  console.error('Error while running the VAD model', error);
}

Arguments

model - Object containing the model source.

  • modelSource - A string that specifies the location of the model binary.

preventLoad? - Boolean that, when set to true, prevents the model from being loaded automatically (and the data from being downloaded on first use) after the hook runs. A deferred-loading sketch follows below.

For more information on loading resources, take a look at the loading models page.
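As a minimal sketch of deferred loading, the component below keeps preventLoad set to true until the user opts in. It assumes the hook starts loading once preventLoad becomes false on a subsequent render; the DeferredVADLoader component, the deferLoad state, and the button labels are illustrative, not part of the library.

import React, { useState } from 'react';
import { Button } from 'react-native';
import { useVAD, FSMN_VAD } from 'react-native-executorch';

export function DeferredVADLoader() {
  // Keep the model from loading (and downloading) until the user opts in.
  const [deferLoad, setDeferLoad] = useState(true);

  const model = useVAD({
    model: FSMN_VAD,
    preventLoad: deferLoad,
  });

  return (
    <Button
      title={model.isReady ? 'Model ready' : 'Load VAD model'}
      onPress={() => setDeferLoad(false)}
    />
  );
}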

Returns

| Field | Type | Description |
| --- | --- | --- |
| forward | (waveform: Float32Array) => Promise<Segment[]> | Executes the model's forward pass; the input array should be a waveform sampled at 16 kHz. Returns a promise that resolves to an array of Segment objects. |
| error | string \| null | Contains the error message if the model failed to load. |
| isGenerating | boolean | Indicates whether the model is currently processing an inference. |
| isReady | boolean | Indicates whether the model has successfully loaded and is ready for inference. |
| downloadProgress | number | Represents the download progress as a value between 0 and 1. |
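A short sketch of how the returned fields can drive the UI: surface error first, show downloadProgress until isReady, then reflect isGenerating. The VADStatus component and its labels are illustrative, not part of the library.

import React from 'react';
import { Text } from 'react-native';
import { useVAD, FSMN_VAD } from 'react-native-executorch';

export function VADStatus() {
  const model = useVAD({ model: FSMN_VAD });

  // Surface loading errors first.
  if (model.error) {
    return <Text>Failed to load VAD model: {model.error}</Text>;
  }
  // Show download progress until the model is ready.
  if (!model.isReady) {
    return <Text>Downloading model: {Math.round(model.downloadProgress * 100)}%</Text>;
  }
  return <Text>{model.isGenerating ? 'Running inference...' : 'Model ready'}</Text>;
}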

Type definitions

interface Segment {
  start: number;
  end: number;
}

Running the model

Before running the model's forward method, make sure to extract the audio waveform you want to process. You'll need to handle this step yourself, ensuring the audio is sampled at 16 kHz. Once you have the waveform, pass it as an argument to the forward method. The method returns a promise that resolves to an array of detected speech segments.

info

Timestamps in the returned speech segments correspond to indices of the input array (waveform).
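For example, converting a segment from sample indices to seconds is just a division by the 16 kHz sampling rate. The toSeconds helper below is illustrative, not part of the library.

// Segment is the interface shown in "Type definitions" above.
const SAMPLE_RATE = 16000;

const toSeconds = (segment: { start: number; end: number }) => ({
  start: segment.start / SAMPLE_RATE,
  end: segment.end / SAMPLE_RATE,
});

// e.g. { start: 48000, end: 96000 } -> { start: 3, end: 6 } (seconds)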

Example

import React from 'react';
import { Button, Text, SafeAreaView } from 'react-native';
import { useVAD, FSMN_VAD } from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';
import * as FileSystem from 'expo-file-system';

export default function App() {
  const model = useVAD({
    model: FSMN_VAD,
  });

  const audioURL = 'https://some-audio-url.com/file.mp3';

  const handleAudio = async () => {
    // The hook always returns an object, so check the readiness flag.
    if (!model.isReady) {
      console.error('VAD model is not loaded yet.');
      return;
    }

    console.log('Processing URL:', audioURL);

    try {
      const { uri } = await FileSystem.downloadAsync(
        audioURL,
        FileSystem.cacheDirectory + 'vad_example.tmp'
      );

      const audioContext = new AudioContext({ sampleRate: 16000 });
      const originalDecodedBuffer =
        await audioContext.decodeAudioDataSource(uri);
      const originalChannelData = originalDecodedBuffer.getChannelData(0);

      const segments = await model.forward(originalChannelData);
      if (segments.length === 0) {
        console.log('No speech segments were found.');
        return;
      }
      console.log(`Found ${segments.length} speech segments.`);

      const totalLength = segments.reduce(
        (sum, seg) => sum + (seg.end - seg.start),
        0
      );
      const newAudioBuffer = audioContext.createBuffer(
        1, // Mono
        totalLength,
        originalDecodedBuffer.sampleRate
      );
      const newChannelData = newAudioBuffer.getChannelData(0);

      // Copy each detected speech segment into the new buffer back to back.
      let offset = 0;
      for (const segment of segments) {
        const slice = originalChannelData.subarray(segment.start, segment.end);
        newChannelData.set(slice, offset);
        offset += slice.length;
      }

      // Play the processed audio
      const source = audioContext.createBufferSource();
      source.buffer = newAudioBuffer;
      source.connect(audioContext.destination);
      source.start();
    } catch (error) {
      console.error('Error processing audio data:', error);
    }
  };

  return (
    <SafeAreaView>
      <Text>
        Press the button to process and play speech from a sample file.
      </Text>
      <Button onPress={handleAudio} title="Run VAD Example" />
    </SafeAreaView>
  );
}

Supported models

  • FSMN_VAD

Benchmarks

Model size

| Model | XNNPACK [MB] |
| --- | --- |
| FSMN_VAD | 1.83 |

Memory usage

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| --- | --- | --- |
| FSMN_VAD | 97 | 45.9 |

Inference time

warning

Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.

Inference times were measured on a 60 s audio file, which can be found here.

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- |
| FSMN_VAD | 151 | 171 | 180 | 109 |