Running LLMs

React Native ExecuTorch supports Llama 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:

  • For convenience, we recommend using the models we have already exported. You can download them from our HuggingFace repository, or use the constants shipped with our library.
  • If you want to export the model yourself, you can use the Docker image we've prepared. To see how it works, check out exporting Llama.
  • Alternatively, follow the official tutorial from the ExecuTorch team to build the model and tokenizer yourself.

Initializing

In order to load a model into the app, you need to run the following code:

import { useLLM, LLAMA3_2_1B } from 'react-native-executorch';

const llama = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizer: require('../assets/tokenizer.bin'),
  contextWindowLength: 3,
});

The code snippet above fetches the model from the specified URL, loads it into memory, and returns an object with various methods and properties for controlling the model. You can monitor the loading progress through the llama.downloadProgress and llama.isReady properties, and if anything goes wrong, the llama.error property will contain the error message.
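For example, you can gate your UI on these properties while the model is being fetched. The component below is a minimal sketch (the ModelStatus name and the displayed copy are made up); it relies only on the modelSource, tokenizer, error, isReady, and downloadProgress fields described on this page:

import { Text } from 'react-native';
import { useLLM, LLAMA3_2_1B } from 'react-native-executorch';

function ModelStatus() {
  const llama = useLLM({
    modelSource: LLAMA3_2_1B,
    tokenizer: require('../assets/tokenizer.bin'),
  });

  if (llama.error) {
    // Something went wrong while fetching or loading the model.
    return <Text>Error: {llama.error}</Text>;
  }

  if (!llama.isReady) {
    // downloadProgress is a value between 0 and 1.
    return <Text>Downloading model: {Math.round(llama.downloadProgress * 100)}%</Text>;
  }

  return <Text>Model ready</Text>;
}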

danger

Lower-end devices might not be able to fit LLMs into memory. We recommend using quantized models to reduce the memory footprint.

caution

Given computational constraints, our architecture is designed to support only one instance of the model runner at a time. Consequently, you can have only one active component using useLLM at any given moment.

Arguments

modelSource - A string that specifies the location of the model binary. For more information, take a look at the loading models section.

tokenizer - URL to the binary file that contains the tokenizer.

contextWindowLength - The number of messages from the current conversation that the model will use to generate a response. The higher the number, the more context the model will have. Keep in mind that using larger context windows will result in longer inference time and higher memory usage.

systemPrompt - Often used to tell the model what its purpose is, for example: "Be a helpful translator".
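Putting the arguments together, a fully configured hook call might look like the sketch below (the system prompt and context window length are just example values):

import { useLLM, LLAMA3_2_1B } from 'react-native-executorch';

const llama = useLLM({
  modelSource: LLAMA3_2_1B,                      // location of the model binary
  tokenizer: require('../assets/tokenizer.bin'), // tokenizer binary
  contextWindowLength: 6,                        // keep the last 6 messages as context
  systemPrompt: 'Be a helpful translator',       // tells the model its purpose
});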

Returns

Field - Type - Description

generate - (input: string) => Promise<void> - Starts generating a response for the given input string.
response - string - The generated response so far. This field is updated with each token produced by the model.
error - string | null - Contains the error message if the model failed to load.
isGenerating - boolean - Indicates whether the model is currently generating a response.
interrupt - () => void - Interrupts the current inference.
isReady - boolean - Indicates whether the model is ready.
downloadProgress - number - Download progress as a value between 0 and 1, indicating how much of the model file has been retrieved.

Sending a message

To send a message to the model, use the following code:

const llama = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizer: require('../assets/tokenizer.bin'),
});

...
const message = 'Hi, who are you?';
await llama.generate(message);
...
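In a real screen you would typically wrap this call in a handler, for example a button's onPress callback. The sketch below is only an illustration (sendMessage is a made-up helper, not part of the library); it awaits generate and skips new requests while a previous response is still being produced:

// Hypothetical helper around llama.generate - not part of the library API.
const sendMessage = async (text: string) => {
  // Only one generation can run at a time, so ignore requests while one is in flight.
  if (llama.isGenerating) {
    return;
  }
  await llama.generate(text);
  // Once the promise resolves, llama.response holds the full generated answer.
};

sendMessage('Hi, who are you?');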

Listening for the response

As you might've noticed, the generate function doesn't resolve with a return value. Instead, the .response field of the model is updated with each token. This is how you can render the model's response:

...
return (
  <Text>{llama.response}</Text>
);

Behind the scenes, tokens are generated one by one, and the response property is updated with each token as it’s created. This means that the text component will re-render whenever llama.response gets updated.

Sometimes, you might want to stop the model while it’s generating. To do this, you can use interrupt(), which will halt the model and append the current response to its internal conversation state.

There are also cases when you need to check whether tokens are being generated, for example to conditionally render a stop button. The isGenerating flag makes this easy.
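For instance, a stop button can be shown only while a response is streaming. The snippet below is an illustrative sketch; it relies solely on the response, isGenerating, and interrupt fields documented in the Returns section:

import { Button, Text, View } from 'react-native';

...
// Assumes `llama` comes from useLLM as shown earlier on this page.
return (
  <View>
    <Text>{llama.response}</Text>
    {llama.isGenerating && (
      // Render a stop button only while tokens are being generated.
      <Button title="Stop" onPress={() => llama.interrupt()} />
    )}
  </View>
);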