Running LLMs
React Native ExecuTorch supports Llama 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:
- For your convenience, it's best to use the models we've already exported; you can get them from our HuggingFace repository. You can also use the constants shipped with our library.
- If you want to export the model yourself, you can use a Docker image that we've prepared. To see how it works, check out exporting Llama.
- Follow the official tutorial made by the ExecuTorch team to build the model and tokenizer yourself.
Initializing
In order to load a model into the app, you need to run the following code:
import { useLLM, LLAMA3_2_1B } from 'react-native-executorch';
const llama = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizer: require('../assets/tokenizer.bin'),
  contextWindowLength: 3,
});
The code snippet above fetches the model from the specified URL, loads it into memory, and returns an object with various methods and properties for controlling the model. You can monitor the loading progress by checking the llama.downloadProgress and llama.isReady properties, and if anything goes wrong, the llama.error property will contain the error message.
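For example, a minimal sketch of a component that gates its UI on these values could look like this (the component name and messages are illustrative):

import React from 'react';
import { Text } from 'react-native';
import { useLLM, LLAMA3_2_1B } from 'react-native-executorch';

function Chat() {
  const llama = useLLM({
    modelSource: LLAMA3_2_1B,
    tokenizer: require('../assets/tokenizer.bin'),
  });

  // Surface a loading error, if any
  if (llama.error) {
    return <Text>Something went wrong: {llama.error}</Text>;
  }

  // Show download progress until the model is ready
  if (!llama.isReady) {
    return <Text>Loading model... {Math.round(llama.downloadProgress * 100)}%</Text>;
  }

  return <Text>{llama.response}</Text>;
}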
Lower-end devices might not be able to fit LLMs into memory. We recommend using quantized models to reduce the memory footprint.
Given computational constraints, our architecture is designed to support only one instance of the model runner at a time. This means you can have only one active component using useLLM concurrently.
Arguments
modelSource
- A string that specifies the location of the model binary. For more information, take a look at the loading models section.
tokenizer
- URL to the binary file that contains the tokenizer.
contextWindowLength
- The number of messages from the current conversation that the model will use to generate a response. The higher the number, the more context the model will have. Keep in mind that using larger context windows will result in longer inference time and higher memory usage.
systemPrompt
- Often used to tell the model what its purpose is, for example: "Be a helpful translator". A call passing all of these arguments is shown below.
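Here is a minimal sketch of such a call (the context window length and system prompt values are illustrative):

import { useLLM, LLAMA3_2_1B } from 'react-native-executorch';

const llama = useLLM({
  // Model binary: a constant shipped with the library or a URL string
  modelSource: LLAMA3_2_1B,
  // Tokenizer binary matching the model
  tokenizer: require('../assets/tokenizer.bin'),
  // Use the last 6 messages of the conversation as context
  contextWindowLength: 6,
  // Tell the model what its purpose is
  systemPrompt: 'Be a helpful translator',
});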
Returns
| Field | Type | Description |
| --- | --- | --- |
| generate | (input: string) => Promise<void> | Function to start generating a response with the given input string. |
| response | string | State of the generated response. This field is updated with each token generated by the model. |
| error | string \| null | Contains the error message if the model failed to load. |
| isGenerating | boolean | Indicates whether the model is currently generating a response. |
| interrupt | () => void | Function to interrupt the current inference. |
| isReady | boolean | Indicates whether the model is ready. |
| downloadProgress | number | Represents the download progress as a value between 0 and 1, indicating the extent of the model file retrieval. |
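Put together, the object returned by useLLM has roughly the following shape (an approximation based on the table above; the interface name is made up for illustration, and the library may export its own type):

// Approximate shape of the object returned by useLLM, derived from the table above.
interface LLMHookResult {
  generate: (input: string) => Promise<void>;
  response: string;
  error: string | null;
  isGenerating: boolean;
  interrupt: () => void;
  isReady: boolean;
  downloadProgress: number; // between 0 and 1
}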
Sending a message
In order to send a message to the model, one can use the following code:
const llama = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizer: require('../assets/tokenizer.bin'),
});
...
const message = 'Hi, who are you?';
await llama.generate(message);
...
Listening for the response
As you might've noticed, there is no return value from the generate function. Instead, the .response field of the model is updated with each token.
This is how you can render the response of the model:
...
return (
  <Text>{llama.response}</Text>
);
Behind the scenes, tokens are generated one by one, and the response property is updated with each token as it’s created. This means that the text component will re-render whenever llama.response gets updated.
Sometimes, you might want to stop the model while it's generating. To do this, you can use interrupt(), which will halt the model and append the current response to its internal conversation state.
There are also cases when you need to check whether tokens are being generated, for example to conditionally render a stop button. We've made this easy with the isGenerating property.
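For instance, a minimal sketch of a component that renders the streamed response and shows a stop button only while generating could look like this (the component name and button label are illustrative):

import React from 'react';
import { Button, Text, View } from 'react-native';
import { useLLM, LLAMA3_2_1B } from 'react-native-executorch';

function ChatScreen() {
  const llama = useLLM({
    modelSource: LLAMA3_2_1B,
    tokenizer: require('../assets/tokenizer.bin'),
  });

  return (
    <View>
      {/* Streamed response, updated token by token */}
      <Text>{llama.response}</Text>
      {/* Show a stop button only while tokens are being generated */}
      {llama.isGenerating && (
        <Button title="Stop" onPress={llama.interrupt} />
      )}
    </View>
  );
}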