Version: 0.1.x

Running LLMs

React Native ExecuTorch supports Llama 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:

  • For your convenience, it's best to use the models exported by us; you can get them from our Hugging Face repository. You can also use constants shipped with our library.
  • If you want to export the model yourself, you can use a Docker image that we've prepared. To see how it works, check out exporting Llama.
  • Follow the official tutorial made by the ExecuTorch team to build the model and tokenizer yourself.

Initializing

In order to load a model into the app, you need to run the following code:

import { useLLM, LLAMA3_2_1B_URL } from 'react-native-executorch';

const llama = useLLM({
  modelSource: LLAMA3_2_1B_URL,
  tokenizer: require('../assets/tokenizer.bin'),
  contextWindowLength: 3,
});

The code snippet above fetches the model from the specified URL, loads it into memory, and returns an object with various methods and properties for controlling the model. You can monitor the loading progress by checking the llama.downloadProgress and llama.isModelReady properties, and if anything goes wrong, the llama.error property will contain the error message.
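For example, you could gate your UI on these properties while the model is being fetched. A minimal sketch (the component and the copy are illustrative, not part of the library):

import { Text } from 'react-native';
import { useLLM, LLAMA3_2_1B_URL } from 'react-native-executorch';

// Illustrative component: reacts to the loading state exposed by the hook.
function ModelStatus() {
  const llama = useLLM({
    modelSource: LLAMA3_2_1B_URL,
    tokenizer: require('../assets/tokenizer.bin'),
  });

  if (llama.error) {
    return <Text>Something went wrong: {llama.error}</Text>;
  }

  if (!llama.isModelReady) {
    // downloadProgress ranges from 0 to 1 while the model file is being fetched.
    return <Text>Downloading: {Math.round(llama.downloadProgress * 100)}%</Text>;
  }

  return <Text>Model is ready.</Text>;
}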

Danger

Lower-end devices might not be able to fit LLMs into memory. We recommend using quantized models to reduce the memory footprint.

Caution

Given computational constraints, our architecture is designed to support only one instance of the model runner at a time. Consequently, you can have only one active component leveraging useLLM concurrently.
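If several parts of your app need access to the model, one way to respect this constraint is to call useLLM once in a common parent and pass the returned object down. A minimal sketch under that assumption (the component names are illustrative):

import { Text } from 'react-native';
import { useLLM, LLAMA3_2_1B_URL } from 'react-native-executorch';

// Hypothetical child component: receives the model object via props
// instead of calling useLLM itself.
function ChatScreen({ llama }: { llama: ReturnType<typeof useLLM> }) {
  return <Text>{llama.response}</Text>;
}

// Call the hook once, in a common parent, and hand the result down.
function App() {
  const llama = useLLM({
    modelSource: LLAMA3_2_1B_URL,
    tokenizer: require('../assets/tokenizer.bin'),
  });

  return <ChatScreen llama={llama} />;
}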

Arguments

modelSource - A string that specifies the location of the model binary. For more information, take a look at the loading models section.

tokenizer - URL to the binary file which contains the tokenizer

contextWindowLength - The number of messages from the current conversation that the model will use to generate a response. The higher the number, the more context the model will have. Keep in mind that using larger context windows will result in longer inference time and higher memory usage.

systemPrompt - Often used to tell the model what its purpose is, for example: "Be a helpful translator".
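Putting the arguments together, a configuration with a custom system prompt could look like this (the prompt text and context length are just examples):

import { useLLM, LLAMA3_2_1B_URL } from 'react-native-executorch';

const llama = useLLM({
  modelSource: LLAMA3_2_1B_URL,
  tokenizer: require('../assets/tokenizer.bin'),
  // Keep the last 6 messages of the conversation as context.
  contextWindowLength: 6,
  // Steer the model's behavior for the whole conversation.
  systemPrompt: 'Be a helpful translator',
});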

Returns

Field | Type | Description
----- | ---- | -----------
generate | (input: string) => Promise<void> | Function to start generating a response with the given input string.
response | string | State of the generated response. This field is updated with each token generated by the model.
error | string \| null | Contains the error message if the model failed to load.
isModelGenerating | boolean | Indicates whether the model is currently generating a response.
interrupt | () => void | Function to interrupt the current inference.
isModelReady | boolean | Indicates whether the model is ready.
downloadProgress | number | Represents the download progress as a value between 0 and 1, indicating the extent of the model file retrieval.

Loading models

There are three different methods available for loading the model and tokenizer files, depending on their size and location.

1. Load from React-Native assets folder (For Files < 512MB)

modelSource: require('../assets/llama3_2.pte');

2. Load from Remote URL:

For files larger than 512MB, or when you want to keep the size of the app smaller, you can load the model from a remote URL (e.g. Hugging Face).

modelSource: 'https://.../llama3_2.pte';

3. Load from local file system:

If you prefer to delegate the process of obtaining and loading model and tokenizer files to the user, you can use the following method:

modelSource: 'file:///var/mobile/.../llama3_2.pte';
Info

The downloaded files are stored in the documents directory of your application.
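As a sketch of that flow, you could let the user select the .pte file with a document picker (react-native-document-picker is used here as an assumption; any picker that yields a file:// URI works) and mount the hook only once a URI is available:

import { useState } from 'react';
import { Button } from 'react-native';
import DocumentPicker from 'react-native-document-picker';
import { useLLM } from 'react-native-executorch';

// Hypothetical child component: created only after the user has picked a model file.
function Chat({ modelUri }: { modelUri: string }) {
  const llama = useLLM({
    modelSource: modelUri,
    tokenizer: require('../assets/tokenizer.bin'),
  });
  // ... render the conversation here
  return null;
}

function App() {
  const [modelUri, setModelUri] = useState<string | null>(null);

  const pickModel = async () => {
    // pickSingle resolves with metadata for the selected file, including its URI.
    const file = await DocumentPicker.pickSingle();
    setModelUri(file.uri);
  };

  return modelUri ? (
    <Chat modelUri={modelUri} />
  ) : (
    <Button title="Pick llama3_2.pte" onPress={pickModel} />
  );
}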

Sending a message

In order to send a message to the model, one can use the following code:

const llama = useLLM({
  modelSource: LLAMA3_2_1B_URL,
  tokenizer: require('../assets/tokenizer.bin'),
});

...
const message = 'Hi, who are you?';
await llama.generate(message);
...

Listening for the response

As you might've noticed, there is no return value from the generate function. Instead, the .response field of the model is updated with each token. This is how you can render the response of the model:

...
return (
  <Text>{llama.response}</Text>
);

Behind the scenes, tokens are generated one by one, and the response property is updated with each token as it’s created. This means that the text component will re-render whenever llama.response gets updated.

Sometimes, you might want to stop the model while it’s generating. To do this, you can use interrupt(), which will halt the model and append the current response to its internal conversation state.

There are also cases when you need to check if tokens are being generated, such as to conditionally render a stop button. We've made this easy with the isModelGenerating property.
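For example, a stop button can be rendered only while a response is being produced. A minimal sketch (the surrounding component is illustrative):

import { Button, Text, View } from 'react-native';
import { useLLM, LLAMA3_2_1B_URL } from 'react-native-executorch';

// Illustrative component: streams the response and shows a stop button
// while generation is in progress.
function Chat() {
  const llama = useLLM({
    modelSource: LLAMA3_2_1B_URL,
    tokenizer: require('../assets/tokenizer.bin'),
  });

  return (
    <View>
      <Text>{llama.response}</Text>
      {llama.isModelGenerating && (
        <Button title="Stop" onPress={() => llama.interrupt()} />
      )}
    </View>
  );
}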