useLLM
React Native ExecuTorch supports a variety of LLMs (check out our HuggingFace repository for models already converted to the ExecuTorch format), including Llama 3.2. Before getting started, you'll need to obtain the .pte binary (a serialized model) along with the tokenizer and tokenizer config JSON files. There are various ways to accomplish this:
- For your convenience, we recommend using the models exported by us. You can get them from our HuggingFace repository, or use the constants shipped with our library.
- Follow the official tutorial made by the ExecuTorch team to build the model and tokenizer yourself.
Lower-end devices might not be able to fit LLMs into memory. We recommend using quantized models to reduce the memory footprint.
Given the computational constraints, our architecture is designed to support only one instance of the model runner at a time. Consequently, only one active component can use useLLM at any given moment.
Initializing
In order to load a model into the app, you need to run the following code:
import {
  useLLM,
  LLAMA3_2_1B,
  LLAMA3_2_TOKENIZER,
  LLAMA3_2_TOKENIZER_CONFIG,
} from 'react-native-executorch';

const llm = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizerSource: LLAMA3_2_TOKENIZER,
  tokenizerConfigSource: LLAMA3_2_TOKENIZER_CONFIG,
});
The code snippet above fetches the model from the specified URL, loads it into memory, and returns an object with various functions and properties for controlling the model. You can monitor the loading progress by checking the llm.downloadProgress and llm.isReady properties; if anything goes wrong, the llm.error property will contain the error message.
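For instance, you can gate the UI on these fields while the model is being fetched (a minimal sketch, assuming it lives inside a component that calls useLLM as shown above):

if (llm.error) {
  return <Text>Loading failed: {llm.error}</Text>;
}

if (!llm.isReady) {
  // downloadProgress is a value between 0 and 1
  return <Text>Downloading model: {Math.round(llm.downloadProgress * 100)}%</Text>;
}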
Arguments
modelSource - ResourceSource that specifies the location of the model binary. For more information, take a look at the loading models section.
tokenizerSource - ResourceSource pointing to the JSON file that contains the tokenizer.
tokenizerConfigSource - ResourceSource pointing to the JSON file that contains the tokenizer config.
preventLoad? - Boolean that prevents automatic model loading (and downloading of the data on first use) after running the hook.
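Because ResourceSource also accepts plain strings, you can point the hook at remotely hosted files instead of the bundled constants (a minimal sketch; the URLs below are placeholders, not real endpoints):

const llm = useLLM({
  // Placeholder URLs - swap in the actual locations of your exported files
  modelSource: 'https://example.com/llama3_2_1b.pte',
  tokenizerSource: 'https://example.com/tokenizer.json',
  tokenizerConfigSource: 'https://example.com/tokenizer_config.json',
  preventLoad: true, // defer downloading/loading until you decide to start it
});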
Returns
Field | Type | Description |
---|---|---|
generate() | (messages: Message[], tools?: LLMTool[]) => Promise<void> | Runs the model to complete the chat passed in the messages argument. It doesn't manage conversation context. |
interrupt() | () => void | Function to interrupt the current inference. |
response | string | State of the generated response. This field is updated with each token generated by the model. |
isReady | boolean | Indicates whether the model is ready. |
isGenerating | boolean | Indicates whether the model is currently generating a response. |
downloadProgress | number | Represents the download progress as a value between 0 and 1, indicating the extent of the model file retrieval. |
error | string | null | Contains the error message if the model failed to load. |
configure | ({ chatConfig?: Partial<ChatConfig>, toolsConfig?: ToolsConfig }) => void | Configures chat and tool calling. See more details in configuring the model. |
sendMessage | (message: string, tools?: LLMTool[]) => Promise<void> | Function to add a user message to the conversation. After the model responds, messageHistory will be updated with both the user message and the model response. |
deleteMessage | (index: number) => void | Deletes all messages starting from the message at the index position. After deletion, messageHistory will be updated. |
messageHistory | Message[] | History containing all messages in the conversation. This field is updated after the model responds to sendMessage. |


Type definitions
const useLLM: ({
  modelSource,
  tokenizerSource,
  tokenizerConfigSource,
  preventLoad = false,
}: {
  modelSource: ResourceSource;
  tokenizerSource: ResourceSource;
  tokenizerConfigSource: ResourceSource;
  preventLoad?: boolean;
}) => LLMType;

interface LLMType {
  messageHistory: Message[];
  response: string;
  isReady: boolean;
  isGenerating: boolean;
  downloadProgress: number;
  error: string | null;
  configure: ({
    chatConfig,
    toolsConfig,
  }: {
    chatConfig?: Partial<ChatConfig>;
    toolsConfig?: ToolsConfig;
  }) => void;
  generate: (messages: Message[], tools?: LLMTool[]) => Promise<void>;
  sendMessage: (message: string) => Promise<void>;
  deleteMessage: (index: number) => void;
  interrupt: () => void;
}

type ResourceSource = string | number | object;

type MessageRole = 'user' | 'assistant' | 'system';

interface Message {
  role: MessageRole;
  content: string;
}

interface ChatConfig {
  initialMessageHistory: Message[];
  contextWindowLength: number;
  systemPrompt: string;
}

// tool calling
interface ToolsConfig {
  tools: LLMTool[];
  executeToolCallback: (call: ToolCall) => Promise<string | null>;
  displayToolCalls?: boolean;
}

interface ToolCall {
  toolName: string;
  arguments: Object;
}

type LLMTool = Object;
Functional vs managed
You can use the functions returned from this hook in two ways:
- Functional/pure - we will not keep any state for you. You'll need to keep the conversation history and handle function calling yourself. Use generate (and rarely forward) and response. Note that you don't need to run configure to use them; it has no effect on these functions.
- Managed/stateful - we will manage the conversation state for you. Tool calls will be parsed and executed automatically after you pass the appropriate callbacks. See more at managed LLM chat.
Functional way
Simple generation
To perform chat completion, you can use the generate function. There is no return value; instead, the response value is updated with each token.
const llm = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizerSource: LLAMA3_2_TOKENIZER,
  tokenizerConfigSource: LLAMA3_2_TOKENIZER_CONFIG,
});

const handleGenerate = async () => {
  const chat = [
    { role: 'system', content: 'You are a helpful assistant' },
    { role: 'user', content: 'Hi!' },
    { role: 'assistant', content: 'Hi! How can I help you?' },
    { role: 'user', content: 'What is the meaning of life?' },
  ];

  // Chat completion
  await llm.generate(chat);

  console.log('Llama says:', llm.response);
};

return (
  <Text>{llm.response}</Text>
)
Interrupting the model
Sometimes, you might want to stop the model while it's generating. To do this, you can use interrupt(), which will halt the model and update the response one last time.
There are also cases when you need to check whether tokens are being generated, for example to conditionally render a stop button. We've made this easy with the isGenerating property.
If you try to unmount the component using this hook while generation is still in progress, it will result in a crash. You'll need to interrupt the model first and wait until isGenerating is set to false.
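A minimal sketch of both ideas, assuming a component that already holds the llm object and imports Text, View, and Button from react-native: the stop button is rendered only while tokens are being generated, and pressing it calls interrupt():

return (
  <View>
    <Text>{llm.response}</Text>
    {llm.isGenerating && (
      <Button title="Stop generating" onPress={() => llm.interrupt()} />
    )}
  </View>
)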
Tool calling
Sometimes the text processing capabilities of LLMs are not enough. That's when you may want to introduce tool calling (also called function calling). It allows the model to use external tools to perform its tasks. A tool can be any arbitrary function that you want your model to run: it may retrieve data from a third-party API, perform an action inside the app such as pressing buttons or filling forms, or use system APIs to interact with your phone (turning on the flashlight, adding events to your calendar, changing the volume, etc.).
const TOOL_DEFINITIONS: LLMTool[] = [
  {
    name: 'get_weather',
    description: 'Get/check weather in given location.',
    parameters: {
      type: 'dict',
      properties: {
        location: {
          type: 'string',
          description: 'Location where user wants to check weather',
        },
      },
      required: ['location'],
    },
  },
];

const llm = useLLM({
  modelSource: HAMMER2_1_1_5B,
  tokenizerSource: HAMMER2_1_1_5B_TOKENIZER,
  tokenizerConfigSource: HAMMER2_1_1_5B_TOKENIZER_CONFIG,
});

const handleGenerate = async () => {
  const chat = [
    {
      role: 'system',
      content: `You are a helpful assistant. Current time and date: ${new Date().toString()}`,
    },
    { role: 'user', content: `Hi, what's the weather like in Cracow right now?` },
  ];

  // Chat completion
  await llm.generate(chat, TOOL_DEFINITIONS);

  console.log('Hammer says:', llm.response);
  // Parse response and call functions accordingly
};
Managed LLM Chat
Configuring the model
To configure the model (i.e. change the system prompt, load an initial conversation history, or manage tool calling) you can use the configure function. It accepts an object with the following fields:

chatConfig - Object configuring chat management. It contains the following properties:

- systemPrompt - Often used to tell the model what its purpose is, for example: "Be a helpful translator".
- initialMessageHistory - An array of Message objects that represent the conversation history. This can be used to provide initial context to the model.
- contextWindowLength - The number of messages from the current conversation that the model will use to generate a response. The higher the number, the more context the model will have. Keep in mind that using larger context windows will result in longer inference time and higher memory usage.

toolsConfig - Object configuring options for enabling and managing tool use. It will only have an effect if your model's chat template supports it. It contains the following properties:

- tools - List of objects defining tools.
- executeToolCallback - Function that accepts a ToolCall, executes the tool, and returns a string to the model.
- displayToolCalls - If set to true, JSON tool calls will be displayed in the chat. If false, only answers will be displayed.
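For example, to set only the chat-related options (a minimal sketch; the prompt, history, and window length values are illustrative, and configure is called inside useEffect so it only runs once, as in the tool calling example below):

useEffect(() => {
  llm.configure({
    chatConfig: {
      systemPrompt: 'Be a helpful translator',
      initialMessageHistory: [
        { role: 'user', content: 'Hello!' },
        { role: 'assistant', content: 'Hello! What would you like me to translate?' },
      ],
      contextWindowLength: 6,
    },
  });
}, []);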
Sending a message
In order to send a message to the model, you can use the following code:
const llm = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizerSource: LLAMA3_2_TOKENIZER,
  tokenizerConfigSource: LLAMA3_2_TOKENIZER_CONFIG,
});

const send = async () => {
  const message = 'Hi, who are you?';
  await llm.sendMessage(message);
};
Accessing conversation history
Behind the scenes, tokens are generated one by one, and the response property is updated with each token as it's created. If you want to get the entire conversation, you can use the messageHistory field:
return (
  <View>
    {llm.messageHistory.map((message, index) => (
      <Text key={index}>{message.content}</Text>
    ))}
  </View>
)
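If you also want to let users remove part of the conversation (for example to edit a message and regenerate from that point), you can wire deleteMessage to the rendered messages; a minimal sketch:

const handleDelete = (index: number) => {
  // Deletes the message at `index` and every message after it;
  // messageHistory is updated once the deletion completes.
  llm.deleteMessage(index);
};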
Tool calling example
const TOOL_DEFINITIONS: LLMTool[] = [
  {
    name: 'get_weather',
    description: 'Get/check weather in given location.',
    parameters: {
      type: 'dict',
      properties: {
        location: {
          type: 'string',
          description: 'Location where user wants to check weather',
        },
      },
      required: ['location'],
    },
  },
];

const llm = useLLM({
  modelSource: HAMMER2_1_1_5B,
  tokenizerSource: HAMMER2_1_1_5B_TOKENIZER,
  tokenizerConfigSource: HAMMER2_1_1_5B_TOKENIZER_CONFIG,
});

useEffect(() => {
  llm.configure({
    chatConfig: {
      systemPrompt: `You are a helpful assistant. Current time and date: ${new Date().toString()}`,
    },
    toolsConfig: {
      tools: TOOL_DEFINITIONS,
      executeToolCallback: async (call) => {
        if (call.toolName === 'get_weather') {
          console.log('Checking weather!');
          // perform call to weather API
          // ...
          const mockResult = 'Weather is great!';
          return mockResult;
        }
        return null;
      },
      displayToolCalls: true,
    },
  });
}, []);

const send = async () => {
  const message = `Hi, what's the weather like in Cracow right now?`;
  await llm.sendMessage(message);
};
Available models
Model Family | Sizes | Quantized |
---|---|---|
Hammer 2.1 | 0.5B, 1.5B, 3B | ✅ |
Qwen 2.5 | 0.5B, 1.5B, 3B | ✅ |
Qwen 3 | 0.6B, 1.7B, 4B | ✅ |
Phi 4 Mini | 4B | ✅ |
SmolLM 2 | 135M, 360M, 1.7B | ✅ |
LLaMA 3.2 | 1B, 3B | ✅ |
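If memory is a concern (as recommended near the top of this page), you can switch to a quantized variant simply by swapping the model constant; a minimal sketch, assuming the quantized constants are exported under the names listed in the benchmarks below and that the SpinQuant variant uses the same Llama 3.2 tokenizer files:

import {
  useLLM,
  LLAMA3_2_1B_SPINQUANT, // quantized Llama 3.2 1B (see the model size table below)
  LLAMA3_2_TOKENIZER,
  LLAMA3_2_TOKENIZER_CONFIG,
} from 'react-native-executorch';

// The rest of the setup is identical to the full-precision model
const llm = useLLM({
  modelSource: LLAMA3_2_1B_SPINQUANT,
  tokenizerSource: LLAMA3_2_TOKENIZER,
  tokenizerConfigSource: LLAMA3_2_TOKENIZER_CONFIG,
});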
Benchmarks
Model size
Model | XNNPACK [GB] |
---|---|
LLAMA3_2_1B | 2.47 |
LLAMA3_2_1B_SPINQUANT | 1.14 |
LLAMA3_2_1B_QLORA | 1.18 |
LLAMA3_2_3B | 6.43 |
LLAMA3_2_3B_SPINQUANT | 2.55 |
LLAMA3_2_3B_QLORA | 2.65 |
Memory usage
Model | Android (XNNPACK) [GB] | iOS (XNNPACK) [GB] |
---|---|---|
LLAMA3_2_1B | 3.2 | 3.1 |
LLAMA3_2_1B_SPINQUANT | 1.9 | 2 |
LLAMA3_2_1B_QLORA | 2.2 | 2.5 |
LLAMA3_2_3B | 7.1 | 7.3 |
LLAMA3_2_3B_SPINQUANT | 3.7 | 3.8 |
LLAMA3_2_3B_QLORA | 4 | 4.1 |
Inference time
Model | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone 13 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] |
---|---|---|---|---|---|
LLAMA3_2_1B | 16.1 | 11.4 | ❌ | 15.6 | 19.3 |
LLAMA3_2_1B_SPINQUANT | 40.6 | 16.7 | 16.5 | 40.3 | 48.2 |
LLAMA3_2_1B_QLORA | 31.8 | 11.4 | 11.2 | 37.3 | 44.4 |
LLAMA3_2_3B | ❌ | ❌ | ❌ | ❌ | 7.1 |
LLAMA3_2_3B_SPINQUANT | 17.2 | 8.2 | ❌ | 16.2 | 19.4 |
LLAMA3_2_3B_QLORA | 14.5 | ❌ | ❌ | 14.8 | 18.1 |
❌ - Insufficient RAM.