Version: 0.7.x

useLLM

React Native ExecuTorch supports a variety of LLMs (check out our HuggingFace repository for models already converted to the ExecuTorch format), including Llama 3.2. Before getting started, you'll need to obtain the .pte binary (a serialized model) as well as the tokenizer and tokenizer config JSON files. There are various ways to accomplish this:

  • For your convenience, it's best to use the models exported by us; you can get them from our HuggingFace repository. You can also use the constants shipped with our library.
  • Follow the official tutorial from the ExecuTorch team to export an arbitrary LLM of your choice.
danger

Lower-end devices might not be able to fit LLMs into memory. We recommend using quantized models to reduce the memory footprint.

API Reference

Initializing

In order to load a model into the app, you need to run the following code:

import { useLLM, LLAMA3_2_1B } from 'react-native-executorch';

const llm = useLLM({ model: LLAMA3_2_1B });

The code snippet above fetches the model from the specified URL, loads it into memory, and returns an object with various functions and properties for controlling the model. You can monitor the loading progress by checking the llm.downloadProgress and llm.isReady properties, and if anything goes wrong, the llm.error property will contain the error message.
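
For example, a minimal sketch of gating the UI on the loading state could look like this (it assumes downloadProgress is reported as a fraction between 0 and 1; check the API reference for the exact semantics):

const llm = useLLM({ model: LLAMA3_2_1B });

if (llm.error) {
  return <Text>Something went wrong: {llm.error}</Text>;
}

if (!llm.isReady) {
  // Assumes downloadProgress is a 0-1 fraction.
  return <Text>{`Downloading model: ${Math.round(llm.downloadProgress * 100)}%`}</Text>;
}

return <Text>Model ready!</Text>;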

Arguments

useLLM takes an LLMProps object. For details on its fields, check the following resources:

  • For detailed information about useLLM arguments, check this section: useLLM arguments.
  • For more information on loading resources, take a look at the loading models page.
  • For the available LLM models, please check out the following list: LLM Models.

Returns

useLLM returns an LLMType object whose functions and properties are described in the sections below.

For complete details, see the LLMType API Reference.
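
As a quick orientation, the fields used throughout this page look roughly like this. This is an informal sketch, not the actual exported type; the exact signatures live in the LLMType API Reference:

// Informal sketch only - consult the LLMType API Reference for exact types.
interface LLMSketch {
  isReady: boolean;
  isGenerating: boolean;
  downloadProgress: number;
  error: string | null; // error message when loading fails (exact type: see API reference)
  response: string; // streamed text of the current generation
  messageHistory: Message[]; // conversation history in managed mode
  generate: (messages: Message[], tools?: LLMTool[]) => Promise<string>;
  sendMessage: (message: string) => Promise<void>; // return type assumed
  configure: (config: {
    chatConfig?: object;
    toolsConfig?: object;
    generationConfig?: object;
  }) => void; // return type assumed
  interrupt: () => void;
}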

Functional vs managed

You can use the functions returned from this hook in two ways:

  1. Functional/pure - we will not keep any state for you. You'll need to keep the conversation history and handle function calling yourself. Use generate and response. Note that you don't need to run configure to use these functions; furthermore, chatConfig and toolsConfig have no effect on them.

  2. Managed/stateful - we will manage the conversation state for you. Tool calls will be parsed and executed automatically once you pass the appropriate callbacks. See more at managed LLM chat.

Functional way

Simple generation

To perform chat completion you can use the generate function. The response value is updated with each token as it's generated, and the function returns a promise that resolves to the complete response when generation finishes.

const llm = useLLM({ model: LLAMA3_2_1B });

const handleGenerate = async () => {
  const chat: Message[] = [
    { role: 'system', content: 'You are a helpful assistant' },
    { role: 'user', content: 'Hi!' },
    { role: 'assistant', content: 'Hi! How can I help you?' },
    { role: 'user', content: 'What is the meaning of life?' },
  ];

  // Chat completion - returns the generated response
  const response = await llm.generate(chat);
  console.log('Complete response:', response);
};

return (
  <View>
    <Button onPress={handleGenerate} title="Generate!" />
    <Text>{llm.response}</Text>
  </View>
);

Interrupting the model

Sometimes, you might want to stop the model while it’s generating. To do this, you can use interrupt, which will halt the model and update the response one last time.

There are also cases when you need to check if tokens are being generated, such as to conditionally render a stop button. We’ve made this easy with the isGenerating property.

warning

If you try to unmount the component using this hook while generation is still in progress, it will result in a crash. You'll need to interrupt the model first and wait until isGenerating is set to false.
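
For example, you can swap the generate button for a stop button while tokens are streaming (a minimal sketch reusing the llm object and handleGenerate from the example above):

return (
  <View>
    {llm.isGenerating ? (
      <Button onPress={() => llm.interrupt()} title="Stop" />
    ) : (
      <Button onPress={handleGenerate} title="Generate!" />
    )}
    <Text>{llm.response}</Text>
  </View>
);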

Reasoning

Some models ship with a built-in "reasoning" or "thinking" mode, but this is model-specific, not a feature of our library. If the model you're using supports disabling reasoning, follow the instructions provided by the model authors. For example, Qwen 3 lets you disable reasoning by adding the /no_think suffix to your prompts (source).
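
For instance, with Qwen 3 the suffix can be appended directly to the prompt text (a small sketch; other model families may use different conventions):

const handleNoThink = async () => {
  const chat: Message[] = [
    // Qwen 3 only: /no_think disables the model's thinking mode.
    { role: 'user', content: 'Give me a one-sentence summary of quicksort. /no_think' },
  ];
  await llm.generate(chat);
};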

Tool calling

Sometimes the text processing capabilities of LLMs are not enough, and that's when you may want to introduce tool calling (also called function calling). It allows the model to use external tools to perform its tasks. A tool can be any arbitrary function that you want your model to run: it may retrieve data from a third-party API, perform an action inside the app such as pressing buttons or filling forms, or use system APIs to interact with your phone (turning on the flashlight, adding events to your calendar, changing the volume, etc.).

const TOOL_DEFINITIONS: LLMTool[] = [
  {
    name: 'get_weather',
    description: 'Get/check weather in given location.',
    parameters: {
      type: 'dict',
      properties: {
        location: {
          type: 'string',
          description: 'Location where user wants to check weather',
        },
      },
      required: ['location'],
    },
  },
];

const llm = useLLM({ model: HAMMER2_1_1_5B });

const handleGenerate = () => {
  const chat: Message[] = [
    {
      role: 'system',
      content: `You are a helpful assistant. Current time and date: ${new Date().toString()}`,
    },
    {
      role: 'user',
      content: `Hi, what's the weather like in Cracow right now?`,
    },
  ];

  // Chat completion
  llm.generate(chat, TOOL_DEFINITIONS);
};

useEffect(() => {
  // Parse response and call tools accordingly
  // ...
}, [llm.response]);

return (
  <View>
    <Button onPress={handleGenerate} title="Generate!" />
    <Text>{llm.response}</Text>
  </View>
);
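
The parsing step in the useEffect above is left to you and depends on the chat template of the model you're using. As a rough sketch, assuming the model emits tool calls as a JSON array of { name, arguments } objects (verify the exact format for your model), it could look like this:

useEffect(() => {
  if (llm.isGenerating) return;
  try {
    // Assumed format: [{"name": "get_weather", "arguments": {"location": "Cracow"}}]
    const calls = JSON.parse(llm.response.trim());
    for (const call of Array.isArray(calls) ? calls : []) {
      if (call.name === 'get_weather') {
        // perform call to weather API with call.arguments.location
      }
    }
  } catch {
    // Not a tool call - treat llm.response as a regular assistant message.
  }
}, [llm.response, llm.isGenerating]);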

Managed LLM Chat

Configuring the model

To configure the model (i.e. change the system prompt, load an initial conversation history, manage tool calling, or set generation settings), use the configure method. chatConfig and toolsConfig are only applied to managed chats, i.e. when using sendMessage (see: Functional vs managed). configure accepts an object with the following fields (a sketch follows the list below):

  • chatConfig - Object configuring chat management that contains:

    • systemPrompt - Often used to tell the model what its purpose is, for example - "Be a helpful translator".

    • initialMessageHistory - An array of Message objects representing the conversation history. This can be used to provide initial context to the model.

    • contextWindowLength - The number of messages from the current conversation that the model will use to generate a response. Keep in mind that larger context windows result in longer inference times and higher memory usage.

  • toolsConfig - Object configuring options for enabling and managing tool use. It only takes effect if your model's chat template supports it. It contains the following properties:

    • tools - List of objects defining tools.

    • executeToolCallback - Function that accepts a ToolCall, executes the tool, and returns a string back to the model.

    • displayToolCalls - If set to true, JSON tool calls will be displayed in the chat. If false, only answers will be displayed.

  • generationConfig - Object configuring generation settings with the following properties:

    • outputTokenBatchSize - Soft upper limit on the number of tokens in each token batch (in certain cases a batch can contain more tokens, e.g. when it would otherwise end in the middle of a special emoji join character).

    • batchTimeInterval - Upper limit on the time interval between consecutive token batches.

    • temperature - Scales the output logits by the inverse of the temperature. Controls the randomness/creativity of text generation.

    • topp - Samples only from the smallest set of tokens whose cumulative probability exceeds topp.
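
Putting this together, here is a minimal sketch of a configure call with chatConfig and generationConfig (all values are purely illustrative; toolsConfig is shown in the tool calling example further down):

useEffect(() => {
  llm.configure({
    chatConfig: {
      systemPrompt: 'Be a helpful translator',
      // Assumed to be an array of Message objects; see the API reference.
      initialMessageHistory: [
        { role: 'user', content: 'Hello!' },
        { role: 'assistant', content: 'Hi, what would you like to translate?' },
      ],
      contextWindowLength: 6,
    },
    generationConfig: {
      outputTokenBatchSize: 10,
      batchTimeInterval: 80, // milliseconds (see Token Batching below)
      temperature: 0.7,
      topp: 0.9,
    },
  });
}, []);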

Sending a message

In order to send a message to the model, one can use the following code:

const llm = useLLM({ model: LLAMA3_2_1B });

const send = () => {
  const message = 'Hi, who are you?';
  llm.sendMessage(message);
};

return <Button onPress={send} title="Generate!" />;

Accessing conversation history

Behind the scenes, tokens are generated one by one, and the response property is updated with each token as it's created. If you want the entire conversation, you can use the messageHistory field:

return (
  <View>
    {llm.messageHistory.map((message, index) => (
      <Text key={index}>{message.content}</Text>
    ))}
  </View>
);

Tool calling example

const TOOL_DEFINITIONS: LLMTool[] = [
  {
    name: 'get_weather',
    description: 'Get/check weather in given location.',
    parameters: {
      type: 'dict',
      properties: {
        location: {
          type: 'string',
          description: 'Location where user wants to check weather',
        },
      },
      required: ['location'],
    },
  },
];

const llm = useLLM({ model: HAMMER2_1_1_5B });

useEffect(() => {
  llm.configure({
    chatConfig: {
      systemPrompt: `You are a helpful assistant. Current time and date: ${new Date().toString()}`,
    },
    toolsConfig: {
      tools: TOOL_DEFINITIONS,
      executeToolCallback: async (call) => {
        if (call.toolName === 'get_weather') {
          console.log('Checking weather!');
          // perform call to weather API
          // ...
          const mockResults = 'Weather is great!';
          return mockResults;
        }
        return null;
      },
      displayToolCalls: true,
    },
  });
}, []);

const send = () => {
  const message = `Hi, what's the weather like in Cracow right now?`;
  llm.sendMessage(message);
};

return (
  <View>
    <Button onPress={send} title="Generate!" />
    <Text>{llm.response}</Text>
  </View>
);

Structured output example

import { Schema } from 'jsonschema';

const responseSchema: Schema = {
  properties: {
    username: {
      type: 'string',
      description: 'Name of user, that is asking a question.',
    },
    question: {
      type: 'string',
      description: 'Question that user asks.',
    },
    bid: {
      type: 'number',
      description: 'Amount of money, that user offers.',
    },
    currency: {
      type: 'string',
      description: 'Currency of offer.',
    },
  },
  required: ['username', 'bid'],
  type: 'object',
};

// alternatively use Zod
import * as z from 'zod/v4';

const responseSchemaWithZod = z.object({
  username: z
    .string()
    .meta({ description: 'Name of user, that is asking a question.' }),
  question: z.optional(
    z.string().meta({ description: 'Question that user asks.' })
  ),
  bid: z.number().meta({ description: 'Amount of money, that user offers.' }),
  currency: z.optional(z.string().meta({ description: 'Currency of offer.' })),
});

const llm = useLLM({ model: QWEN3_4B_QUANTIZED });

useEffect(() => {
  const formattingInstructions = getStructuredOutputPrompt(responseSchema);
  // alternatively pass schema defined with Zod
  // const formattingInstructions = getStructuredOutputPrompt(responseSchemaWithZod);

  // Some extra prompting to improve quality of response.
  const prompt = `Your goal is to parse user's messages and return them in JSON format. Don't respond to user. Simply return JSON with user's question parsed. \n${formattingInstructions}\n /no_think`;

  llm.configure({
    chatConfig: {
      systemPrompt: prompt,
    },
  });
}, []);

useEffect(() => {
  const lastMessage = llm.messageHistory.at(-1);
  if (!llm.isGenerating && lastMessage?.role === 'assistant') {
    try {
      const formattedOutput = fixAndValidateStructuredOutput(
        lastMessage.content,
        responseSchema
      );
      // Zod will allow you to correctly type output
      const formattedOutputWithZod = fixAndValidateStructuredOutput(
        lastMessage.content,
        responseSchemaWithZod
      );
      console.log('Formatted output:', formattedOutput, formattedOutputWithZod);
    } catch (e) {
      console.log(
        "Error parsing output and/or output doesn't match required schema!",
        e
      );
    }
  }
}, [llm.messageHistory, llm.isGenerating]);

const send = () => {
  const message = `I'm John. Is this product damaged? I can give you $100 for this.`;
  llm.sendMessage(message);
};

return (
  <View>
    <Button onPress={send} title="Generate!" />
    <Text>{llm.response}</Text>
  </View>
);

The response should include JSON:

{
  "username": "John",
  "question": "Is this product damaged?",
  "bid": 100,
  "currency": "USD"
}

Token Batching

Depending on the selected model and the user's device, generation speed can exceed 60 tokens per second. If the tokenCallback from LLMModule, which is used under the hood, triggers rerenders and is invoked on every single token, it can significantly degrade the app's performance. To alleviate this, we've implemented token batching. To configure it, call the configure method and pass generationConfig (see Configuring the model for the available options). outputTokenBatchSize and batchTimeInterval set the number of tokens collected before a batch is emitted and the maximum time interval between consecutive batches, respectively. A batch is emitted when either batchTimeInterval elapses since the last batch or outputTokenBatchSize tokens have been generated, which keeps the UI updating smoothly even if the model briefly lags during generation. The defaults are 10 tokens and 80 ms (roughly 12 batches per second).

Available models

Model Family    Sizes               Quantized
Hammer 2.1      0.5B, 1.5B, 3B
Qwen 2.5        0.5B, 1.5B, 3B
Qwen 3          0.6B, 1.7B, 4B
Phi 4 Mini      4B
SmolLM 2        135M, 360M, 1.7B
LLaMA 3.2       1B, 3B