useLLM
React Native ExecuTorch supports a variety of LLMs (check out our HuggingFace repository for models already converted to the ExecuTorch format), including Llama 3.2. Before getting started, you'll need to obtain the .pte binary (a serialized model) along with the tokenizer and tokenizer config JSON files. There are various ways to accomplish this:
- For your convenience, we recommend using the models exported by us. You can get them from our HuggingFace repository, or use the constants shipped with our library.
- Follow the official tutorial made by the ExecuTorch team to build the model and tokenizer yourself.
Lower-end devices might not be able to fit LLMs into memory. We recommend using quantized models to reduce the memory footprint.
Given the computational constraints, our architecture is designed to support only one instance of the model runner at a time. Consequently, only one active component can use useLLM at any given moment.
Initializing
In order to load a model into the app, you need to run the following code:
import {
  useLLM,
  LLAMA3_2_1B,
  LLAMA3_2_TOKENIZER,
  LLAMA3_2_TOKENIZER_CONFIG,
} from 'react-native-executorch';

const llm = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizerSource: LLAMA3_2_TOKENIZER,
  tokenizerConfigSource: LLAMA3_2_TOKENIZER_CONFIG,
});
The code snippet above fetches the model from the specified URL, loads it into memory, and returns an object with various functions and properties for controlling the model. You can monitor the loading progress by checking the llm.downloadProgress and llm.isReady properties; if anything goes wrong, the llm.error property will contain the error message.
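For instance, you can gate the UI on these fields while the model is being fetched (a minimal sketch, assuming it lives inside a component that calls useLLM as shown above):

if (llm.error) {
  return <Text>Loading failed: {llm.error}</Text>;
}

if (!llm.isReady) {
  // downloadProgress is a value between 0 and 1
  return <Text>Downloading model: {Math.round(llm.downloadProgress * 100)}%</Text>;
}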
Arguments
modelSource - ResourceSource that specifies the location of the model binary. For more information, take a look at the loading models section.
tokenizerSource - ResourceSource pointing to the JSON file that contains the tokenizer.
tokenizerConfigSource - ResourceSource pointing to the JSON file that contains the tokenizer config.
preventLoad? - Boolean that prevents automatic model loading (and downloading of the data on first use) after running the hook.
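Because ResourceSource also accepts plain strings, you can point the hook at remotely hosted files instead of the bundled constants (a minimal sketch; the URLs below are placeholders, not real endpoints):

const llm = useLLM({
  // Placeholder URLs - swap in the actual locations of your exported files
  modelSource: 'https://example.com/llama3_2_1b.pte',
  tokenizerSource: 'https://example.com/tokenizer.json',
  tokenizerConfigSource: 'https://example.com/tokenizer_config.json',
  preventLoad: true, // defer downloading/loading until you decide to start it
});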
Returns
Field | Type | Description |
---|---|---|
generate() | (messages: Message[], tools?: LLMTool[]) => Promise<void> | Runs the model to complete the chat passed in the messages argument. It doesn't manage conversation context. |
interrupt() | () => void | Function to interrupt the current inference. |
response | string | State of the generated response. This field is updated with each token generated by the model. |
isReady | boolean | Indicates whether the model is ready. |
isGenerating | boolean | Indicates whether the model is currently generating a response. |
downloadProgress | number | Represents the download progress as a value between 0 and 1, indicating the extent of the model file retrieval. |
error | string | null | Contains the error message if the model failed to load. |
configure | ({ chatConfig?: Partial<ChatConfig>, toolsConfig?: ToolsConfig }) => void | Configures chat and tool calling. See more details in configuring the model. |
sendMessage | (message: string, tools?: LLMTool[]) => Promise<void> | Function to add a user message to the conversation. After the model responds, messageHistory will be updated with both the user message and the model response. |
deleteMessage | (index: number) => void | Deletes all messages starting from the message at the index position. After deletion, messageHistory will be updated. |
messageHistory | Message[] | History containing all messages in the conversation. This field is updated after the model responds to sendMessage. |


Type definitions
const useLLM: ({
  modelSource,
  tokenizerSource,
  tokenizerConfigSource,
  preventLoad = false,
}: {
  modelSource: ResourceSource;
  tokenizerSource: ResourceSource;
  tokenizerConfigSource: ResourceSource;
  preventLoad?: boolean;
}) => LLMType;

interface LLMType {
  messageHistory: Message[];
  response: string;
  isReady: boolean;
  isGenerating: boolean;
  downloadProgress: number;
  error: string | null;
  configure: ({
    chatConfig,
    toolsConfig,
  }: {
    chatConfig?: Partial<ChatConfig>;
    toolsConfig?: ToolsConfig;
  }) => void;
  generate: (messages: Message[], tools?: LLMTool[]) => Promise<void>;
  sendMessage: (message: string) => Promise<void>;
  deleteMessage: (index: number) => void;
  interrupt: () => void;
}

type ResourceSource = string | number | object;

type MessageRole = 'user' | 'assistant' | 'system';

interface Message {
  role: MessageRole;
  content: string;
}

interface ChatConfig {
  initialMessageHistory: Message[];
  contextWindowLength: number;
  systemPrompt: string;
}

// tool calling
interface ToolsConfig {
  tools: LLMTool[];
  executeToolCallback: (call: ToolCall) => Promise<string | null>;
  displayToolCalls?: boolean;
}

interface ToolCall {
  toolName: string;
  arguments: Object;
}

type LLMTool = Object;
Functional vs managed
You can use the functions returned from this hook in two ways:
- Functional/pure - we will not keep any state for you. You'll need to keep the conversation history and handle function calling yourself. Use generate (and rarely forward) and response. Note that you don't need to run configure to use them; it has no effect on these functions.
- Managed/stateful - we will manage the conversation state for you. Tool calls will be parsed and executed automatically after you pass the appropriate callbacks. See more at managed LLM chat.
Functional way
Simple generation
To perform chat completion, you can use the generate function. There is no return value; instead, the response value is updated with each token.
const llm = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizerSource: LLAMA3_2_TOKENIZER,
  tokenizerConfigSource: LLAMA3_2_TOKENIZER_CONFIG,
});

const handleGenerate = async () => {
  const chat = [
    { role: 'system', content: 'You are a helpful assistant' },
    { role: 'user', content: 'Hi!' },
    { role: 'assistant', content: 'Hi! How can I help you?' },
    { role: 'user', content: 'What is the meaning of life?' },
  ];

  // Chat completion
  await llm.generate(chat);

  console.log('Llama says:', llm.response);
};

return (
  <Text>{llm.response}</Text>
)
Interrupting the model
Sometimes, you might want to stop the model while it's generating. To do this, you can use interrupt(), which will halt the model and update the response one last time.
There are also cases when you need to check whether tokens are being generated, for example to conditionally render a stop button. We've made this easy with the isGenerating property.
If you try to unmount the component using this hook while generation is still in progress, it will result in a crash. You'll need to interrupt the model first and wait until isGenerating is set to false.
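A minimal sketch of both ideas, assuming a component that already holds the llm object and imports Text, View, and Button from react-native: the stop button is rendered only while tokens are being generated, and pressing it calls interrupt():

return (
  <View>
    <Text>{llm.response}</Text>
    {llm.isGenerating && (
      <Button title="Stop generating" onPress={() => llm.interrupt()} />
    )}
  </View>
)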
Tool calling
Sometimes the text processing capabilities of LLMs are not enough. That's when you may want to introduce tool calling (also called function calling). It allows the model to use external tools to perform its tasks. A tool can be any arbitrary function that you want your model to run: it may retrieve data from a third-party API, perform an action inside the app such as pressing buttons or filling forms, or use system APIs to interact with your phone (turning on the flashlight, adding events to your calendar, changing the volume, etc.).
const TOOL_DEFINITIONS: LLMTool[] = [
  {
    name: 'get_weather',
    description: 'Get/check weather in given location.',
    parameters: {
      type: 'dict',
      properties: {
        location: {
          type: 'string',
          description: 'Location where user wants to check weather',
        },
      },
      required: ['location'],
    },
  },
];

const llm = useLLM({
  modelSource: HAMMER2_1_1_5B,
  tokenizerSource: HAMMER2_1_1_5B_TOKENIZER,
  tokenizerConfigSource: HAMMER2_1_1_5B_TOKENIZER_CONFIG,
});

const handleGenerate = async () => {
  const chat = [
    {
      role: 'system',
      content: `You are a helpful assistant. Current time and date: ${new Date().toString()}`,
    },
    { role: 'user', content: `Hi, what's the weather like in Cracow right now?` },
  ];

  // Chat completion
  await llm.generate(chat, TOOL_DEFINITIONS);

  console.log('Hammer says:', llm.response);
  // Parse response and call functions accordingly
};
Managed LLM Chat
Configuring the model
To configure the model (i.e. change the system prompt, load an initial conversation history, or manage tool calling) you can use the configure function. It accepts an object with the following fields:

chatConfig - Object configuring chat management. It contains the following properties:

- systemPrompt - Often used to tell the model what its purpose is, for example: "Be a helpful translator".
- initialMessageHistory - An array of Message objects that represent the conversation history. This can be used to provide initial context to the model.
- contextWindowLength - The number of messages from the current conversation that the model will use to generate a response. The higher the number, the more context the model will have. Keep in mind that using larger context windows will result in longer inference time and higher memory usage.

toolsConfig - Object configuring options for enabling and managing tool use. It will only have an effect if your model's chat template supports it. It contains the following properties:

- tools - List of objects defining tools.
- executeToolCallback - Function that accepts a ToolCall, executes the tool, and returns a string to the model.
- displayToolCalls - If set to true, JSON tool calls will be displayed in the chat. If false, only answers will be displayed.
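For example, to set only the chat-related options (a minimal sketch; the prompt, history, and window length values are illustrative, and configure is called inside useEffect so it only runs once, as in the tool calling example below):

useEffect(() => {
  llm.configure({
    chatConfig: {
      systemPrompt: 'Be a helpful translator',
      initialMessageHistory: [
        { role: 'user', content: 'Hello!' },
        { role: 'assistant', content: 'Hello! What would you like me to translate?' },
      ],
      contextWindowLength: 6,
    },
  });
}, []);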
Sending a message
In order to send a message to the model, you can use the following code:
const llm = useLLM({
  modelSource: LLAMA3_2_1B,
  tokenizerSource: LLAMA3_2_TOKENIZER,
  tokenizerConfigSource: LLAMA3_2_TOKENIZER_CONFIG,
});

const send = async () => {
  const message = 'Hi, who are you?';
  await llm.sendMessage(message);
};
Accessing conversation history
Behind the scenes, tokens are generated one by one, and the response property is updated with each token as it's created. If you want to get the entire conversation, you can use the messageHistory field:
return (
  <View>
    {llm.messageHistory.map((message, index) => (
      <Text key={index}>{message.content}</Text>
    ))}
  </View>
)
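If you also want to let users remove part of the conversation (for example to edit a message and regenerate from that point), you can wire deleteMessage to the rendered messages; a minimal sketch:

const handleDelete = (index: number) => {
  // Deletes the message at `index` and every message after it;
  // messageHistory is updated once the deletion completes.
  llm.deleteMessage(index);
};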
Tool calling example
const TOOL_DEFINITIONS: LLMTool[] = [
  {
    name: 'get_weather',
    description: 'Get/check weather in given location.',
    parameters: {
      type: 'dict',
      properties: {
        location: {
          type: 'string',
          description: 'Location where user wants to check weather',
        },
      },
      required: ['location'],
    },
  },
];

const llm = useLLM({
  modelSource: HAMMER2_1_1_5B,
  tokenizerSource: HAMMER2_1_1_5B_TOKENIZER,
  tokenizerConfigSource: HAMMER2_1_1_5B_TOKENIZER_CONFIG,
});

useEffect(() => {
  llm.configure({
    chatConfig: {
      systemPrompt: `You are a helpful assistant. Current time and date: ${new Date().toString()}`,
    },
    toolsConfig: {
      tools: TOOL_DEFINITIONS,
      executeToolCallback: async (call) => {
        if (call.toolName === 'get_weather') {
          console.log('Checking weather!');
          // perform call to weather API
          // ...
          const mockResult = 'Weather is great!';
          return mockResult;
        }
        return null;
      },
      displayToolCalls: true,
    },
  });
}, []);

const send = async () => {
  const message = `Hi, what's the weather like in Cracow right now?`;
  await llm.sendMessage(message);
};
Available models
Model Family | Sizes | Quantized |
---|---|---|
Hammer 2.1 | 0.5B, 1.5B, 3B | ✅ |
Qwen 2.5 | 0.5B, 1.5B, 3B | ✅ |
Qwen 3 | 0.6B, 1.7B, 4B | ✅ |
Phi 4 Mini | 4B | ✅ |
SmolLM 2 | 135M, 360M, 1.7B | ✅ |
LLaMA 3.2 | 1B, 3B | ✅ |
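If memory is a concern (as recommended near the top of this page), you can switch to a quantized variant simply by swapping the model constant; a minimal sketch, assuming the quantized constants are exported under the names listed in the benchmarks below and that the SpinQuant variant uses the same Llama 3.2 tokenizer files:

import {
  useLLM,
  LLAMA3_2_1B_SPINQUANT, // quantized Llama 3.2 1B (see the model size table below)
  LLAMA3_2_TOKENIZER,
  LLAMA3_2_TOKENIZER_CONFIG,
} from 'react-native-executorch';

// The rest of the setup is identical to the full-precision model
const llm = useLLM({
  modelSource: LLAMA3_2_1B_SPINQUANT,
  tokenizerSource: LLAMA3_2_TOKENIZER,
  tokenizerConfigSource: LLAMA3_2_TOKENIZER_CONFIG,
});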
Benchmarks
Model size
Model | XNNPACK [GB] |
---|---|
LLAMA3_2_1B | 2.47 |
LLAMA3_2_1B_SPINQUANT | 1.14 |
LLAMA3_2_1B_QLORA | 1.18 |
LLAMA3_2_3B | 6.43 |
LLAMA3_2_3B_SPINQUANT | 2.55 |
LLAMA3_2_3B_QLORA | 2.65 |
Memory usage
Model | Android (XNNPACK) [GB] | iOS (XNNPACK) [GB] |
---|---|---|
LLAMA3_2_1B | 3.2 | 3.1 |
LLAMA3_2_1B_SPINQUANT | 1.9 | 2 |
LLAMA3_2_1B_QLORA | 2.2 | 2.5 |
LLAMA3_2_3B | 7.1 | 7.3 |
LLAMA3_2_3B_SPINQUANT | 3.7 | 3.8 |
LLAMA3_2_3B_QLORA | 4 | 4.1 |
Inference time
Model | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone 13 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] |
---|---|---|---|---|---|
LLAMA3_2_1B | 16.1 | 11.4 | ❌ | 15.6 | 19.3 |
LLAMA3_2_1B_SPINQUANT | 40.6 | 16.7 | 16.5 | 40.3 | 48.2 |
LLAMA3_2_1B_QLORA | 31.8 | 11.4 | 11.2 | 37.3 | 44.4 |
LLAMA3_2_3B | ❌ | ❌ | ❌ | ❌ | 7.1 |
LLAMA3_2_3B_SPINQUANT | 17.2 | 8.2 | ❌ | 16.2 | 19.4 |
LLAMA3_2_3B_QLORA | 14.5 | ❌ | ❌ | 14.8 | 18.1 |
❌ - Insufficient RAM.