LLMModule
TypeScript API implementation of the useLLM hook.
API Reference
- For a detailed API reference for LLMModule, see: LLMModule API Reference.
- For all LLM models available out-of-the-box in React Native ExecuTorch, see: LLM Models.
- For useful LLM utility functions, see: LLM Utility Functionalities.
High Level Overview
import { LLMModule, LLAMA3_2_1B_QLORA } from 'react-native-executorch';
// Creating an instance and loading the model
const llm = await LLMModule.fromModelName(
LLAMA3_2_1B_QLORA,
(progress) => console.log(progress),
(token) => console.log(token),
(messages) => console.log(messages)
);
// Running the model - returns the generated response
const response = await llm.sendMessage('Hello, World!');
console.log('Response:', response);
// Interrupting generation (only takes effect if called while sendMessage or generate is running)
llm.interrupt();
// Deleting the model from memory
llm.delete();
Methods
All methods of LLMModule are explained in detail here: LLMModule API Reference.
Loading the model
Use the static fromModelName factory method:
const llm = await LLMModule.fromModelName(
LLAMA3_2_3B, // model config constant
onDownloadProgress, // optional, progress 0–1
tokenCallback, // optional, called on every token
messageHistoryCallback // optional, called when generation finishes
);
The model config object contains modelSource, tokenizerSource, tokenizerConfigSource, and optional capabilities. Pass one of the built-in constants (e.g. LLAMA3_2_3B) or construct it manually.
This method returns a promise resolving to an LLMModule instance.
For more information on loading resources, take a look at the loading models page.
Listening for download progress
To subscribe to the download progress event, you can pass the onDownloadProgress callback as the second argument to fromModelName. This function is called whenever the download progress changes.
Running the model
To run the model, you can use the generate method. It lets you pass chat messages and receive a completion from the model, but it doesn't provide any message history management.
Alternatively, in a managed chat (see: Functional vs managed), you can use the sendMessage method. It accepts the user message and returns a promise that resolves to the generated response. Additionally, it calls messageHistoryCallback with the updated message history, containing both the user message and the model response.
If you need raw model access without any wrappers, you can use forward. It provides direct access to the model: the input string is passed straight into the model, and the generated response is returned. This can be useful for working with models that aren't fine-tuned for chat completions. If you're not sure about the implications of that (e.g. that you have to include special model tokens yourself), you're better off with sendMessage.
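As a sketch of the functional style with generate: you keep the conversation history yourself and pass it in full on every call. The Message shape below mirrors the role/content structure used elsewhere on this page; in an app, import the real Message type from the library.

```typescript
// Message shape, reproduced here for illustration only; in an app,
// import the real Message type from 'react-native-executorch'.
type Message = {
  role: 'system' | 'user' | 'assistant';
  content: string;
};

// Functional style: you own the conversation history yourself.
const chat: Message[] = [
  { role: 'system', content: 'Answer in one short sentence.' },
  { role: 'user', content: 'What is ExecuTorch?' },
];

// With a loaded instance you would run the model and append the reply:
//   const completion = await llm.generate(chat);
//   chat.push({ role: 'assistant', content: completion });
```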
Listening for generated tokens
To subscribe to the token generation event, pass the tokenCallback or messageHistoryCallback function to fromModelName. tokenCallback is called on every token and receives only the most recent token. messageHistoryCallback is called whenever the model finishes generation and receives the full message history, including the user's and the model's latest messages.
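For example, a tokenCallback can accumulate the streamed tokens into a single string (a minimal sketch; in a React app you would typically batch these into state updates instead):

```typescript
// Accumulate streamed tokens as they arrive.
let streamed = '';
const tokenCallback = (token: string) => {
  streamed += token;
};

// Pass tokenCallback as the token-callback argument to fromModelName,
// e.g. LLMModule.fromModelName(LLAMA3_2_1B_QLORA, undefined, tokenCallback).
```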
Interrupting the model
In order to interrupt the model, you can use the interrupt method.
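For example, you could combine interrupt with a timer to cap generation time. This is our own sketch, assuming an already loaded llm instance; only the sendMessage and interrupt methods are taken from the API above.

```typescript
// Minimal structural type covering the two methods this sketch needs.
type Interruptible = {
  sendMessage(message: string): Promise<string>;
  interrupt(): void;
};

// Stop generation if it runs longer than `ms` milliseconds. interrupt()
// only takes effect while sendMessage is running, which is exactly the
// window this timer covers.
function sendWithTimeout(
  llm: Interruptible,
  message: string,
  ms: number
): Promise<string> {
  const timer = setTimeout(() => llm.interrupt(), ms);
  return llm.sendMessage(message).finally(() => clearTimeout(timer));
}
```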
Token Batching
Depending on the selected model and the user's device, generation speed can exceed 60 tokens per second. If tokenCallback triggers re-renders and is invoked on every single token, it can significantly degrade the app's performance. To alleviate this, we've implemented token batching. To configure it, call the configure method and pass generationConfig. The next section lists what you can tweak with this config.
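For example (a sketch: the field names follow this page, and the time unit is assumed to be milliseconds; check the API reference for exact types):

```typescript
// Soft cap of ~16 tokens per batch, and at most 100 (assumed: milliseconds)
// between consecutive batches.
const generationConfig = {
  outputTokenBatchSize: 16,
  batchTimeInterval: 100,
};

// Applied via configure on a loaded instance:
//   await llm.configure({ generationConfig });
```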
Configuring the model
To configure the model (i.e. change the system prompt, load an initial conversation history, manage tool calling, or set generation settings), use the configure method. chatConfig and toolsConfig are only applied to managed chats, i.e. when using sendMessage (see: Functional vs managed). It accepts an object with the following fields:
- chatConfig - Object configuring chat management that contains:
  - systemPrompt - Often used to tell the model what its purpose is, for example: "Be a helpful translator".
  - initialMessageHistory - Object that represents the conversation history. This can be used to provide initial context to the model.
  - contextStrategy - Object implementing the ContextStrategy interface, used to manage conversation context, including trimming history if necessary. Custom strategies can be implemented, or one of the built-in options can be used (e.g. NoopContextStrategy, MessageCountContextStrategy, or the default SlidingWindowContextStrategy).
- toolsConfig - Object configuring options for enabling and managing tool use. It only has an effect if your model's chat template supports it. Contains the following properties:
  - tools - List of objects defining tools.
  - executeToolCallback - Function that accepts a ToolCall, executes the tool, and returns a string to the model.
  - displayToolCalls - If set to true, JSON tool calls will be displayed in the chat. If false, only answers will be displayed.
- generationConfig - Object configuring generation settings with the following properties:
  - outputTokenBatchSize - Soft upper limit on the number of tokens in each token batch (in certain cases a batch can contain more tokens, e.g. when the batch would otherwise end with a special emoji join character).
  - batchTimeInterval - Upper limit on the time interval between consecutive token batches.
  - temperature - Scales output logits by the inverse of the temperature; controls the randomness / creativity of text generation.
  - topp - Samples only from the smallest set of tokens whose cumulative probability exceeds topp.
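Putting the fields together, a configure call might look like this. This is a sketch: the field names follow the list above, the ToolCall shape is simplified for illustration, and the exact types live in the LLMModule API Reference.

```typescript
// Simplified ToolCall shape for illustration; see the API reference
// for the real type.
type ToolCall = { toolName: string; arguments: Record<string, unknown> };

const config = {
  chatConfig: {
    systemPrompt: 'Be a helpful translator.',
    initialMessageHistory: [
      { role: 'user', content: 'Hi!' },
      { role: 'assistant', content: 'Hello! What should I translate?' },
    ],
  },
  toolsConfig: {
    tools: [], // tool definitions go here
    executeToolCallback: async (call: ToolCall) =>
      `No implementation for ${call.toolName}`,
    displayToolCalls: false,
  },
  generationConfig: {
    temperature: 0.7,
    topp: 0.9,
  },
};

// Applied to a loaded instance (chatConfig and toolsConfig only affect
// managed chats, i.e. sendMessage):
//   await llm.configure(config);
```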
Vision-Language Models (VLM)
Some models support multimodal input — text and images together. To use them, pass capabilities in the model object when calling fromModelName:
import { LLMModule, LFM2_VL_1_6B_QUANTIZED } from 'react-native-executorch';
const llm = await LLMModule.fromModelName(
LFM2_VL_1_6B_QUANTIZED,
undefined,
(token) => console.log(token)
);
The capabilities field is already set on the model constant. You can also construct the model object explicitly:
const llm = await LLMModule.fromModelName({
modelName: 'lfm2.5-vl-1.6b-quantized',
modelSource: require('./path/to/model.pte'),
tokenizerSource: require('./path/to/tokenizer.json'),
tokenizerConfigSource: require('./path/to/tokenizer_config.json'),
capabilities: ['vision'],
});
Once loaded, pass imagePath to sendMessage:
const response = await llm.sendMessage('What is in this image?', {
imagePath: '/path/to/image.jpg',
});
Or use generate with mediaPath on the message:
const chat: Message[] = [
{
role: 'user',
content: 'Describe this image.',
mediaPath: '/path/to/image.jpg',
},
];
const response = await llm.generate(chat);
Using a custom model
Use fromCustomModel to load your own exported LLM instead of a built-in preset:
import { LLMModule } from 'react-native-executorch';
const llm = await LLMModule.fromCustomModel(
'https://example.com/model.pte',
'https://example.com/tokenizer.json',
'https://example.com/tokenizer_config.json',
(progress) => console.log(progress),
(token) => console.log(token),
(messages) => console.log(messages)
);
Required model contract
The .pte model binary must be exported following the ExecuTorch LLM export process. The native runner expects the standard ExecuTorch text-generation interface — KV-cache management, prefill/decode phases, and logit sampling are all handled by the runtime.
Deleting the model from memory
To delete the model from memory, you can use the delete method.