LLMModule
TypeScript API implementation of the useLLM hook.
API Reference
- For a detailed API reference for LLMModule, see: LLMModule API Reference.
- For all LLM models available out-of-the-box in React Native ExecuTorch, see: LLM Models.
- For useful LLM utility functions, see: LLM Utility Functionalities.
High Level Overview
import { LLMModule, LLAMA3_2_1B_QLORA } from 'react-native-executorch';
// Creating an instance and loading the model
const llm = await LLMModule.fromModelName(
LLAMA3_2_1B_QLORA,
(progress) => console.log(progress),
(token) => console.log(token),
(messages) => console.log(messages)
);
// Running the model - returns the generated response
const response = await llm.sendMessage('Hello, World!');
console.log('Response:', response);
// Interrupting generation (only takes effect if called while sendMessage or generate is running)
llm.interrupt();
// Deleting the model from memory
llm.delete();
Methods
All methods of LLMModule are explained in detail here: LLMModule API Reference.
Loading the model
Use the static fromModelName factory method:
const llm = await LLMModule.fromModelName(
LLAMA3_2_3B, // model config constant
onDownloadProgress, // optional, progress 0–1
tokenCallback, // optional, called on every token
messageHistoryCallback // optional, called when generation finishes
);
The model config object contains modelSource, tokenizerSource, tokenizerConfigSource, and optional capabilities. Pass one of the built-in constants (e.g. LLAMA3_2_3B) or construct it manually.
This method returns a promise resolving to an LLMModule instance.
For more information on loading resources, take a look at the loading models page.
Listening for download progress
To subscribe to the download progress event, you can pass the onDownloadProgress callback as the second argument to fromModelName. This function is called whenever the download progress changes.
Running the model
To run the model, you can use the generate method. It lets you pass chat messages and receive a completion from the model, but it doesn't provide any message history management.
Alternatively, in a managed chat (see: Functional vs managed), you can use the sendMessage method. It accepts the user message and returns a promise that resolves to the generated response. Additionally, it calls messageHistoryCallback with the updated message history, containing both the user message and the model response.
If you need raw model access without any wrappers, you can use forward. It provides direct access to the model: the input string is passed straight into the model, and the generated response is returned. This can be useful for working with models that aren't fine-tuned for chat completions. If you're not sure about the implications of that (e.g. that you have to include special model tokens yourself), you're better off with sendMessage.
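As a sketch of the functional style with generate: you keep the conversation history yourself and pass it in full on every call. The Message shape below mirrors the role/content structure used elsewhere on this page; in an app, import the real Message type from the library.

```typescript
// Message shape, reproduced here for illustration only; in an app,
// import the real Message type from 'react-native-executorch'.
type Message = {
  role: 'system' | 'user' | 'assistant';
  content: string;
};

// Functional style: you own the conversation history yourself.
const chat: Message[] = [
  { role: 'system', content: 'Answer in one short sentence.' },
  { role: 'user', content: 'What is ExecuTorch?' },
];

// With a loaded instance you would run the model and append the reply:
//   const completion = await llm.generate(chat);
//   chat.push({ role: 'assistant', content: completion });
```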
Listening for generated tokens
To subscribe to the token generation event, pass the tokenCallback or messageHistoryCallback function to fromModelName. tokenCallback is called on every token and receives only the most recent token. messageHistoryCallback is called whenever the model finishes generation and receives the full message history, including the user's and the model's latest messages.
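For example, a tokenCallback can accumulate the streamed tokens into a single string (a minimal sketch; in a React app you would typically batch these into state updates instead):

```typescript
// Accumulate streamed tokens as they arrive.
let streamed = '';
const tokenCallback = (token: string) => {
  streamed += token;
};

// Pass tokenCallback as the token-callback argument to fromModelName,
// e.g. LLMModule.fromModelName(LLAMA3_2_1B_QLORA, undefined, tokenCallback).
```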
Interrupting the model
In order to interrupt the model, you can use the interrupt method.
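For example, you could combine interrupt with a timer to cap generation time. This is our own sketch, assuming an already loaded llm instance; only the sendMessage and interrupt methods are taken from the API above.

```typescript
// Minimal structural type covering the two methods this sketch needs.
type Interruptible = {
  sendMessage(message: string): Promise<string>;
  interrupt(): void;
};

// Stop generation if it runs longer than `ms` milliseconds. interrupt()
// only takes effect while sendMessage is running, which is exactly the
// window this timer covers.
function sendWithTimeout(
  llm: Interruptible,
  message: string,
  ms: number
): Promise<string> {
  const timer = setTimeout(() => llm.interrupt(), ms);
  return llm.sendMessage(message).finally(() => clearTimeout(timer));
}
```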
Token Batching
Depending on the selected model and the user's device, generation speed can exceed 60 tokens per second. If tokenCallback triggers re-renders and is invoked on every single token, it can significantly degrade the app's performance. To alleviate this, we've implemented token batching. To configure it, call the configure method and pass generationConfig. The next section lists what you can tweak with this config.
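For example (a sketch: the field names follow this page, and the time unit is assumed to be milliseconds; check the API reference for exact types):

```typescript
// Soft cap of ~16 tokens per batch, and at most 100 (assumed: milliseconds)
// between consecutive batches.
const generationConfig = {
  outputTokenBatchSize: 16,
  batchTimeInterval: 100,
};

// Applied via configure on a loaded instance:
//   await llm.configure({ generationConfig });
```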
Configuring the model
To configure the model (i.e. change the system prompt, load an initial conversation history, manage tool calling, or set generation settings), use the configure method. chatConfig and toolsConfig are only applied to managed chats, i.e. when using sendMessage (see: Functional vs managed). It accepts an object with the following fields:
- chatConfig - Object configuring chat management that contains:
  - systemPrompt - Often used to tell the model what its purpose is, for example: "Be a helpful translator".
  - initialMessageHistory - Object that represents the conversation history. This can be used to provide initial context to the model.
  - contextStrategy - Object implementing the ContextStrategy interface, used to manage conversation context, including trimming history if necessary. Custom strategies can be implemented, or one of the built-in options can be used (e.g. NoopContextStrategy, MessageCountContextStrategy, or the default SlidingWindowContextStrategy).
- toolsConfig - Object configuring options for enabling and managing tool use. It only has an effect if your model's chat template supports it. Contains the following properties:
  - tools - List of objects defining tools.
  - executeToolCallback - Function that accepts a ToolCall, executes the tool, and returns a string to the model.
  - displayToolCalls - If set to true, JSON tool calls will be displayed in the chat. If false, only answers will be displayed.
- generationConfig - Object configuring generation settings with the following properties:
  - outputTokenBatchSize - Soft upper limit on the number of tokens in each token batch (in certain cases a batch can contain more tokens, e.g. when the batch would otherwise end with a special emoji join character).
  - batchTimeInterval - Upper limit on the time interval between consecutive token batches.
  - temperature - Scales output logits by the inverse of the temperature; controls the randomness / creativity of text generation.
  - topp - Samples only from the smallest set of tokens whose cumulative probability exceeds topp.
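Putting the fields together, a configure call might look like this. This is a sketch: the field names follow the list above, the ToolCall shape is simplified for illustration, and the exact types live in the LLMModule API Reference.

```typescript
// Simplified ToolCall shape for illustration; see the API reference
// for the real type.
type ToolCall = { toolName: string; arguments: Record<string, unknown> };

const config = {
  chatConfig: {
    systemPrompt: 'Be a helpful translator.',
    initialMessageHistory: [
      { role: 'user', content: 'Hi!' },
      { role: 'assistant', content: 'Hello! What should I translate?' },
    ],
  },
  toolsConfig: {
    tools: [], // tool definitions go here
    executeToolCallback: async (call: ToolCall) =>
      `No implementation for ${call.toolName}`,
    displayToolCalls: false,
  },
  generationConfig: {
    temperature: 0.7,
    topp: 0.9,
  },
};

// Applied to a loaded instance (chatConfig and toolsConfig only affect
// managed chats, i.e. sendMessage):
//   await llm.configure(config);
```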
Vision-Language Models (VLM)
Some models support multimodal input — text and images together. To use them, pass capabilities in the model object when calling fromModelName:
import { LLMModule, LFM2_VL_1_6B_QUANTIZED } from 'react-native-executorch';
const llm = await LLMModule.fromModelName(
LFM2_VL_1_6B_QUANTIZED,
undefined,
(token) => console.log(token)
);
The capabilities field is already set on the model constant. You can also construct the model object explicitly:
const llm = await LLMModule.fromModelName({
modelName: 'lfm2.5-vl-1.6b-quantized',
modelSource: require('./path/to/model.pte'),
tokenizerSource: require('./path/to/tokenizer.json'),
tokenizerConfigSource: require('./path/to/tokenizer_config.json'),
capabilities: ['vision'],
});
Once loaded, pass imagePath to sendMessage:
const response = await llm.sendMessage('What is in this image?', {
imagePath: '/path/to/image.jpg',
});
Or use generate with mediaPath on the message:
const chat: Message[] = [
{
role: 'user',
content: 'Describe this image.',
mediaPath: '/path/to/image.jpg',
},
];
const response = await llm.generate(chat);
Using a custom model
Use fromCustomModel to load your own exported LLM instead of a built-in preset:
import { LLMModule } from 'react-native-executorch';
const llm = await LLMModule.fromCustomModel(
'https://example.com/model.pte',
'https://example.com/tokenizer.json',
'https://example.com/tokenizer_config.json',
(progress) => console.log(progress),
(token) => console.log(token),
(messages) => console.log(messages)
);
Required model contract
The .pte model binary must be exported following the ExecuTorch LLM export process. The native runner expects the standard ExecuTorch text-generation interface — KV-cache management, prefill/decode phases, and logit sampling are all handled by the runtime.
Deleting the model from memory
To delete the model from memory, you can use the delete method.