LLMModule
TypeScript API implementation of the useLLM hook.
API Reference
- For a detailed API Reference for LLMModule see: LLMModule API Reference.
- For all LLM models available out-of-the-box in React Native ExecuTorch see: LLM Models.
- For useful LLM utility functionalities, see: LLM Utility Functionalities.
High Level Overview
import { LLMModule, LLAMA3_2_1B_QLORA } from 'react-native-executorch';
// Creating an instance
const llm = new LLMModule({
tokenCallback: (token) => console.log(token),
messageHistoryCallback: (messages) => console.log(messages),
});
// Loading the model
await llm.load(LLAMA3_2_1B_QLORA, (progress) => console.log(progress));
// Running the model - returns the generated response
const response = await llm.sendMessage('Hello, World!');
console.log('Response:', response);
// Interrupting the model (to actually stop generation, this has to be called while sendMessage or generate is running)
llm.interrupt();
// Deleting the model from memory
llm.delete();
Methods
All methods of LLMModule are explained in detail here: LLMModule API Reference.
Loading the model
To create a new instance of LLMModule, use the constructor with optional callbacks:
- tokenCallback - Function called on every generated token.
- messageHistoryCallback - Function called on every finished message.
Then, to load the model, use the load method. It accepts an object with the following fields:
- model - Object containing:
  - modelSource - The location of the used model.
  - tokenizerSource - The location of the used tokenizer.
  - tokenizerConfigSource - The location of the used tokenizer config.
- onDownloadProgressCallback - Callback to track download progress.
This method returns a promise that resolves once the model has loaded, or rejects with an error.
For more information on loading resources, take a look at the loading models page.
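For illustration, here is a minimal sketch of the object form described above. The source URLs are hypothetical placeholders; in practice, the exported model constants (such as LLAMA3_2_1B_QLORA used in the High Level Overview) bundle these sources for you.
await llm.load({
  model: {
    // Hypothetical remote sources - replace with your own, or use an exported model constant.
    modelSource: 'https://example.com/llama3_2_1b_qlora.pte',
    tokenizerSource: 'https://example.com/tokenizer.json',
    tokenizerConfigSource: 'https://example.com/tokenizer_config.json',
  },
  // Called whenever the download progress changes.
  onDownloadProgressCallback: (progress) => console.log('Download progress:', progress),
});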
Listening for download progress
To subscribe to the download progress event, you can pass the onDownloadProgressCallback function to the load method. This function is called whenever the download progress changes.
Running the model
To run the model, you can use the generate method. It allows you to pass chat messages and receive a completion from the model. It doesn't provide any message history management.
Alternatively, in a managed chat (see: Functional vs managed), you can use the sendMessage method. It accepts the user message and returns a promise that resolves to the generated response. Additionally, it will call messageHistoryCallback with the updated message history, containing both the user message and the model response.
If you need raw model access without any wrappers, you can use forward. It provides direct access to the model, so the input string is passed straight into the model, and it returns the generated response. It may be useful for working with models that aren't finetuned for chat completions. If you're not sure about the implications of that (e.g. that you have to include special model tokens yourself), you're better off with sendMessage.
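As a sketch, here is how the three entry points compare. The { role, content } message shape passed to generate is an assumption here; check the LLMModule API Reference for the exact Message type.
// Managed chat - history is kept for you and messageHistoryCallback fires when generation finishes.
const reply = await llm.sendMessage('Translate "good morning" to Spanish.');

// Functional use - you pass the whole conversation yourself on every call.
const completion = await llm.generate([
  { role: 'system', content: 'Be a helpful translator.' },
  { role: 'user', content: 'Translate "good morning" to Spanish.' },
]);

// Raw access - the string goes straight into the model, so any special tokens
// required by the prompt format are your responsibility.
const continuation = await llm.forward('Once upon a time');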
Listening for generated tokens
To subscribe to the token generation event, you can pass the tokenCallback or messageHistoryCallback functions to the constructor. tokenCallback is called on every token and contains only the most recent token. messageHistoryCallback is called whenever the model finishes generation and contains the whole message history, including the user's and the model's latest messages.
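For example, a sketch that streams tokens into a growing string and logs the final history:
let streamed = '';

const llm = new LLMModule({
  // Fires for each generated token (tokens may arrive grouped when batching
  // is configured - see Token Batching below).
  tokenCallback: (token) => {
    streamed += token;
  },
  // Fires once generation finishes, with the full message history.
  messageHistoryCallback: (messages) => {
    console.log('Messages so far:', messages.length);
  },
});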
Interrupting the model
In order to interrupt the model, you can use the interrupt method.
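For example (a sketch): start generation without awaiting it, then call interrupt while it is still running, e.g. from a "Stop" button handler. Whether the pending promise resolves with the partial output is an assumption here.
// Kick off generation but keep the promise around instead of awaiting it.
const pending = llm.sendMessage('Write a very long story.');

// Later, while generation is still running (e.g. from a "Stop" button handler):
llm.interrupt();

// The pending promise then settles; here we assume it resolves with the partial output.
const partial = await pending;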
Token Batching
Depending on the selected model and the user's device, generation speed can exceed 60 tokens per second. If tokenCallback triggers rerenders and is invoked on every single token, it can significantly degrade the app's performance. To alleviate this, we've implemented token batching. To configure it, call the configure method and pass a generationConfig. The next section lists what you can tweak with this config.
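For example, a sketch that limits how often tokens are delivered. The unit of batchTimeInterval is assumed to be milliseconds here; check the LLMModule API Reference.
llm.configure({
  generationConfig: {
    // Deliver roughly up to 16 tokens per callback invocation...
    outputTokenBatchSize: 16,
    // ...and at most one batch per interval (assumed to be in milliseconds).
    batchTimeInterval: 250,
  },
});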
Configuring the model
To configure the model (i.e. change the system prompt, load initial conversation history, manage tool calling, or set generation settings), you can use the configure method. Note that chatConfig and toolsConfig are only applied to managed chats, i.e. when using sendMessage (see: Functional vs managed). It accepts an object with the following fields (a sketch of a full configure call follows the list below):
- chatConfig - Object configuring chat management that contains:
  - systemPrompt - Often used to tell the model what its purpose is, for example: "Be a helpful translator".
  - initialMessageHistory - Object that represents the conversation history. This can be used to provide initial context to the model.
  - contextWindowLength - The number of messages from the current conversation that the model will use to generate a response. Keep in mind that larger context windows result in longer inference time and higher memory usage.
- toolsConfig - Object configuring options for enabling and managing tool use. It will only have an effect if your model's chat template supports it. Contains the following properties:
  - tools - List of objects defining tools.
  - executeToolCallback - Function that accepts a ToolCall, executes the tool, and returns a string to the model.
  - displayToolCalls - If set to true, JSON tool calls will be displayed in the chat. If false, only answers will be displayed.
- generationConfig - Object configuring generation settings with the following properties:
  - outputTokenBatchSize - Soft upper limit on the number of tokens in each token batch (in certain cases a batch can contain more tokens, e.g. when it would otherwise end with a special emoji join character).
  - batchTimeInterval - Upper limit on the time interval between consecutive token batches.
  - temperature - Scales output logits by the inverse of the temperature; controls the randomness/creativity of text generation.
  - topp - Samples only from the smallest set of tokens whose cumulative probability exceeds topp.
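Below is a sketch of a full configure call. The tool definition shape, the message shape in initialMessageHistory, and the synchronous executeToolCallback are illustrative assumptions; see the LLMModule API Reference for the exact types.
llm.configure({
  chatConfig: {
    systemPrompt: 'Be a helpful translator.',
    // Assumed message shape: { role, content }.
    initialMessageHistory: [
      { role: 'user', content: 'Hello!' },
      { role: 'assistant', content: 'Hi! What would you like to translate?' },
    ],
    // Only the most recent messages are sent to the model.
    contextWindowLength: 6,
  },
  toolsConfig: {
    // Illustrative tool definition.
    tools: [
      {
        name: 'get_weather',
        description: 'Returns the current weather for a city.',
        parameters: { city: 'string' },
      },
    ],
    // Execute the requested tool and return its result to the model as a string.
    executeToolCallback: (toolCall) => JSON.stringify({ city: 'Kraków', temperatureC: 21 }),
    displayToolCalls: false,
  },
  generationConfig: {
    temperature: 0.7,
    topp: 0.9,
  },
});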
Deleting the model from memory
To delete the model from memory, you can use the delete method.