Version: Next

useImageEmbeddings

Image Embedding is the process of converting an image into a numerical representation. This representation can be used for tasks, such as classification, clustering and (using contrastive learning like e.g. CLIP model) image search.

caution

It is recommended to use models provided by us, which are available at our Hugging Face repository. You can also use constants shipped with our library.

Reference

import {
  useImageEmbeddings,
  CLIP_VIT_BASE_PATCH32_IMAGE,
} from 'react-native-executorch';

const model = useImageEmbeddings({ model: CLIP_VIT_BASE_PATCH32_IMAGE });

try {
  const imageEmbedding = await model.forward('https://url-to-image.jpg');
} catch (error) {
  console.error(error);
}

Arguments

model - Object containing the model source.

modelSource - A string that specifies the location of the model binary.

preventLoad? - Boolean that can prevent automatic model loading (and downloading the data if you load it for the first time) after running the hook.

For more information on loading resources, take a look at loading models page.

Returns

Field	Type	Description
`forward`	`(imageSource: string) => Promise<Float32Array>`	Executes the model's forward pass, where `imageSource` is a URI/URL to image that will be embedded.
`error`	`string \| null`	Contains the error message if the model failed to load.
`isGenerating`	`boolean`	Indicates whether the model is currently processing an inference.
`isReady`	`boolean`	Indicates whether the model has successfully loaded and is ready for inference.
`downloadProgress`	`number`	Represents the download progress as a value between 0 and 1.

Running the model

To run the model, you can use the forward method. It accepts one argument which is a URI/URL to an image you want to encode. The function returns a promise, which can resolve either to an error or an array of numbers representing the embedding.

Example

const dotProduct = (a: Float32Array, b: Float32Array) =>
  a.reduce((sum, val, i) => sum + val * b[i], 0);

const cosineSimilarity = (a: Float32Array, b: Float32Array) => {
  const dot = dotProduct(a, b);
  const normA = Math.sqrt(dotProduct(a, a));
  const normB = Math.sqrt(dotProduct(b, b));
  return dot / (normA * normB);
};

try {
  // we assume you've provided catImage and dogImage
  const catImageEmbedding = await model.forward(catImage);
  const dogImageEmbedding = await model.forward(dogImage);

  const similarity = cosineSimilarity(catImageEmbedding, dogImageEmbedding);

  console.log(`Cosine similarity: ${similarity}`);
} catch (error) {
  console.error(error);
}

Supported models

Model	Language	Image size	Embedding dimensions	Description
clip-vit-base-patch32-image	English	224×224	512	CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the image encoder part of the CLIP model. To embed text checkout clip-vit-base-patch32-text.

Image size - the size of an image that the model takes as an input. Resize will happen automatically.

Embedding Dimensions - the size of the output embedding vector. This is the number of dimensions in the vector representation of the input image.

info

For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.

Benchmarks

Model size

Model	XNNPACK [MB]
CLIP_VIT_BASE_PATCH32_IMAGE	352

Memory usage

Model	Android (XNNPACK) [MB]	iOS (XNNPACK) [MB]
CLIP_VIT_BASE_PATCH32_IMAGE	350	340

Inference time

warning

Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization. Performance also heavily depends on image size, because resize is expansive operation, especially on low-end devices.

Model	iPhone 17 Pro (XNNPACK) [ms]	OnePlus 12 (XNNPACK) [ms]
CLIP_VIT_BASE_PATCH32_IMAGE	18	55

info

Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time.

Reference​

Arguments​

Returns​

Running the model​

Example​

Supported models​

Benchmarks​

Model size​

Memory usage​

Inference time​