Version: Next

Inference Time

warning

Times presented in the tables are measured over consecutive runs of the model. The initial run may take up to 2x longer due to model loading and initialization.
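The consecutive-run methodology above can be sketched as a small harness: one warm-up run absorbs loading and initialization, and the remaining runs are averaged. This is an illustrative sketch, not the project's actual benchmark code; `runModel` stands in for any forward-pass function.

```typescript
// Minimal benchmarking sketch: one warm-up run absorbs model loading and
// initialization, then consecutive runs are timed and averaged.
function benchmarkMs(runModel: () => void, runs: number = 10): number {
  runModel(); // warm-up: may take up to 2x longer than steady state
  let totalMs = 0;
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    runModel();
    totalMs += performance.now() - start;
  }
  return totalMs / runs; // average over consecutive (warm) runs
}
```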

Classification

info

Inference times are measured directly from native C++ code, wrapping only the model's forward pass and excluding input-dependent pre- and post-processing (e.g. image resizing, normalization) as well as any overhead from the React Native runtime.

info

For this model, all input images, whether larger or smaller than the expected input resolution, are resized before processing. Resizing is typically fast for small images but can be noticeably slower for very large ones, which increases total processing time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| EFFICIENTNET_V2_S (XNNPACK FP32) | 70 | 100 |
| EFFICIENTNET_V2_S (XNNPACK INT8) | 22 | 38 |
| EFFICIENTNET_V2_S (Core ML FP32) | 12 | - |
| EFFICIENTNET_V2_S (Core ML FP16) | 5 | - |

Object Detection

info

Inference times are measured directly from native C++ code, wrapping only the model's forward pass and excluding input-dependent pre- and post-processing (e.g. image resizing, normalization) as well as any overhead from the React Native runtime.

info

For this model, all input images, whether larger or smaller than the expected input resolution, are resized before processing. Resizing is typically fast for small images but can be noticeably slower for very large ones, which increases total processing time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| SSDLITE_320_MOBILENET_V3_LARGE (XNNPACK FP32) | 20 | 18 |
| SSDLITE_320_MOBILENET_V3_LARGE (Core ML FP32) | 18 | - |
| SSDLITE_320_MOBILENET_V3_LARGE (Core ML FP16) | 8 | - |

Style Transfer

info

Inference times are measured directly from native C++ code, wrapping only the model's forward pass and excluding input-dependent pre- and post-processing (e.g. image resizing, normalization) as well as any overhead from the React Native runtime.

info

For this model, all input images, whether larger or smaller than the expected input resolution, are resized before processing. Resizing is typically fast for small images but can be noticeably slower for very large ones, which increases total processing time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| STYLE_TRANSFER_CANDY (XNNPACK FP32) | 1192 | 1025 |
| STYLE_TRANSFER_CANDY (XNNPACK INT8) | 272 | 430 |
| STYLE_TRANSFER_CANDY (Core ML FP32) | 100 | - |
| STYLE_TRANSFER_CANDY (Core ML FP16) | 150 | - |
| STYLE_TRANSFER_MOSAIC (XNNPACK FP32) | 1192 | 1025 |
| STYLE_TRANSFER_MOSAIC (XNNPACK INT8) | 272 | 430 |
| STYLE_TRANSFER_MOSAIC (Core ML FP32) | 100 | - |
| STYLE_TRANSFER_MOSAIC (Core ML FP16) | 150 | - |
| STYLE_TRANSFER_UDNIE (XNNPACK FP32) | 1192 | 1025 |
| STYLE_TRANSFER_UDNIE (XNNPACK INT8) | 272 | 430 |
| STYLE_TRANSFER_UDNIE (Core ML FP32) | 100 | - |
| STYLE_TRANSFER_UDNIE (Core ML FP16) | 150 | - |
| STYLE_TRANSFER_RAIN_PRINCESS (XNNPACK FP32) | 1192 | 1025 |
| STYLE_TRANSFER_RAIN_PRINCESS (XNNPACK INT8) | 272 | 430 |
| STYLE_TRANSFER_RAIN_PRINCESS (Core ML FP32) | 100 | - |
| STYLE_TRANSFER_RAIN_PRINCESS (Core ML FP16) | 150 | - |

OCR

Note that the recognizer models were executed between 3 and 7 times during a single recognition. The values below are averages across all runs for the benchmark image.

| Model | iPhone 17 Pro [ms] | iPhone 16 Pro [ms] | iPhone SE 3 [ms] | Samsung Galaxy S24 [ms] | OnePlus 12 [ms] |
| --- | --- | --- | --- | --- | --- |
| Total Inference Time | 652 | 600 | 2855 | 1092 | 1034 |
| Detector (CRAFT) forward_800 | 220 | 221 | 1740 | 521 | 492 |
| Recognizer (CRNN) forward_512 | 45 | 38 | 110 | 40 | 38 |
| Recognizer (CRNN) forward_256 | 21 | 18 | 54 | 20 | 19 |
| Recognizer (CRNN) forward_128 | 11 | 9 | 27 | 10 | 10 |
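To see how the per-pass averages compose, a single recognition is roughly one detector pass plus several recognizer passes. The run counts below are hypothetical, chosen only for illustration (the text above states just that 3-7 recognizer runs occur); the gap between this sum and the Total Inference Time row suggests the total also includes work outside these forward passes.

```typescript
// Composing a single recognition from per-pass averages (iPhone 17 Pro
// figures). Run counts per recognizer variant are hypothetical.
const detectorMs = 220; // CRAFT forward_800, one pass
const recognizerPasses = [
  { avgMs: 45, runs: 2 }, // forward_512 (hypothetical count)
  { avgMs: 21, runs: 2 }, // forward_256 (hypothetical count)
  { avgMs: 11, runs: 3 }, // forward_128 (hypothetical count)
];
const recognizerMs = recognizerPasses.reduce(
  (sum, p) => sum + p.avgMs * p.runs,
  0,
); // 90 + 42 + 33 = 165
const modelTimeMs = detectorMs + recognizerMs; // 220 + 165 = 385
```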

Vertical OCR

Note that the recognizer models, as well as the detector's forward_320 method, were executed between 4 and 21 times during a single recognition. The values below are averages across all runs for the benchmark image.

| Model | iPhone 17 Pro [ms] | iPhone 16 Pro [ms] | iPhone SE 3 [ms] | Samsung Galaxy S24 [ms] | OnePlus 12 [ms] |
| --- | --- | --- | --- | --- | --- |
| Total Inference Time | 1104 | 1113 | 8840 | 2845 | 2640 |
| Detector (CRAFT) forward_1280 | 501 | 507 | 4317 | 1405 | 1275 |
| Detector (CRAFT) forward_320 | 125 | 121 | 1060 | 338 | 299 |
| Recognizer (CRNN) forward_512 | 46 | 42 | 109 | 47 | 37 |
| Recognizer (CRNN) forward_64 | 5 | 6 | 14 | 7 | 6 |

LLMs

| Model | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone 13 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] |
| --- | --- | --- | --- | --- | --- |
| LLAMA3_2_1B | 16.1 | 11.4 | ❌ | 15.6 | 19.3 |
| LLAMA3_2_1B_SPINQUANT | 40.6 | 16.7 | 16.5 | 40.3 | 48.2 |
| LLAMA3_2_1B_QLORA | 31.8 | 11.4 | 11.2 | 37.3 | 44.4 |
| LLAMA3_2_3B | ❌ | ❌ | ❌ | ❌ | 7.1 |
| LLAMA3_2_3B_SPINQUANT | 17.2 | 8.2 | ❌ | 16.2 | 19.4 |
| LLAMA3_2_3B_QLORA | 14.5 | ❌ | ❌ | 14.8 | 18.1 |

❌ - Insufficient RAM.
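Since LLM rows are reported as throughput rather than latency, it can help to translate them into an expected generation time for a response of a given length. A minimal sketch, using one throughput figure from the table above:

```typescript
// Convert throughput (tokens/s) into an estimated generation time.
function generationSeconds(tokensPerSecond: number, tokens: number): number {
  return tokens / tokensPerSecond;
}

// Example: LLAMA3_2_1B_SPINQUANT on OnePlus 12 runs at 48.2 tokens/s,
// so a 200-token response takes roughly 200 / 48.2 ≈ 4.1 seconds.
const estimate = generationSeconds(48.2, 200);
```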

Speech to Text

Encoding

Average time to encode audio of the given length, measured over 10 runs. For the Whisper model we list only 30-second audio chunks, since Whisper does not accept other lengths (shorter audio must be padded to 30 seconds with silence).

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Whisper-tiny (30s) | 248 | 254 | 1145 | 435 | 526 |
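The 30-second constraint means shorter audio must be zero-padded before encoding. A sketch of that padding, assuming mono PCM samples at Whisper's 16 kHz input sample rate (the function name is illustrative, not a library API):

```typescript
// Pad an audio buffer with silence to the 30 s chunk length Whisper
// expects. Assumes mono PCM at 16 kHz, Whisper's input sample rate.
// Longer audio would be split into multiple chunks; here we simply
// truncate for simplicity.
const SAMPLE_RATE = 16_000;
const CHUNK_SAMPLES = SAMPLE_RATE * 30; // 480,000 samples

function padToWhisperChunk(samples: Float32Array): Float32Array {
  if (samples.length >= CHUNK_SAMPLES) return samples.slice(0, CHUNK_SAMPLES);
  const padded = new Float32Array(CHUNK_SAMPLES); // zero-filled = silence
  padded.set(samples);
  return padded;
}
```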

Decoding

Average time to decode one token in a sequence of approximately 100 tokens, where the encoder context is obtained from audio of the noted length.

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Whisper-tiny (30s) | 23 | 25 | 121 | 92 | 115 |
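The encode and decode figures combine into a rough end-to-end estimate: one encode pass per 30 s chunk plus one decode pass per generated token. A sketch using the iPhone 17 Pro figures as read from the tables above:

```typescript
// Rough end-to-end estimate for transcribing one 30 s chunk: one encode
// pass plus one decode pass per generated token (~100 in this benchmark).
function transcriptionMs(
  encodeMs: number,
  decodeMsPerToken: number,
  tokens: number,
): number {
  return encodeMs + decodeMsPerToken * tokens;
}

// Whisper-tiny on iPhone 17 Pro: 248 + 23 * 100 = 2548 ms for ~100 tokens.
const estimateMs = transcriptionMs(248, 23, 100);
```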

Text to Speech

Average time to synthesize speech from an input text of approximately 60 tokens, resulting in 2 to 5 seconds of audio depending on the input and selected voice.

| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- |
| Kokoro-small | 2051 | 1548 |
| Kokoro-medium | 2124 | 1625 |

Text Embeddings

| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- |
| ALL_MINILM_L6_V2 | 7 | 21 |
| ALL_MPNET_BASE_V2 | 24 | 90 |
| MULTI_QA_MINILM_L6_COS_V1 | 7 | 19 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 24 | 88 |
| CLIP_VIT_BASE_PATCH32_TEXT | 14 | 39 |

info

Benchmark times for text embeddings depend strongly on sentence length. The numbers above are based on a sentence of around 80 tokens; inference time will vary accordingly for shorter or longer inputs.

Image Embeddings

info

Inference times are measured directly from native C++ code, wrapping only the model's forward pass and excluding input-dependent pre- and post-processing (e.g. image resizing, normalization) as well as any overhead from the React Native runtime.

info

For this model, all input images, whether larger or smaller than the expected input resolution, are resized before processing. Resizing is typically fast for small images but can be noticeably slower for very large ones, which increases total processing time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| CLIP_VIT_BASE_PATCH32_IMAGE (XNNPACK FP32) | 14 | 68 |
| CLIP_VIT_BASE_PATCH32_IMAGE (XNNPACK INT8) | 11 | 31 |

Semantic Segmentation

info

Inference times are measured directly from native C++ code, wrapping only the model's forward pass and excluding input-dependent pre- and post-processing (e.g. image resizing, normalization) as well as any overhead from the React Native runtime.

info

For this model, all input images, whether larger or smaller than the expected input resolution, are resized before processing. Resizing is typically fast for small images but can be noticeably slower for very large ones, which increases total processing time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| DEEPLAB_V3_RESNET50 (XNNPACK FP32) | 2000 | 2200 |
| DEEPLAB_V3_RESNET50 (XNNPACK INT8) | 118 | 380 |
| DEEPLAB_V3_RESNET101 (XNNPACK FP32) | 2900 | 3300 |
| DEEPLAB_V3_RESNET101 (XNNPACK INT8) | 174 | 660 |
| DEEPLAB_V3_MOBILENET_V3_LARGE (XNNPACK FP32) | 131 | 153 |
| DEEPLAB_V3_MOBILENET_V3_LARGE (XNNPACK INT8) | 17 | 40 |
| LRASPP_MOBILENET_V3_LARGE (XNNPACK FP32) | 13 | 36 |
| LRASPP_MOBILENET_V3_LARGE (XNNPACK INT8) | 12 | 20 |
| FCN_RESNET50 (XNNPACK FP32) | 1800 | 2160 |
| FCN_RESNET50 (XNNPACK INT8) | 100 | 320 |
| FCN_RESNET101 (XNNPACK FP32) | 2600 | 3160 |
| FCN_RESNET101 (XNNPACK INT8) | 160 | 620 |

Instance Segmentation

warning

Times presented in the tables are measured over consecutive runs of the model. The initial run may take up to 2x longer due to model loading and initialization.

warning

Times presented in the tables are measured for the forward method with an input size of 512. Other input sizes may yield slower or faster inference times.

| Model | Samsung Galaxy S24 (XNNPACK) [ms] | iPhone 17 Pro (XNNPACK) [ms] |
| --- | --- | --- |
| YOLO26N_SEG | 92 | 90 |
| YOLO26S_SEG | 220 | 188 |
| YOLO26M_SEG | 570 | 550 |
| YOLO26L_SEG | 680 | 608 |
| YOLO26X_SEG | 1410 | 1338 |
| RF_DETR_NANO_SEG | 549 | 330 |

Text to Image

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| BK_SDM_TINY_VPRED_256 | 21184 | 21021 | ❌ | 18834 | 16617 |

❌ - Insufficient RAM.