Version: Next

Inference Time

info

Times presented in the tables are measured over consecutive runs of the model. The initial run may take up to 2x longer due to model loading and initialization.

Inference times are measured directly in native C++ code, wrapping only the model's forward pass and excluding input-dependent pre- and post-processing (e.g., image resizing and normalization) as well as any overhead from the React Native runtime.
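The warm-up-then-measure pattern described above can be sketched as follows. This is only an illustration: `runForward` is a hypothetical stand-in for a model's forward pass (in the real benchmarks the timing wraps the native C++ call), not part of the library API.

```typescript
// Hypothetical stand-in for a model's forward pass; simulates ~5 ms of work.
async function runForward(): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
}

// Run once untimed to absorb loading/initialization cost (the "up to 2x"
// first run), then report the average over consecutive timed runs.
async function averageInferenceMs(runs: number): Promise<number> {
  await runForward(); // warm-up run, excluded from the average
  const start = Date.now();
  for (let i = 0; i < runs; i++) {
    await runForward();
  }
  return (Date.now() - start) / runs;
}
```

Averaging over consecutive runs after a warm-up is what makes the table values comparable across devices.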

Classification

note

For this model, all input images, whether larger or smaller than the model's expected input size, are resized before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| EFFICIENTNET_V2_S (XNNPACK FP32) | 70 | 100 |
| EFFICIENTNET_V2_S (XNNPACK INT8) | 22 | 38 |
| EFFICIENTNET_V2_S (Core ML FP32) | 12 | - |
| EFFICIENTNET_V2_S (Core ML FP16) | 5 | - |

Object Detection

note

For these models, all input images, whether larger or smaller than the model's expected input size, are resized before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total time.

Times presented in the tables are measured for YOLO models with an input size of 512×512. Other input sizes may yield slower or faster inference times. RF-DETR Nano uses a fixed resolution of 312×312.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| SSDLITE_320_MOBILENET_V3_LARGE (XNNPACK FP32) | 20 | 18 |
| SSDLITE_320_MOBILENET_V3_LARGE (Core ML FP32) | 18 | - |
| SSDLITE_320_MOBILENET_V3_LARGE (Core ML FP16) | 8 | - |
| RF_DETR_NANO (XNNPACK FP32) | 101 | 277 |
| YOLO26N (XNNPACK FP32) | 29 | 38 |
| YOLO26S (XNNPACK FP32) | 60 | 72 |
| YOLO26M (XNNPACK FP32) | 134 | 177 |
| YOLO26L (XNNPACK FP32) | 169 | 216 |
| YOLO26X (XNNPACK FP32) | 371 | 434 |

Style Transfer

note

For these models, all input images, whether larger or smaller than the model's expected input size, are resized before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| STYLE_TRANSFER_CANDY (XNNPACK FP32) | 1192 | 1025 |
| STYLE_TRANSFER_CANDY (XNNPACK INT8) | 272 | 430 |
| STYLE_TRANSFER_CANDY (Core ML FP32) | 100 | - |
| STYLE_TRANSFER_CANDY (Core ML FP16) | 150 | - |
| STYLE_TRANSFER_MOSAIC (XNNPACK FP32) | 1192 | 1025 |
| STYLE_TRANSFER_MOSAIC (XNNPACK INT8) | 272 | 430 |
| STYLE_TRANSFER_MOSAIC (Core ML FP32) | 100 | - |
| STYLE_TRANSFER_MOSAIC (Core ML FP16) | 150 | - |
| STYLE_TRANSFER_UDNIE (XNNPACK FP32) | 1192 | 1025 |
| STYLE_TRANSFER_UDNIE (XNNPACK INT8) | 272 | 430 |
| STYLE_TRANSFER_UDNIE (Core ML FP32) | 100 | - |
| STYLE_TRANSFER_UDNIE (Core ML FP16) | 150 | - |
| STYLE_TRANSFER_RAIN_PRINCESS (XNNPACK FP32) | 1192 | 1025 |
| STYLE_TRANSFER_RAIN_PRINCESS (XNNPACK INT8) | 272 | 430 |
| STYLE_TRANSFER_RAIN_PRINCESS (Core ML FP32) | 100 | - |
| STYLE_TRANSFER_RAIN_PRINCESS (Core ML FP16) | 150 | - |

OCR

Note that the recognizer models were executed between 3 and 7 times during a single recognition. The values below represent the averages across all runs for the benchmark image.

| Model | iPhone 17 Pro [ms] | iPhone 16 Pro [ms] | iPhone SE 3 [ms] | Samsung Galaxy S24 [ms] | OnePlus 12 [ms] |
| --- | --- | --- | --- | --- | --- |
| Total Inference Time | 652 | 600 | 2855 | 1092 | 1034 |
| Detector (CRAFT) forward_800 | 220 | 221 | 1740 | 521 | 492 |
| Recognizer (CRNN) forward_512 | 45 | 38 | 110 | 40 | 38 |
| Recognizer (CRNN) forward_256 | 21 | 18 | 54 | 20 | 19 |
| Recognizer (CRNN) forward_128 | 11 | 9 | 27 | 10 | 10 |

Vertical OCR

note

Recognizer models, as well as the detector's forward_320 method, were executed between 4 and 21 times during a single recognition.

The values below represent the averages across all runs for the benchmark image.

| Model | iPhone 17 Pro [ms] | iPhone 16 Pro [ms] | iPhone SE 3 [ms] | Samsung Galaxy S24 [ms] | OnePlus 12 [ms] |
| --- | --- | --- | --- | --- | --- |
| Total Inference Time | 1104 | 1113 | 8840 | 2845 | 2640 |
| Detector (CRAFT) forward_1280 | 501 | 507 | 4317 | 1405 | 1275 |
| Detector (CRAFT) forward_320 | 125 | 121 | 1060 | 338 | 299 |
| Recognizer (CRNN) forward_512 | 46 | 42 | 109 | 47 | 37 |
| Recognizer (CRNN) forward_64 | 5 | 6 | 14 | 7 | 6 |

LLMs

| Model | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone 13 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] |
| --- | --- | --- | --- | --- | --- |
| LLAMA3_2_1B | 16.1 | 11.4 | ❌ | 15.6 | 19.3 |
| LLAMA3_2_1B_SPINQUANT | 40.6 | 16.7 | 16.5 | 40.3 | 48.2 |
| LLAMA3_2_1B_QLORA | 31.8 | 11.4 | 11.2 | 37.3 | 44.4 |
| LLAMA3_2_3B | 7.1 | ❌ | ❌ | ❌ | ❌ |
| LLAMA3_2_3B_SPINQUANT | 17.2 | 8.2 | ❌ | 16.2 | 19.4 |
| LLAMA3_2_3B_QLORA | 14.5 | ❌ | ❌ | 14.8 | 18.1 |

❌ - Insufficient RAM.
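Note that the LLM table reports throughput in tokens/s, while the speech-to-text decoding table below reports milliseconds per token; the two units are simple reciprocals. A tiny illustrative conversion helper (not part of the library):

```typescript
// Throughput (tokens/s) and average per-token latency (ms) are reciprocals.
const tokensPerSecToMsPerToken = (tps: number): number => 1000 / tps;
const msPerTokenToTokensPerSec = (msPerToken: number): number => 1000 / msPerToken;
```

For example, a model generating 40 tokens/s spends on average 25 ms per token.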

Speech to Text

Encoding

Average time to encode audio of a given length, over 10 runs. For the Whisper model we list only 30-second audio chunks, since Whisper does not accept other lengths (shorter audio must be padded to 30 seconds with silence).

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Whisper-tiny (30s) | 89 | 93 | 403 | 277 | 260 |
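The silence-padding requirement mentioned above can be sketched as follows, assuming 16 kHz mono input (the sampling rate Whisper-family models expect); the helper name is illustrative, not a library API.

```typescript
// Whisper expects fixed 30 s chunks: shorter audio is padded with silence
// (zeros), longer audio is truncated to the first 30 s.
const SAMPLE_RATE = 16_000; // Hz, assumed Whisper input rate
const CHUNK_SAMPLES = 30 * SAMPLE_RATE; // 480,000 samples

function padTo30Seconds(audio: Float32Array): Float32Array {
  const chunk = new Float32Array(CHUNK_SAMPLES); // zero-filled = silence
  chunk.set(audio.subarray(0, CHUNK_SAMPLES));
  return chunk;
}
```

This is why the encoding time for a short recording is the same as for a full 30-second clip: the model always sees a 30-second chunk.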

Decoding

Average time to decode one token in a sequence of approximately 100 tokens, with the encoding context obtained from audio of the noted length.

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Whisper-tiny (30s) | 6 | 6 | 40 | 28 | 25 |

Text to Speech

Average time to synthesize speech from an input text of approximately 60 tokens, resulting in 2 to 5 seconds of audio depending on the input and selected voice.

| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- |
| Kokoro-small | 2051 | 1548 |
| Kokoro-medium | 2124 | 1625 |

Text Embeddings

note

Benchmark times for text embeddings are highly dependent on the sentence length. The numbers below are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.

| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- |
| ALL_MINILM_L6_V2 | 7 | 21 |
| ALL_MPNET_BASE_V2 | 24 | 90 |
| MULTI_QA_MINILM_L6_COS_V1 | 7 | 19 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 24 | 88 |
| CLIP_VIT_BASE_PATCH32_TEXT | 14 | 39 |

Image Embeddings

note

For this model, all input images, whether larger or smaller than the model's expected input size, are resized before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| CLIP_VIT_BASE_PATCH32_IMAGE (XNNPACK FP32) | 14 | 68 |
| CLIP_VIT_BASE_PATCH32_IMAGE (XNNPACK INT8) | 11 | 31 |

Semantic Segmentation

note

For these models, all input images, whether larger or smaller than the model's expected input size, are resized before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total time.

| Model / Device | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
| --- | --- | --- |
| DEEPLAB_V3_RESNET50 (XNNPACK FP32) | 2000 | 2200 |
| DEEPLAB_V3_RESNET50 (XNNPACK INT8) | 118 | 380 |
| DEEPLAB_V3_RESNET101 (XNNPACK FP32) | 2900 | 3300 |
| DEEPLAB_V3_RESNET101 (XNNPACK INT8) | 174 | 660 |
| DEEPLAB_V3_MOBILENET_V3_LARGE (XNNPACK FP32) | 131 | 153 |
| DEEPLAB_V3_MOBILENET_V3_LARGE (XNNPACK INT8) | 17 | 40 |
| LRASPP_MOBILENET_V3_LARGE (XNNPACK FP32) | 13 | 36 |
| LRASPP_MOBILENET_V3_LARGE (XNNPACK INT8) | 12 | 20 |
| FCN_RESNET50 (XNNPACK FP32) | 1800 | 2160 |
| FCN_RESNET50 (XNNPACK INT8) | 100 | 320 |
| FCN_RESNET101 (XNNPACK FP32) | 2600 | 3160 |
| FCN_RESNET101 (XNNPACK INT8) | 160 | 620 |

Instance Segmentation

note

Times presented in the tables are measured for YOLO models with an input size of 512×512. Other input sizes may yield slower or faster inference times. RF-DETR Nano Seg uses a fixed resolution of 312×312.

| Model | Samsung Galaxy S24 (XNNPACK) [ms] | iPhone 17 Pro (XNNPACK) [ms] |
| --- | --- | --- |
| YOLO26N_SEG | 92 | 90 |
| YOLO26S_SEG | 220 | 188 |
| YOLO26M_SEG | 570 | 550 |
| YOLO26L_SEG | 680 | 608 |
| YOLO26X_SEG | 1410 | 1338 |
| RF_DETR_NANO_SEG | 549 | 330 |

Text to Image

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| BK_SDM_TINY_VPRED_256 | 2118 | 4210 | 21188 | 3416 | 617 |