
# Inference Time

:::warning
Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
:::
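As a rough illustration of this methodology, the sketch below times consecutive runs after one warm-up call. The `Model` interface and `averageInferenceMs` helper are hypothetical, for illustration only, and are not part of the react-native-executorch API:

```typescript
// Hypothetical timing harness; `Model` here is an illustration,
// not the actual react-native-executorch interface.
interface Model {
  forward(): Promise<void>;
}

// Runs the model once to warm up (loading/initialization can make the
// first run up to ~2x slower), then averages the wall-clock time of
// `runs` consecutive inferences.
async function averageInferenceMs(model: Model, runs: number): Promise<number> {
  await model.forward(); // warm-up run, excluded from the average
  let totalMs = 0;
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await model.forward();
    totalMs += Date.now() - start;
  }
  return totalMs / runs;
}
```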

## Classification

| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 13 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| EFFICIENTNET_V2_S | 100 | 120 | 130 | 180 | 170 |

## Object Detection

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 13 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| SSDLITE_320_MOBILENET_V3_LARGE | 190 | 260 | 280 | 100 | 90 |

## Style Transfer

| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 13 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| STYLE_TRANSFER_CANDY | 450 | 600 | 750 | 1650 | 1800 |
| STYLE_TRANSFER_MOSAIC | 450 | 600 | 750 | 1650 | 1800 |
| STYLE_TRANSFER_UDNIE | 450 | 600 | 750 | 1650 | 1800 |
| STYLE_TRANSFER_RAIN_PRINCESS | 450 | 600 | 750 | 1650 | 1800 |

## OCR

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | Samsung Galaxy S21 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Detector (CRAFT_800) | 2099 | 2227 | ❌ | 2245 | 7108 |
| Recognizer (CRNN_512) | 70 | 252 | ❌ | 54 | 151 |
| Recognizer (CRNN_256) | 39 | 123 | ❌ | 24 | 78 |
| Recognizer (CRNN_128) | 17 | 83 | ❌ | 14 | 39 |

❌ - Insufficient RAM.

## Vertical OCR

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | Samsung Galaxy S21 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Detector (CRAFT_1280) | 5457 | 5833 | ❌ | 6296 | 14053 |
| Detector (CRAFT_320) | 1351 | 1460 | ❌ | 1485 | 3101 |
| Recognizer (CRNN_512) | 39 | 123 | ❌ | 24 | 78 |
| Recognizer (CRNN_64) | 10 | 33 | ❌ | 7 | 18 |

❌ - Insufficient RAM.

## LLMs

| Model | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone 13 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] |
| --- | --- | --- | --- | --- | --- |
| LLAMA3_2_1B | 16.1 | 11.4 | ❌ | 15.6 | 19.3 |
| LLAMA3_2_1B_SPINQUANT | 40.6 | 16.7 | 16.5 | 40.3 | 48.2 |
| LLAMA3_2_1B_QLORA | 31.8 | 11.4 | 11.2 | 37.3 | 44.4 |
| LLAMA3_2_3B | ❌ | ❌ | ❌ | ❌ | 7.1 |
| LLAMA3_2_3B_SPINQUANT | 17.2 | 8.2 | ❌ | 16.2 | 19.4 |
| LLAMA3_2_3B_QLORA | 14.5 | ❌ | ❌ | 14.8 | 18.1 |

❌ - Insufficient RAM.

## Speech to text

### Streaming mode

Note that the Whisper model takes 30-second audio chunks as input (shorter audio is automatically padded with silence to 30 seconds). Because of the streaming algorithm, fast mode has the lowest latency (time from starting transcription to the first returned token) but the slowest transcription speed. For the lowest latency and the fastest transcription we suggest using the Moonshine model; if you still want to use Whisper, prefer the balanced mode.

| Model (mode) | iPhone 16 Pro (XNNPACK) [latency \| tokens/s] | iPhone 14 Pro (XNNPACK) [latency \| tokens/s] | iPhone SE 3 (XNNPACK) [latency \| tokens/s] | Samsung Galaxy S24 (XNNPACK) [latency \| tokens/s] | OnePlus 12 (XNNPACK) [latency \| tokens/s] |
| --- | --- | --- | --- | --- | --- |
| Moonshine-tiny (fast) | 0.8s \| 19.0t/s | 1.5s \| 11.3t/s | 1.5s \| 10.4t/s | 2.0s \| 8.8t/s | 1.6s \| 12.5t/s |
| Moonshine-tiny (balanced) | 2.0s \| 20.0t/s | 3.2s \| 12.4t/s | 3.7s \| 10.4t/s | 4.6s \| 11.2t/s | 3.4s \| 14.6t/s |
| Moonshine-tiny (quality) | 4.3s \| 16.8t/s | 6.6s \| 10.8t/s | 8.0s \| 8.9t/s | 7.7s \| 11.1t/s | 6.8s \| 13.1t/s |
| Whisper-tiny (fast) | 2.8s \| 5.5t/s | 3.7s \| 4.4t/s | 4.4s \| 3.4t/s | 5.5s \| 3.1t/s | 5.3s \| 3.8t/s |
| Whisper-tiny (balanced) | 5.6s \| 7.9t/s | 7.0s \| 6.3t/s | 8.3s \| 5.0t/s | 8.4s \| 6.7t/s | 7.7s \| 7.2t/s |
| Whisper-tiny (quality) | 10.3s \| 8.3t/s | 12.6s \| 6.8t/s | 7.8s \| 8.9t/s | 13.5s \| 7.1t/s | 12.9s \| 7.5t/s |
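As noted above, Whisper only accepts 30-second chunks, so shorter recordings are padded with silence before encoding. A minimal sketch of that step, assuming 16 kHz mono samples in a `Float32Array` (the function name and the waveform-level approach are illustrative, not the library's internals):

```typescript
// Whisper expects fixed 30 s input; shorter audio is padded with silence.
// Assumes 16 kHz mono PCM samples (an illustration of the padding step,
// not the library's internal implementation).
const SAMPLE_RATE = 16_000;
const CHUNK_SECONDS = 30;

function padTo30Seconds(samples: Float32Array): Float32Array {
  const target = SAMPLE_RATE * CHUNK_SECONDS; // 480,000 samples
  if (samples.length >= target) return samples.slice(0, target);
  const padded = new Float32Array(target); // zero-initialized = silence
  padded.set(samples);
  return padded;
}
```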

### Encoding

Average time for encoding audio of a given length, over 10 runs. For the Whisper model we only list 30-second audio chunks, since Whisper does not accept other lengths (shorter audio must be padded with silence to 30 seconds).

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Moonshine-tiny (5s) | 99 | 95 | 115 | 284 | 277 |
| Moonshine-tiny (10s) | 178 | 177 | 204 | 555 | 528 |
| Moonshine-tiny (30s) | 580 | 576 | 689 | 1726 | 1617 |
| Whisper-tiny (30s) | 1034 | 1344 | 1269 | 2916 | 2143 |

### Decoding

Average time for decoding one token in a sequence of 100 tokens, where the encoding context is obtained from audio of the noted length.

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Moonshine-tiny (5s) | 48.98 | 47.98 | 46.86 | 36.70 | 29.03 |
| Moonshine-tiny (10s) | 54.24 | 51.74 | 55.07 | 46.31 | 32.41 |
| Moonshine-tiny (30s) | 76.38 | 76.19 | 87.37 | 65.61 | 45.04 |
| Whisper-tiny (30s) | 128.03 | 113.65 | 141.63 | 89.08 | 84.49 |
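To relate these per-token times to the tokens/s figures used in the streaming table, invert the average: for example, roughly 128 ms/token corresponds to about 7.8 tokens/s. A trivial helper (hypothetical, for illustration only):

```typescript
// Converts an average per-token decoding time in milliseconds to
// throughput in tokens per second.
function tokensPerSecond(msPerToken: number): number {
  return 1000 / msPerToken;
}
```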

## Text Embeddings

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| ALL_MINILM_L6_V2 | 15 | 22 | 23 | 36 | 31 |
| ALL_MPNET_BASE_V2 | 71 | 96 | 101 | 112 | 105 |
| MULTI_QA_MINILM_L6_COS_V1 | 15 | 22 | 23 | 36 | 31 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 71 | 95 | 100 | 112 | 105 |
| CLIP_VIT_BASE_PATCH32_TEXT | 31 | 47 | 48 | 55 | 49 |
:::info
Benchmark times for text embeddings are highly dependent on the sentence length. The numbers above are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.
:::

## Image Embeddings

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| CLIP_VIT_BASE_PATCH32_IMAGE | 48 | 64 | 69 | 65 | 63 |
:::info
Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time.
:::
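The resizing step described above can be sketched with plain nearest-neighbor sampling. This is only an illustration of the idea, assuming a grayscale row-major `Float32Array`; the library's actual interpolation method may differ:

```typescript
// Nearest-neighbor resize of a grayscale image stored row-major in a
// Float32Array. Illustrates the 224x224 preprocessing step; the library
// may use a different interpolation method.
function resizeNearest(
  src: Float32Array,
  srcW: number,
  srcH: number,
  dstW = 224,
  dstH = 224,
): Float32Array {
  const dst = new Float32Array(dstW * dstH);
  for (let y = 0; y < dstH; y++) {
    const sy = Math.min(srcH - 1, Math.floor((y * srcH) / dstH));
    for (let x = 0; x < dstW; x++) {
      const sx = Math.min(srcW - 1, Math.floor((x * srcW) / dstW));
      dst[y * dstW + x] = src[sy * srcW + sx];
    }
  }
  return dst;
}
```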