Version: 0.4.x

Inference Time

warning

Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
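
As a minimal sketch of this measurement approach, the snippet below times several consecutive runs and averages them after discarding the first one, which includes model loading and initialization. The helper names (`timeRunsMs`, `steadyStateMeanMs`) and the sample timings are hypothetical illustrations, not part of the library or the actual benchmark harness.

```typescript
// Hypothetical helpers for illustration only; not a library API.

// Times `runs` consecutive invocations of an async model call, in milliseconds.
async function timeRunsMs(run: () => Promise<void>, runs: number): Promise<number[]> {
  const timings: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await run();
    timings.push(Date.now() - start);
  }
  return timings;
}

// Drops the first `warmupRuns` timings and averages the remaining ones.
function steadyStateMeanMs(timingsMs: number[], warmupRuns: number = 1): number {
  const steady = timingsMs.slice(warmupRuns);
  if (steady.length === 0) {
    throw new Error("need at least one run after warm-up");
  }
  const total = steady.reduce((sum, t) => sum + t, 0);
  return total / steady.length;
}

// Made-up timings: the first run is roughly 2x slower because of model loading.
const exampleTimingsMs = [310, 152, 149, 151, 148];
console.log(steadyStateMeanMs(exampleTimingsMs)); // 150
```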

Classification

| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| EFFICIENTNET_V2_S | 150 | 161 | 227 | 196 | 214 |

Object Detection

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| SSDLITE_320_MOBILENET_V3_LARGE | 261 | 279 | 414 | 125 | 115 |

Style Transfer

| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| STYLE_TRANSFER_CANDY | 1565 | 1675 | 2325 | 1750 | 1620 |
| STYLE_TRANSFER_MOSAIC | 1565 | 1675 | 2325 | 1750 | 1620 |
| STYLE_TRANSFER_UDNIE | 1565 | 1675 | 2325 | 1750 | 1620 |
| STYLE_TRANSFER_RAIN_PRINCESS | 1565 | 1675 | 2325 | 1750 | 1620 |

OCR

Notice that the recognizer models were executed between 3 and 7 times during a single recognition. The values below represent the averages across all runs for the benchmark image.

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Detector (CRAFT_800_QUANTIZED) | 779 | 897 | 1276 | 553 | 586 |
| Recognizer (CRNN_512) | 77 | 74 | 244 | 56 | 57 |
| Recognizer (CRNN_256) | 35 | 37 | 120 | 28 | 30 |
| Recognizer (CRNN_128) | 18 | 19 | 60 | 14 | 16 |
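
Because the recognizer figures are per-run averages, a single recognition costs more than one row suggests. The sketch below shows one rough way to combine the rows into an end-to-end estimate. It assumes the total is simply the detector time plus the sum of all recognizer runs, ignoring pre- and post-processing, and the split of runs across recognizers is hypothetical. `estimateOcrMs` is an illustrative helper, not a library function.

```typescript
// Illustrative helper, not a library API.
interface RecognizerRuns {
  avgMs: number; // average time per run, taken from the table
  runs: number;  // number of runs in one recognition (hypothetical here)
}

// Assumed model: total = one detector pass + every recognizer run.
function estimateOcrMs(detectorMs: number, recognizers: RecognizerRuns[]): number {
  return recognizers.reduce((total, r) => total + r.avgMs * r.runs, detectorMs);
}

// iPhone 17 Pro figures from the OCR table: detector 779 ms,
// CRNN_512 77 ms, CRNN_256 35 ms, CRNN_128 18 ms.
// The 2 + 2 + 1 = 5 recognizer runs are a made-up split within the 3 to 7 range.
const ocrEstimateMs = estimateOcrMs(779, [
  { avgMs: 77, runs: 2 },
  { avgMs: 35, runs: 2 },
  { avgMs: 18, runs: 1 },
]);
console.log(ocrEstimateMs); // 1021
```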

Vertical OCR

Notice that the recognizer models, as well as the CRAFT_320 detector model, were executed between 4 and 21 times during a single recognition. The values below represent the averages across all runs for the benchmark image.

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Detector (CRAFT_1280_QUANTIZED) | 1918 | 2304 | 3371 | 1391 | 1445 |
| Detector (CRAFT_320_QUANTIZED) | 473 | 563 | 813 | 361 | 382 |
| Recognizer (CRNN_512) | 78 | 83 | 310 | 59 | 57 |
| Recognizer (CRNN_64) | 9 | 9 | 38 | 8 | 7 |

LLMs

| Model | iPhone 17 Pro (XNNPACK) [tokens/s] | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] |
| --- | --- | --- | --- | --- | --- |
| LLAMA3_2_1B | 16.1 | 11.4 | ❌ | 15.6 | 19.3 |
| LLAMA3_2_1B_SPINQUANT | 40.6 | 16.7 | 16.5 | 40.3 | 48.2 |
| LLAMA3_2_1B_QLORA | 31.8 | 11.4 | 11.2 | 37.3 | 44.4 |
| LLAMA3_2_3B | ❌ | ❌ | ❌ | ❌ | 7.1 |
| LLAMA3_2_3B_SPINQUANT | 17.2 | 8.2 | ❌ | 16.2 | 19.4 |
| LLAMA3_2_3B_QLORA | 14.5 | ❌ | ❌ | 14.8 | 18.1 |

❌ - Insufficient RAM.

Speech to text

Streaming mode

Notice that Whisper takes 30-second audio chunks as input (shorter audio is automatically padded with silence to 30 seconds). For Whisper, fast mode has the lowest latency (the time from starting transcription to the first returned token, which is determined by the streaming algorithm) but the slowest transcription speed. For the lowest latency and the fastest transcription we therefore suggest the Moonshine model. If you still want to use Whisper, prefer the balanced mode.

| Model (mode) | iPhone 17 Pro (XNNPACK) [latency \| tokens/s] | iPhone 16 Pro (XNNPACK) [latency \| tokens/s] | iPhone SE 3 (XNNPACK) [latency \| tokens/s] | Samsung Galaxy S24 (XNNPACK) [latency \| tokens/s] | OnePlus 12 (XNNPACK) [latency \| tokens/s] |
| --- | --- | --- | --- | --- | --- |
| Moonshine-tiny (fast) | 0.8s \| 19.0t/s | 1.5s \| 11.3t/s | 1.5s \| 10.4t/s | 2.0s \| 8.8t/s | 1.6s \| 12.5t/s |
| Moonshine-tiny (balanced) | 2.0s \| 20.0t/s | 3.2s \| 12.4t/s | 3.7s \| 10.4t/s | 4.6s \| 11.2t/s | 3.4s \| 14.6t/s |
| Moonshine-tiny (quality) | 4.3s \| 16.8t/s | 6.6s \| 10.8t/s | 8.0s \| 8.9t/s | 7.7s \| 11.1t/s | 6.8s \| 13.1t/s |
| Whisper-tiny (fast) | 2.8s \| 5.5t/s | 3.7s \| 4.4t/s | 4.4s \| 3.4t/s | 5.5s \| 3.1t/s | 5.3s \| 3.8t/s |
| Whisper-tiny (balanced) | 5.6s \| 7.9t/s | 7.0s \| 6.3t/s | 8.3s \| 5.0t/s | 8.4s \| 6.7t/s | 7.7s \| 7.2t/s |
| Whisper-tiny (quality) | 10.3s \| 8.3t/s | 12.6s \| 6.8t/s | 7.8s \| 8.9t/s | 13.5s \| 7.1t/s | 12.9s \| 7.5t/s |
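
To make the latency versus speed trade-off concrete, the sketch below estimates how long it takes to receive a given number of tokens. It assumes a simple model in which the total time is the first-token latency plus the token count divided by the throughput. That model and the `secondsToTokens` helper are illustrative assumptions, and the figures are the iPhone 17 Pro values from the table.

```typescript
// Illustrative helper, not a library API.
// Assumed model: time to N tokens = first-token latency + N / throughput.
function secondsToTokens(latencyS: number, tokensPerS: number, tokens: number): number {
  return latencyS + tokens / tokensPerS;
}

// iPhone 17 Pro figures from the streaming table.
const streamingModes = [
  { name: "Moonshine-tiny (fast)", latencyS: 0.8, tokensPerS: 19.0 },
  { name: "Whisper-tiny (fast)", latencyS: 2.8, tokensPerS: 5.5 },
  { name: "Whisper-tiny (balanced)", latencyS: 5.6, tokensPerS: 7.9 },
];

for (const mode of streamingModes) {
  const seconds = secondsToTokens(mode.latencyS, mode.tokensPerS, 100);
  console.log(`${mode.name}: ~${seconds.toFixed(1)} s for 100 tokens`);
}
// Moonshine-tiny (fast): ~6.1 s for 100 tokens
// Whisper-tiny (fast): ~21.0 s for 100 tokens
// Whisper-tiny (balanced): ~18.3 s for 100 tokens
```

Under this assumption, Whisper's fast mode returns its first token sooner than balanced mode but finishes a 100-token transcription later, while Moonshine is ahead on both counts.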

Encoding

Average time to encode audio of the given length, over 10 runs. For Whisper we list only 30-second audio chunks, since Whisper does not accept other lengths (shorter audio has to be padded with silence to 30 seconds).

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Moonshine-tiny (5s) | 99 | 95 | 115 | 284 | 277 |
| Moonshine-tiny (10s) | 178 | 177 | 204 | 555 | 528 |
| Moonshine-tiny (30s) | 580 | 576 | 689 | 1726 | 1617 |
| Whisper-tiny (30s) | 1034 | 1344 | 1269 | 2916 | 2143 |
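
The padding to 30 seconds mentioned above amounts to appending zero-valued samples. The sketch below shows one way this could look. It assumes 16 kHz mono audio held in a `Float32Array` (the sample rate is not stated on this page), and `padToWhisperWindow` is a hypothetical helper rather than a library function.

```typescript
// Illustrative helper, not a library API.
const WHISPER_SAMPLE_RATE = 16000; // assumption: 16 kHz mono input
const WHISPER_WINDOW_SECONDS = 30;

// Returns a 30-second buffer with `samples` at the start and silence after it.
function padToWhisperWindow(
  samples: Float32Array,
  sampleRate: number = WHISPER_SAMPLE_RATE
): Float32Array {
  const windowLength = sampleRate * WHISPER_WINDOW_SECONDS;
  if (samples.length > windowLength) {
    throw new Error("audio longer than 30 s must be split into chunks first");
  }
  const padded = new Float32Array(windowLength); // zero-initialised, i.e. silence
  padded.set(samples);
  return padded;
}

// Five seconds of dummy audio padded to a full window.
const fiveSecondClip = new Float32Array(5 * WHISPER_SAMPLE_RATE).fill(0.1);
console.log(padToWhisperWindow(fiveSecondClip).length); // 480000
```

This is also why the table lists a single 30-second row for Whisper: a 5-second clip is encoded as a full 30-second window.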

Decoding

Average time to decode one token in a sequence of 100 tokens, with the encoding context obtained from audio of the noted length.

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Moonshine-tiny (5s) | 48.98 | 47.98 | 46.86 | 36.70 | 29.03 |
| Moonshine-tiny (10s) | 54.24 | 51.74 | 55.07 | 46.31 | 32.41 |
| Moonshine-tiny (30s) | 76.38 | 76.19 | 87.37 | 65.61 | 45.04 |
| Whisper-tiny (30s) | 128.03 | 113.65 | 141.63 | 89.08 | 84.49 |

Text Embeddings

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| ALL_MINILM_L6_V2 | 50 | 58 | 84 | 58 | 58 |
| ALL_MPNET_BASE_V2 | 352 | 428 | 879 | 483 | 517 |
| MULTI_QA_MINILM_L6_COS_V1 | 133 | 161 | 269 | 151 | 155 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 502 | 796 | 1216 | 915 | 713 |