Chatterbox Turbo CoreML+ONNX

On-device text-to-speech for Apple Silicon. Hybrid CoreML + ONNX Runtime pipeline: ~1.5s latency, 11x faster than MLX.

Based on ResembleAI/chatterbox-turbo.

Benchmarks (Apple M3 Pro)

| Stage | Time | Engine |
|---|---|---|
| T3 Prefill (380 tokens) | ~170ms | CoreML (CPU+GPU) |
| T3 Decode (per token) | ~43ms | ONNX Runtime C API |
| Conditional Decoder | ~850ms | ONNX Runtime |
| Total (~20 tokens) | ~1.5s | |
| Short phrase | ~1.0s | |
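As a rough rule of thumb, the per-stage timings compose into a naive serial-sum latency estimate. This is an illustrative sketch, not part of the package: the constants come from the table above, and the measured end-to-end totals are somewhat lower than this serial sum.

```swift
// Stage timings from the benchmark table above (Apple M3 Pro averages).
let prefillMs = 170.0          // T3 prefill, ~380 tokens
let decodeMsPerToken = 43.0    // T3 decode via ONNX Runtime C API
let decoderMs = 850.0          // conditional decoder (tokens to waveform)

/// Naive serial-sum estimate; measured totals above come in lower.
func estimatedLatencyMs(speechTokens: Int) -> Double {
    prefillMs + Double(speechTokens) * decodeMsPerToken + decoderMs
}

print(estimatedLatencyMs(speechTokens: 20))  // 1880.0
```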

Comparison

| Runtime | Total latency | vs MLX |
|---|---|---|
| This model | ~1.5s | 11x faster |
| ResembleAI ONNX (Python) | ~1.3s | 13x faster |
| CoreML only (our earlier approach) | ~5.5s | 3x faster |
| MLX 8-bit (macOS) | ~17s | baseline |

Optimization Journey

| Version | Latency | Key Change |
|---|---|---|
| v1: CoreML only | 5.5s | First working pipeline |
| v2: + ONNX decode | 2.5s | Hybrid CoreML prefill + ONNX decode |
| v3: + C API | 2.5s | Zero-copy tensors (43ms/tok) |
| v4: + conditional_decoder | ~1.5s | Single ONNX call replaces S3+vocoder |

Architecture

Two-stage pipeline:

Text → [GPT-2 BPE Tokenizer]
     → [T3 Prefill: CoreML] (speaker + text + speech conditioning)
     → [T3 Decode: ONNX C API] (autoregressive speech token generation)
     → [Conditional Decoder: ONNX] (speech tokens → waveform)
     → 24kHz Audio

  • T3 Prefill (CoreML): Handles the full Chatterbox conditioning: speaker projection, text embedding (50k vocab), speech embedding (6.5k vocab), and conditioning prompt tokens
  • T3 Decode (ONNX Runtime C API): Zero-copy KV cache for fast autoregressive generation at 43ms/token
  • Conditional Decoder (ONNX Runtime): Replaces 5 separate stages (S3 encoder, U-Net denoiser, flow matching, vocoder, ISTFT) with a single model call

Model Files

| File | Size | Description |
|---|---|---|
| T3Prefill.mlmodelc/ | 1.4 GB | CoreML GPT-2 prefill with conditioning |
| onnx/language_model_single.onnx | 1.2 GB | ONNX GPT-2 decode with KV cache |
| onnx/conditional_decoder_single.onnx | 770 MB | ONNX S3+vocoder (tokens → waveform) |
| speech_emb.npy | 26 MB | Speech token embedding table |
| default-conds.safetensors | 161 KB | Default voice profile |
| tokenizer.json | 3.4 MB | GPT-2 BPE tokenizer |
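A model directory can be sanity-checked before loading. The file list mirrors the table above; the helper itself is a hypothetical convenience, not part of the package.

```swift
import Foundation

// File names from the table above; T3Prefill.mlmodelc is a directory bundle.
let requiredModelFiles = [
    "T3Prefill.mlmodelc",
    "onnx/language_model_single.onnx",
    "onnx/conditional_decoder_single.onnx",
    "speech_emb.npy",
    "default-conds.safetensors",
    "tokenizer.json",
]

/// Returns the subset of required files missing from `dir`.
func missingModelFiles(in dir: URL) -> [String] {
    requiredModelFiles.filter { rel in
        !FileManager.default.fileExists(atPath: dir.appendingPathComponent(rel).path)
    }
}
```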

Quick Start (Swift)

```swift
import ChatterboxCoreML

let model = try await ChatterboxCoreMLModel.load(from: modelDirectory)
let audio = try await model.generate("Hello there.")
// audio is AVAudioPCMBuffer at 24kHz
```
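Since `generate` returns an `AVAudioPCMBuffer`, it can be played back with standard AVFoundation APIs. A minimal sketch (the `play` helper is illustrative and keeps error handling to a minimum; in a real app, keep the engine alive for the duration of playback):

```swift
import AVFoundation

// Play a 24kHz AVAudioPCMBuffer such as the one returned by generate(_:).
func play(_ buffer: AVAudioPCMBuffer) throws -> AVAudioEngine {
    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    engine.attach(player)
    // Connect using the buffer's own format so no resampling is forced.
    engine.connect(player, to: engine.mainMixerNode, format: buffer.format)
    try engine.start()
    player.scheduleBuffer(buffer, at: nil)   // schedule for immediate playback
    player.play()
    return engine                            // retain until playback finishes
}
```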

Requirements

  • macOS 15+ / iOS 18+
  • Apple Silicon (M1/M2/M3/M4)
  • Swift 6.2+
  • onnxruntime-swift-package-manager (1.24+)
  • swift-transformers, swift-safetensors

Voice Cloning

```swift
let voice = URL(filePath: "path/to/custom-conds.safetensors")
let audio = try await model.generate("Hello there.", voice: voice)
```

License

MIT (same as Chatterbox Turbo)

