# Chatterbox Turbo CoreML+ONNX

On-device text-to-speech for Apple Silicon. Hybrid CoreML + ONNX Runtime pipeline with ~1.5s latency, 11x faster than MLX.

Based on ResembleAI/chatterbox-turbo.
## Benchmarks (Apple M3 Pro)
| Stage | Time | Engine |
|---|---|---|
| T3 Prefill (380 tokens) | ~170ms | CoreML (CPU+GPU) |
| T3 Decode (per token) | ~43ms | ONNX Runtime C API |
| Conditional Decoder | ~850ms | ONNX Runtime |
| Total (~20 tokens) | ~1.5s | |
| Short phrase | ~1.0s | |
## Comparison
| Runtime | Total latency | vs MLX |
|---|---|---|
| This model | ~1.5s | 11x faster |
| ResembleAI ONNX (Python) | ~1.3s | 13x faster |
| CoreML only (our earlier approach) | ~5.5s | 3x faster |
| MLX 8-bit (macOS) | ~17s | baseline |
## Optimization Journey
| Version | Latency | Key Change |
|---|---|---|
| v1: CoreML only | 5.5s | First working pipeline |
| v2: + ONNX decode | 2.5s | Hybrid CoreML prefill + ONNX decode |
| v3: + C API | 2.5s | Zero-copy tensors (43ms/tok) |
| v4: + conditional_decoder | ~1.5s | Single ONNX call replaces S3+vocoder |
## Architecture

Two-stage pipeline:

```
Text → [GPT-2 BPE Tokenizer]
     → [T3 Prefill: CoreML]         (speaker + text + speech conditioning)
     → [T3 Decode: ONNX C API]      (autoregressive speech token generation)
     → [Conditional Decoder: ONNX]  (speech tokens → waveform)
     → 24kHz Audio
```

- T3 Prefill (CoreML): Handles the full Chatterbox conditioning: speaker projection, text embedding (50k vocab), speech embedding (6.5k vocab), and conditioning prompt tokens
- T3 Decode (ONNX Runtime C API): Zero-copy KV cache for fast autoregressive generation at 43ms/token
- Conditional Decoder (ONNX Runtime): Replaces 5 separate stages (S3 encoder, U-Net denoiser, flow matching, vocoder, ISTFT) with a single model call
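The decode stage above is the latency-critical inner loop. A minimal sketch of its autoregressive control flow, with the ONNX Runtime call stubbed out (`decodeStep`, `eosToken`, and the canned token values are illustrative assumptions, not the real API):

```swift
// Assumed stop-token id for this sketch (not the real model's value).
let eosToken = 0

/// One decode step: takes the last token, returns the next one.
/// The real implementation feeds zero-copy KV-cache tensors to the
/// ONNX Runtime C API; here it is stubbed with a canned sequence.
func decodeStep(last: Int, position: Int) -> Int {
    let canned = [101, 102, 103]
    return position < canned.count ? canned[position] : eosToken
}

/// Greedy autoregressive generation: feed each token back in until EOS.
func generateSpeechTokens(bosToken: Int, maxTokens: Int) -> [Int] {
    var tokens: [Int] = []
    var last = bosToken
    for position in 0..<maxTokens {
        let next = decodeStep(last: last, position: position)
        if next == eosToken { break }  // stop on end-of-speech
        tokens.append(next)
        last = next
    }
    return tokens
}

let speechTokens = generateSpeechTokens(bosToken: 1, maxTokens: 16)
print(speechTokens)  // [101, 102, 103]
```

Because the KV cache lives in pre-allocated tensors, each iteration only moves a single token's worth of data, which is what brings the per-token cost down to ~43ms.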
## Model Files

| File | Size | Description |
|---|---|---|
| `T3Prefill.mlmodelc/` | 1.4 GB | CoreML GPT-2 prefill with conditioning |
| `onnx/language_model_single.onnx` | 1.2 GB | ONNX GPT-2 decode with KV cache |
| `onnx/conditional_decoder_single.onnx` | 770 MB | ONNX S3 + vocoder (tokens → waveform) |
| `speech_emb.npy` | 26 MB | Speech token embedding table |
| `default-conds.safetensors` | 161 KB | Default voice profile |
| `tokenizer.json` | 3.4 MB | GPT-2 BPE tokenizer |
## Quick Start (Swift)

```swift
import ChatterboxCoreML

let model = try await ChatterboxCoreMLModel.load(from: modelDirectory)
let audio = try await model.generate("Hello there.")
// audio is an AVAudioPCMBuffer at 24kHz
```
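To hear the result, the returned buffer can be played with a standard AVAudioEngine setup. A sketch assuming `audio` comes from the snippet above (error handling elided; Apple platforms only):

```swift
import AVFoundation

// Play the generated 24kHz buffer through the default output.
let engine = AVAudioEngine()
let player = AVAudioPlayerNode()
engine.attach(player)
engine.connect(player, to: engine.mainMixerNode, format: audio.format)
try engine.start()
player.scheduleBuffer(audio) {
    // Completion handler fires when the buffer finishes playing.
}
player.play()
```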
## Requirements

- macOS 15+ / iOS 18+
- Apple Silicon (M1/M2/M3/M4)
- Swift 6.2+
- Dependencies: onnxruntime-swift-package-manager (1.24+), swift-transformers, swift-safetensors
## Voice Cloning

```swift
let voice = URL(filePath: "path/to/custom-conds.safetensors")
let audio = try await model.generate("Hello there.", voice: voice)
```
## License
MIT (same as Chatterbox Turbo)