# Chatterbox Turbo CoreML+ONNX

On-device text-to-speech for Apple Silicon. Hybrid CoreML + ONNX Runtime pipeline with ~1.5s latency, 11x faster than MLX.

Based on ResembleAI/chatterbox-turbo.
## Benchmarks (Apple M3 Pro)
| Stage | Time | Engine |
|---|---|---|
| T3 Prefill (380 tokens) | ~170ms | CoreML (CPU+GPU) |
| T3 Decode (per token) | ~43ms | ONNX Runtime C API |
| Conditional Decoder | ~850ms | ONNX Runtime |
| Total (~20 tokens) | ~1.5s | |
| Short phrase | ~1.0s | |
## Comparison
| Runtime | Total latency | vs MLX |
|---|---|---|
| This model | ~1.5s | 11x faster |
| ResembleAI ONNX (Python) | ~1.3s | 13x faster |
| CoreML only (our earlier approach) | ~5.5s | 3x faster |
| MLX 8-bit (macOS) | ~17s | baseline |
## Optimization Journey
| Version | Latency | Key Change |
|---|---|---|
| v1: CoreML only | 5.5s | First working pipeline |
| v2: + ONNX decode | 2.5s | Hybrid CoreML prefill + ONNX decode |
| v3: + C API | 2.5s | Zero-copy tensors (43ms/tok) |
| v4: + conditional_decoder | ~1.5s | Single ONNX call replaces S3+vocoder |
## Architecture

Two-stage pipeline:

```
Text → [GPT-2 BPE Tokenizer]
     → [T3 Prefill: CoreML]         (speaker + text + speech conditioning)
     → [T3 Decode: ONNX C API]      (autoregressive speech token generation)
     → [Conditional Decoder: ONNX]  (speech tokens → waveform)
     → 24kHz Audio
```

- T3 Prefill (CoreML): Handles the full Chatterbox conditioning: speaker projection, text embedding (50k vocab), speech embedding (6.5k vocab), and conditioning prompt tokens
- T3 Decode (ONNX Runtime C API): Zero-copy KV cache for fast autoregressive generation at 43ms/token
- Conditional Decoder (ONNX Runtime): Replaces 5 separate stages (S3 encoder, U-Net denoiser, flow matching, vocoder, ISTFT) with a single model call
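The decode stage above is the latency-critical inner loop. A minimal sketch of its autoregressive control flow, with the ONNX Runtime call stubbed out (`decodeStep`, `eosToken`, and the canned token values are illustrative assumptions, not the real API):

```swift
// Assumed stop-token id for this sketch (not the real model's value).
let eosToken = 0

/// One decode step: takes the last token, returns the next one.
/// The real implementation feeds zero-copy KV-cache tensors to the
/// ONNX Runtime C API; here it is stubbed with a canned sequence.
func decodeStep(last: Int, position: Int) -> Int {
    let canned = [101, 102, 103]
    return position < canned.count ? canned[position] : eosToken
}

/// Greedy autoregressive generation: feed each token back in until EOS.
func generateSpeechTokens(bosToken: Int, maxTokens: Int) -> [Int] {
    var tokens: [Int] = []
    var last = bosToken
    for position in 0..<maxTokens {
        let next = decodeStep(last: last, position: position)
        if next == eosToken { break }  // stop on end-of-speech
        tokens.append(next)
        last = next
    }
    return tokens
}

let speechTokens = generateSpeechTokens(bosToken: 1, maxTokens: 16)
print(speechTokens)  // [101, 102, 103]
```

Because the KV cache lives in pre-allocated tensors, each iteration only moves a single token's worth of data, which is what brings the per-token cost down to ~43ms.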
## Model Files

| File | Size | Description |
|---|---|---|
| `T3Prefill.mlmodelc/` | 1.4 GB | CoreML GPT-2 prefill with conditioning |
| `onnx/language_model_single.onnx` | 1.2 GB | ONNX GPT-2 decode with KV cache |
| `onnx/conditional_decoder_single.onnx` | 770 MB | ONNX S3 + vocoder (tokens → waveform) |
| `speech_emb.npy` | 26 MB | Speech token embedding table |
| `default-conds.safetensors` | 161 KB | Default voice profile |
| `tokenizer.json` | 3.4 MB | GPT-2 BPE tokenizer |
## Quick Start (Swift)

```swift
import ChatterboxCoreML

let model = try await ChatterboxCoreMLModel.load(from: modelDirectory)
let audio = try await model.generate("Hello there.")
// audio is an AVAudioPCMBuffer at 24kHz
```
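To hear the result, the returned buffer can be played with a standard AVAudioEngine setup. A sketch assuming `audio` comes from the snippet above (error handling elided; Apple platforms only):

```swift
import AVFoundation

// Play the generated 24kHz buffer through the default output.
let engine = AVAudioEngine()
let player = AVAudioPlayerNode()
engine.attach(player)
engine.connect(player, to: engine.mainMixerNode, format: audio.format)
try engine.start()
player.scheduleBuffer(audio) {
    // Completion handler fires when the buffer finishes playing.
}
player.play()
```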
## Requirements

- macOS 15+ / iOS 18+
- Apple Silicon (M1/M2/M3/M4)
- Swift 6.2+
- Dependencies: onnxruntime-swift-package-manager (1.24+), swift-transformers, swift-safetensors
## Voice Cloning

```swift
let voice = URL(filePath: "path/to/custom-conds.safetensors")
let audio = try await model.generate("Hello there.", voice: voice)
```
## License
MIT (same as Chatterbox Turbo)