KuangWei Chen committed
Commit 6aa02b0 · 1 Parent(s): 8ee35eb

Update MOSS-Audio-Tokenizer-Nano model card

Files changed (1): README.md (+243 −195)

README.md CHANGED
@@ -1,195 +1,243 @@
- ---
- license: apache-2.0
- library_name: transformers
- tags:
- - audio
- - audio-tokenizer
- - neural-codec
- - moss-tts-family
- - MOSS Audio Tokenizer
- - speech-tokenizer
- - trust-remote-code
- ---
-
- # MossAudioTokenizer
-
- This is the code for MOSS-Audio-Tokenizer presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).
-
- **MOSSAudioTokenizer** is a unified discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaled to 1.6 billion parameters, it functions as a unified discrete interface, delivering both lossless-quality reconstruction and high-level semantic alignment.
-
- **Key Features:**
-
- * **Extreme Compression & Variable Bitrate**: It compresses 48kHz stereo audio into a remarkably low frame rate of 12.5Hz. Utilizing a 32-layer Residual LFQ quantizer stack, it supports high-fidelity reconstruction across a wide range of bitrates.
- * **Pure Transformer Architecture**: The model features a "CNN-free" homogeneous architecture built entirely from Causal Transformer blocks. With 1.6B combined parameters (Encoder + Decoder), it ensures exceptional scalability and supports low-latency streaming inference.
- * **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
- * **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces discrete tokens that are "semantic-rich," making them ideal for downstream tasks like speech understanding (ASR) and generation (TTS).
- * **Fully Trained From Scratch**: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models. All representations are learned autonomously from raw data.
- * **End-to-End Joint Optimization**: All components (encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment) are optimized jointly in a single unified training pipeline.
-
- **Summary:**
- By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.
-
- This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository and loaded with `trust_remote_code=True` when needed.
-
- ## Usage
-
- ### Quickstart
-
- ```python
- import torch
- from transformers import AutoModel
- import torchaudio
-
- repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
-
- wav, sr = torchaudio.load('demo/demo_gt.wav')
- if sr != model.sampling_rate:
-     wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
- if wav.shape[0] == 1:
-     wav = wav.repeat(model.config.number_channels, 1)
- else:
-     wav = wav[: model.config.number_channels]
- wav = wav.unsqueeze(0)
- enc = model.encode(wav, return_dict=True)
- print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
- dec = model.decode(enc.audio_codes, return_dict=True)
- print(f"dec.audio.shape: {dec.audio.shape}")
- wav = dec.audio.squeeze(0)
- torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)
-
- # Decode using only the first 8 quantizer layers
- dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
- wav_rvq8 = dec_rvq8.audio.squeeze(0)
- torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
- ```
-
- ### Attention Backend And Compute Dtype
-
- `config.attention_implementation` controls whether transformer layers prefer `sdpa` or `flash_attention_2`.
- `config.compute_dtype` controls the non-quantizer autocast dtype and supports `fp32`, `bf16`, and `fp16`.
-
- ```python
- model.set_attention_implementation("flash_attention_2")
- model.set_compute_dtype("fp16")
- ```
-
- The quantizer always runs in fp32.
-
- ### Streaming
-
- `MossAudioTokenizerModel.encode`, `decode`, `batch_encode`, and `batch_decode` all support streaming through a `chunk_duration` argument.
-
- - `chunk_duration` is expressed in seconds.
- - `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- - Streaming batch inference is supported.
- - The public waveform interface expects stereo inputs shaped `(2, T)` or batched stereo inputs shaped `(B, 2, T)`.
-
- ```python
- import torch
- from transformers import AutoModel
-
- repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
- audio = torch.randn(2, 48000 * 6)  # dummy stereo waveform
-
- # 6.0s @ 48kHz = 288000 samples, divisible by downsample_rate=3840
- enc = model.encode(audio.unsqueeze(0), return_dict=True, chunk_duration=0.08)
- dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
-
- batch_enc = model.batch_encode([audio, audio[:, : 48000 * 3]], chunk_duration=0.08)
- codes_list = [
-     batch_enc.audio_codes[:, i, : batch_enc.audio_codes_lengths[i]]
-     for i in range(batch_enc.audio_codes.shape[1])
- ]
- batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
- ```
-
- #### Continuous Batch Streaming Decode
-
- For decoder-side continuous batching, prefer `batch_decode(..., streaming=True, ...)`.
-
- - The first streaming call may pass `max_batch_size=...`. If it is omitted, the first batch size reserves the fixed-slot decoder budget for that public stream.
- - Same-size calls continue the existing logical rows in order.
- - If a later call is larger, the new rows are admitted by tail append.
- - `finalize_indices` means "decode these rows one last time, then evict them". The indices are interpreted against the pre-call logical order.
- - After a finalize call returns, the next streaming call may use the smaller survivor batch.
- - `reset_stream=True` discards the hidden public streaming state and starts a fresh stream.
-
- Milestone 1 boundaries:
-
- - decode-only continuous batching
- - one active streaming decode state per model instance
- - fixed-slot decoder reservation from `max_batch_size`
- - no encode-side continuous batching
- - no physical compaction of surviving decode slots
- - no multi-session concurrency on one model instance
-
- ```python
- import torch
- from transformers import AutoModel
-
- repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
- num_quantizers = model.config.quantizer_kwargs["num_quantizers"]
-
- codes_a0 = torch.randint(0, 8, (num_quantizers, 2))
- codes_b0 = torch.randint(0, 8, (num_quantizers, 3))
- codes_a1 = torch.randint(0, 8, (num_quantizers, 2))
- codes_b1 = torch.randint(0, 8, (num_quantizers, 2))
- codes_c0 = torch.randint(0, 8, (num_quantizers, 1))
- codes_a2 = torch.randint(0, 8, (num_quantizers, 1))
- codes_b2 = torch.randint(0, 8, (num_quantizers, 2))
- codes_c1 = torch.randint(0, 8, (num_quantizers, 2))
- codes_b3 = torch.randint(0, 8, (num_quantizers, 1))
- codes_c2 = torch.randint(0, 8, (num_quantizers, 1))
-
- # First call reserves 3 fixed decoder slots for A and B.
- out_ab0 = model.batch_decode(
-     [codes_a0, codes_b0],
-     streaming=True,
-     max_batch_size=3,
-     reset_stream=True,
- )
-
- # Same logical rows continue in order; C is a tail append.
- out_abc1 = model.batch_decode(
-     [codes_a1, codes_b1, codes_c0],
-     streaming=True,
- )
-
- # Finalize A against the pre-call logical order. A still decodes in this call,
- # then is evicted immediately afterward.
- out_abc2 = model.batch_decode(
-     [codes_a2, codes_b2, codes_c1],
-     streaming=True,
-     finalize_indices=[0],
- )
-
- # The next call can shrink to the surviving logical rows only.
- out_bc3 = model.batch_decode(
-     [codes_b3, codes_c2],
-     streaming=True,
- )
- ```
-
- ## Repository layout
-
- - `configuration_moss_audio_tokenizer.py`
- - `modeling_moss_audio_tokenizer.py`
- - `__init__.py`
- - `config.json`
- - model weights
-
- ## Citation
-
- If you use this code or results in your paper, please cite our work as:
-
- ```tex
-
- ```
+ ---
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - audio
+ - audio-tokenizer
+ - neural-codec
+ - moss-tts-family
+ - MOSS Audio Tokenizer Nano
+ - speech-tokenizer
+ - trust-remote-code
+ ---
+
+ # MOSS-Audio-Tokenizer-Nano
+
+ This repository contains the Hugging Face remote-code implementation and weights for **MOSS-Audio-Tokenizer-Nano**, the lightweight audio tokenizer used by **MOSS-TTS-Nano**.
+
+ MOSS-Audio-Tokenizer-Nano is a compact discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture from [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934). The checkpoint in this repository has **21,969,664 parameters** (approximately **22M**), making it much smaller than the full-size MOSS-Audio-Tokenizer while preserving the 48 kHz stereo tokenizer interface used by the MOSS-TTS family.
+
+ ## Key Features
+
+ - **Small model size**: approximately **22M parameters**, including about 10.45M encoder parameters, 10.45M decoder parameters, and 1.07M quantizer parameters.
+ - **Native high-resolution audio**: supports **48 kHz** input and output with **2-channel stereo** audio, helping reduce compression loss and improve listening quality.
+ - **Low-frame-rate discrete codes**: compresses 48 kHz stereo audio into a **12.5 Hz** token stream, i.e. 3,840 samples per channel (7,680 stereo samples) per token.
+ - **Variable bitrate reconstruction**: uses a residual quantizer stack with **16 codebooks** and 1,024 entries per codebook. Each codebook contributes about **0.125 kbps**, for an inference range from **0.125 kbps to 2 kbps**.
+ - **Transformer-based tokenizer**: uses causal Transformer blocks and supports low-latency streaming encode/decode.
+ - **MOSS-TTS family interface**: designed as the audio tokenizer backbone for MOSS-TTS-Nano and compatible MOSS-TTS-family workflows.
+
+ **Summary:**
+ By combining a compact causal Transformer tokenizer with native 48 kHz stereo modeling, MOSS-Audio-Tokenizer-Nano reduces the deployment cost of the MOSS audio tokenizer interface while keeping high-fidelity reconstruction for speech, general audio, and music. It provides a lightweight, low-frame-rate, and streaming-friendly discrete audio representation for MOSS-TTS-Nano and other real-time speech generation workflows.
+
+ This repository contains a lightweight remote-code implementation that mirrors the current Hugging Face Transformers `transformers.models.moss_audio_tokenizer` module. Load it with `trust_remote_code=True` when needed.
+
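As a sanity check on the figures above, the 12.5 Hz frame rate and sixteen 1,024-entry codebooks imply the stated bitrates directly. A minimal arithmetic sketch in plain Python, using only the numbers quoted in this card:

```python
import math

# Figures quoted in this model card (assumptions, not read from config.json).
SAMPLE_RATE = 48_000   # Hz
FRAME_RATE = 12.5      # tokens per second per codebook
NUM_CODEBOOKS = 16
CODEBOOK_SIZE = 1024   # entries per codebook

# Each token of each codebook carries log2(1024) = 10 bits.
bits_per_token = int(math.log2(CODEBOOK_SIZE))

# One codebook contributes 12.5 tokens/s * 10 bits = 125 bps (~0.125 kbps).
bps_per_codebook = FRAME_RATE * bits_per_token

# All 16 codebooks together: 2,000 bps = 2 kbps.
full_bps = bps_per_codebook * NUM_CODEBOOKS

# Samples per channel folded into one token: 48,000 / 12.5 = 3,840.
samples_per_token = SAMPLE_RATE / FRAME_RATE

print(bps_per_codebook, full_bps, samples_per_token)  # 125.0 2000.0 3840.0
```

The same relation (bps = 125 × Nvq) reproduces the bitrate column of the MOSS-Audio-Tokenizer-Nano rows in the evaluation table: 6 codebooks give 750 bps, 8 give 1,000 bps, 12 give 1,500 bps, and 16 give 2,000 bps.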
+ ## Evaluation Metrics
+
+ The table below compares the reconstruction quality of MOSS-Audio-Tokenizer-Nano against open-source audio tokenizers with at most **120M parameters** on speech, audio, and music data. MOSS-Audio-Tokenizer-Nano is among the smallest models in the comparison while supporting **48 kHz stereo** reconstruction.
+
+ - Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
+ - Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
+ - STFT-Dist. denotes the STFT distance.
+ - Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
+ - Ch. denotes the number of input/output channels supported by the audio tokenizer: `ch=1` means mono audio, and `ch=2` means stereo audio.
+ - Nvq denotes the number of quantizers.
+
+ | Model | Params (M) | Sample rate | Ch. | bps | Nvq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | **Mimi VAE** | 28 | 24k | 1 | -- | -- | 0.75 / 0.54 | 0.91 / 0.83 | 2.92 / 2.20 | 2.30 / 1.73 | 1.35 / 1.31 | 2.70 / 2.59 |
+ | **DAC** | 77 | 44.1k | 1 | 861 | 1 | 0.30 / 0.20 | 0.76 / 0.68 | 1.55 / 1.36 | 1.24 / 1.15 | 1.25 / 1.18 | 2.71 / 2.54 |
+ | **SpeechTokenizer** | 120 | 16k | 1 | 1000 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
+ | **Mimi** | 96 | 24k | 1 | 1100 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
+ | **MOSS-Audio-Tokenizer-Nano** | 22 | 48k | 2 | 750 | 6 | 0.64 / 0.61 | 0.90 / 0.85 | 2.65 / 2.28 | 2.11 / 1.87 | 1.04 / 1.01 | 2.42 / 2.27 |
+ | **MOSS-Audio-Tokenizer-Nano** | 22 | 48k | 2 | 1000 | 8 | **0.75 / 0.69** | **0.92 / 0.87** | **2.92 / 2.48** | **2.36 / 2.04** | **1.00 / 0.97** | **2.37 / 2.22** |
+ | **EnCodec** | 19 | 48k | 2 | 1500 | 1 | 0.35 / 0.30 | 0.76 / 0.75 | 1.54 / 1.60 | 1.25 / 1.32 | 1.25 / 1.05 | 2.73 / 2.30 |
+ | **SpeechTokenizer** | 120 | 16k | 1 | 1500 | 3 | 0.52 / 0.38 | 0.84 / 0.75 | 2.00 / 1.60 | 1.57 / 1.33 | -- / -- | -- / -- |
+ | **Mimi** | 96 | 24k | 1 | 1512.5 | 11 | 0.82 / 0.67 | 0.92 / 0.88 | 3.10 / 2.50 | 2.54 / 2.00 | 1.19 / 1.14 | 2.55 / 2.42 |
+ | **DAC** | 77 | 44.1k | 1 | 1723 | 2 | 0.57 / 0.47 | 0.86 / 0.80 | 2.21 / 1.85 | 1.74 / 1.49 | 1.03 / 0.99 | 2.43 / 2.26 |
+ | **SpeechTokenizer** | 120 | 16k | 1 | 2000 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
+ | **Mimi** | 96 | 24k | 1 | 2062.5 | 15 | 0.87 / 0.73 | 0.94 / 0.90 | 3.36 / 2.76 | 2.81 / 2.22 | 1.14 / 1.09 | 2.49 / 2.36 |
+ | **MOSS-Audio-Tokenizer-Nano** | 22 | 48k | 2 | 1500 | 12 | 0.84 / 0.77 | 0.94 / 0.90 | 3.25 / 2.77 | 2.71 / 2.31 | 0.95 / 0.91 | 2.31 / 2.14 |
+ | **MOSS-Audio-Tokenizer-Nano** | 22 | 48k | 2 | 2000 | 16 | **0.88 / 0.81** | **0.95 / 0.91** | **3.40 / 2.93** | **2.89 / 2.47** | **0.93 / 0.89** | **2.28 / 2.11** |
+
+ ## Usage
+
+ ### Quickstart
+
+ ```python
+ import torchaudio
+ from transformers import AutoModel
+
+ repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
+
+ wav, sr = torchaudio.load("demo/demo_gt.wav")
+ if sr != model.sampling_rate:
+     wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
+
+ # The public waveform interface expects stereo audio.
+ if wav.shape[0] == 1:
+     wav = wav.repeat(model.config.number_channels, 1)
+ else:
+     wav = wav[: model.config.number_channels]
+
+ wav = wav.unsqueeze(0)
+ enc = model.encode(wav, return_dict=True)
+ print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
+
+ dec = model.decode(enc.audio_codes, return_dict=True)
+ print(f"dec.audio.shape: {dec.audio.shape}")
+
+ wav = dec.audio.squeeze(0)
+ torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)
+
+ # Decode with the first 8 codebooks, roughly 1 kbps.
+ dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
+ wav_rvq8 = dec_rvq8.audio.squeeze(0)
+ torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
+ ```
+
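When budgeting sequence lengths for a downstream language model, the expected number of code frames follows directly from the 12.5 Hz frame rate. A small plain-Python helper; the ceiling rounding is our assumption, and the real encoder's padding behavior may differ slightly at chunk boundaries:

```python
import math

FRAME_RATE = 12.5  # tokens per second per codebook, as stated in this card

def expected_num_frames(duration_seconds: float) -> int:
    """Approximate per-codebook code length at 12.5 Hz (rounding is assumed)."""
    return math.ceil(duration_seconds * FRAME_RATE)

# A 10 s clip maps to 125 frames per codebook; with all 16 codebooks active
# that is 16 * 125 = 2000 discrete indices in total.
print(expected_num_frames(10.0))  # 125
print(expected_num_frames(6.0))   # 75
```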
+ ### Attention Backend And Compute Dtype
+
+ `config.attention_implementation` controls whether Transformer layers prefer `sdpa` or `flash_attention_2`.
+ `config.compute_dtype` controls the non-quantizer autocast dtype and supports `fp32`, `bf16`, and `fp16`.
+
+ ```python
+ model.set_attention_implementation("flash_attention_2")
+ model.set_compute_dtype("fp16")
+ ```
+
+ The quantizer always runs in fp32.
+
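Because `flash_attention_2` depends on the optional `flash-attn` package, a small guard can choose a backend at runtime instead of hard-coding one. A sketch; the `importlib`-based availability check is our own convenience, not part of this repository's API:

```python
from importlib.util import find_spec

def pick_attention_implementation() -> str:
    """Prefer flash_attention_2 when the flash-attn package is importable."""
    return "flash_attention_2" if find_spec("flash_attn") is not None else "sdpa"

impl = pick_attention_implementation()
# model.set_attention_implementation(impl)  # apply to the loaded model
print(impl)
```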
+ ### Streaming
+
+ `MossAudioTokenizerModel.encode`, `decode`, `batch_encode`, and `batch_decode` all support streaming through a `chunk_duration` argument.
+
+ - `chunk_duration` is expressed in seconds.
+ - `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
+ - Streaming batch inference is supported.
+ - The public waveform interface expects stereo inputs shaped `(2, T)` or batched stereo inputs shaped `(B, 2, T)`.
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
+ audio = torch.randn(2, 48000 * 6)  # dummy stereo waveform
+
+ # 6.0s @ 48kHz = 288000 samples, divisible by downsample_rate=3840
+ enc = model.encode(audio.unsqueeze(0), return_dict=True, chunk_duration=0.08)
+ dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
+
+ batch_enc = model.batch_encode([audio, audio[:, : 48000 * 3]], chunk_duration=0.08)
+ codes_list = [
+     batch_enc.audio_codes[:, i, : batch_enc.audio_codes_lengths[i]]
+     for i in range(batch_enc.audio_codes.shape[1])
+ ]
+ batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
+ ```
+
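The divisibility rule above can be validated before calling `encode`. A minimal plain-Python check, taking the 48,000 Hz sample rate and the 3,840-sample `downsample_rate` from the example's comment:

```python
SAMPLING_RATE = 48_000
DOWNSAMPLE_RATE = 3_840  # per-channel samples per token, per the comment above

def is_valid_chunk_duration(chunk_duration: float) -> bool:
    """chunk_duration * sampling_rate must be an exact multiple of downsample_rate."""
    samples = chunk_duration * SAMPLING_RATE
    nearest = round(samples)
    # Tolerate float representation noise (e.g. 0.08 is not exact in binary).
    return abs(samples - nearest) < 1e-6 and nearest % DOWNSAMPLE_RATE == 0

print(is_valid_chunk_duration(0.08))  # True:  0.08 * 48000 = 3840
print(is_valid_chunk_duration(0.16))  # True:  7680 = 2 * 3840
print(is_valid_chunk_duration(0.10))  # False: 4800 is not a multiple of 3840
```

Valid chunk durations are therefore multiples of 0.08 s, i.e. one token (80 ms) of audio per chunk step at minimum.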
+ #### Continuous Batch Streaming Decode
+
+ For decoder-side continuous batching, prefer `batch_decode(..., streaming=True, ...)`.
+
+ - The first streaming call may pass `max_batch_size=...`. If it is omitted, the first batch size reserves the fixed-slot decoder budget for that public stream.
+ - Same-size calls continue the existing logical rows in order.
+ - If a later call is larger, the new rows are admitted by tail append.
+ - `finalize_indices` means "decode these rows one last time, then evict them". The indices are interpreted against the pre-call logical order.
+ - After a finalize call returns, the next streaming call may use the smaller survivor batch.
+ - `reset_stream=True` discards the hidden public streaming state and starts a fresh stream.
+
+ Milestone 1 boundaries:
+
+ - decode-only continuous batching
+ - one active streaming decode state per model instance
+ - fixed-slot decoder reservation from `max_batch_size`
+ - no encode-side continuous batching
+ - no physical compaction of surviving decode slots
+ - no multi-session concurrency on one model instance
+
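The admission and eviction rules above can be illustrated with a dependency-free bookkeeping sketch. This is our own toy model of the documented semantics (fixed-slot reservation, tail append, finalize indices read against the pre-call order, eviction after the final decode), not the repository's internal implementation:

```python
class StreamState:
    """Toy model of logical-row bookkeeping for streaming batch_decode."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size  # fixed-slot decoder budget
        self.rows = []  # surviving logical row labels, in order

    def step(self, labels, finalize_indices=()):
        # Rows beyond the current logical batch are admitted by tail append.
        for label in labels[len(self.rows):]:
            self.rows.append(label)
        if len(self.rows) > self.max_batch_size:
            raise ValueError("exceeds the reserved fixed-slot budget")
        decoded = list(self.rows)  # finalized rows still decode in this call...
        # ...then are evicted, indices interpreted against the pre-call order.
        for i in sorted(finalize_indices, reverse=True):
            del self.rows[i]
        return decoded

state = StreamState(max_batch_size=3)
print(state.step(["A", "B"]))                             # ['A', 'B']
print(state.step(["A", "B", "C"]))                        # ['A', 'B', 'C']
print(state.step(["A", "B", "C"], finalize_indices=[0]))  # ['A', 'B', 'C'], A evicted after
print(state.step(["B", "C"]))                             # ['B', 'C']
```

The four calls mirror the four `batch_decode` calls in the example below the milestone list.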
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
+ num_quantizers = model.config.quantizer_kwargs["num_quantizers"]
+ codebook_size = model.config.quantizer_kwargs["codebook_size"]
+
+ codes_a0 = torch.randint(0, codebook_size, (num_quantizers, 2))
+ codes_b0 = torch.randint(0, codebook_size, (num_quantizers, 3))
+ codes_a1 = torch.randint(0, codebook_size, (num_quantizers, 2))
+ codes_b1 = torch.randint(0, codebook_size, (num_quantizers, 2))
+ codes_c0 = torch.randint(0, codebook_size, (num_quantizers, 1))
+ codes_a2 = torch.randint(0, codebook_size, (num_quantizers, 1))
+ codes_b2 = torch.randint(0, codebook_size, (num_quantizers, 2))
+ codes_c1 = torch.randint(0, codebook_size, (num_quantizers, 2))
+ codes_b3 = torch.randint(0, codebook_size, (num_quantizers, 1))
+ codes_c2 = torch.randint(0, codebook_size, (num_quantizers, 1))
+
+ # First call reserves 3 fixed decoder slots for A and B.
+ out_ab0 = model.batch_decode(
+     [codes_a0, codes_b0],
+     streaming=True,
+     max_batch_size=3,
+     reset_stream=True,
+ )
+
+ # Same logical rows continue in order; C is a tail append.
+ out_abc1 = model.batch_decode(
+     [codes_a1, codes_b1, codes_c0],
+     streaming=True,
+ )
+
+ # Finalize A against the pre-call logical order. A still decodes in this call,
+ # then is evicted immediately afterward.
+ out_abc2 = model.batch_decode(
+     [codes_a2, codes_b2, codes_c1],
+     streaming=True,
+     finalize_indices=[0],
+ )
+
+ # The next call can shrink to the surviving logical rows only.
+ out_bc3 = model.batch_decode(
+     [codes_b3, codes_c2],
+     streaming=True,
+ )
+ ```
+
+ ## Repository Layout
+
+ - `configuration_moss_audio_tokenizer.py`
+ - `modeling_moss_audio_tokenizer.py`
+ - `__init__.py`
+ - `config.json`
+ - model weights
+
+ ## Citation
+
+ If you use this model or code in your work, please cite:
+
+ ```bibtex
+ @misc{gong2026mossttstechnicalreport,
+     title={MOSS-TTS Technical Report},
+     author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},
+     year={2026},
+     eprint={2603.18090},
+     archivePrefix={arXiv},
+     primaryClass={cs.SD},
+     url={https://arxiv.org/abs/2603.18090}
+ }
+ ```
+
+ ```bibtex
+ @misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
+     title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
+     author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
+     year={2026},
+     eprint={2602.10934},
+     archivePrefix={arXiv},
+     primaryClass={cs.SD},
+     url={https://arxiv.org/abs/2602.10934}
+ }
+ ```