Upload MOSS-TTSD NF4 quantized model
This view is limited to 50 files because it contains too many changes.
- .gitattributes +24 -0
- .gitignore +214 -0
- .gitmodules +3 -0
- AMPERE_OPTIMIZATION_PLAN.md +181 -0
- LICENSE +201 -0
- MANIFEST.in +22 -0
- OPTIMIZATION_REPORT.md +296 -0
- README.md +548 -0
- README_zh.md +534 -0
- assets/OpenMOSS_Logo.png +0 -0
- assets/VS_Open-Source_Models.jpg +3 -0
- assets/VS_Proprietary_Models.png +3 -0
- assets/arch_moss_audio_tokenizer.png +3 -0
- assets/archi_delay.png +3 -0
- assets/archi_local.png +3 -0
- assets/audio/reference_02_s1.wav +3 -0
- assets/audio/reference_02_s2.wav +3 -0
- assets/audio/reference_en.m4a +0 -0
- assets/audio/reference_en_0.mp3 +0 -0
- assets/audio/reference_en_1.mp3 +3 -0
- assets/audio/reference_en_2.mp3 +3 -0
- assets/audio/reference_en_3.mp3 +3 -0
- assets/audio/reference_zh.wav +3 -0
- assets/audio/reference_zh_0.wav +3 -0
- assets/audio/reference_zh_1.wav +3 -0
- assets/audio/reference_zh_2.wav +3 -0
- assets/audio/reference_zh_3.mp3 +3 -0
- assets/evaluation_moss_audio_tokenizer.png +3 -0
- assets/mosi-logo.png +0 -0
- assets/moss_tts_family.jpeg +3 -0
- assets/moss_tts_realtime.jpeg +3 -0
- assets/moss_voice_generator_winrate.png +3 -0
- assets/text/moss_tts_example_texts.jsonl +8 -0
- assets/text/moss_voice_generator_example_texts.jsonl +8 -0
- assets/wechat.jpg +0 -0
- benchmark_harness.py +310 -0
- clis/moss_sound_effect_app.py +347 -0
- clis/moss_tts_app.py +621 -0
- clis/moss_ttsd_app.py +811 -0
- clis/moss_voice_generator_app.py +410 -0
- configs/llama_cpp/cpu-only.yaml +45 -0
- configs/llama_cpp/default.yaml +70 -0
- configs/llama_cpp/trt-8gb.yaml +68 -0
- configs/llama_cpp/trt.yaml +54 -0
- docs/moss_sound_effect_model_card.md +142 -0
- docs/moss_tts_model_card.md +427 -0
- docs/moss_tts_realtime_model_card.md +213 -0
- docs/moss_ttsd_model_card.md +250 -0
- docs/moss_voice_generator_model_card.md +161 -0
- moss_tts_delay/README.md +90 -0
.gitattributes
CHANGED
@@ -33,3 +33,27 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/VS_Open-Source_Models.jpg filter=lfs diff=lfs merge=lfs -text
+assets/VS_Proprietary_Models.png filter=lfs diff=lfs merge=lfs -text
+assets/arch_moss_audio_tokenizer.png filter=lfs diff=lfs merge=lfs -text
+assets/archi_delay.png filter=lfs diff=lfs merge=lfs -text
+assets/archi_local.png filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_02_s1.wav filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_02_s2.wav filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_en_1.mp3 filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_en_2.mp3 filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_en_3.mp3 filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_zh.wav filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_zh_0.wav filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_zh_1.wav filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_zh_2.wav filter=lfs diff=lfs merge=lfs -text
+assets/audio/reference_zh_3.mp3 filter=lfs diff=lfs merge=lfs -text
+assets/evaluation_moss_audio_tokenizer.png filter=lfs diff=lfs merge=lfs -text
+assets/moss_tts_family.jpeg filter=lfs diff=lfs merge=lfs -text
+assets/moss_tts_realtime.jpeg filter=lfs diff=lfs merge=lfs -text
+assets/moss_voice_generator_winrate.png filter=lfs diff=lfs merge=lfs -text
+moss_tts_realtime/audio/prompt_audio.mp3 filter=lfs diff=lfs merge=lfs -text
+moss_tts_realtime/audio/prompt_audio1.mp3 filter=lfs diff=lfs merge=lfs -text
+moss_tts_realtime/audio/user1.wav filter=lfs diff=lfs merge=lfs -text
+moss_tts_realtime/audio/user2.wav filter=lfs diff=lfs merge=lfs -text
+output_nf4/0_0.wav filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
|
@@ -0,0 +1,214 @@
| 1 |
+
# Byte-compiled / optimized / DLL files
|
| 2 |
+
__pycache__/
|
| 3 |
+
*.py[codz]
|
| 4 |
+
*$py.class
|
| 5 |
+
|
| 6 |
+
# C extensions
|
| 7 |
+
*.so
|
| 8 |
+
|
| 9 |
+
# Distribution / packaging
|
| 10 |
+
.Python
|
| 11 |
+
build/
|
| 12 |
+
develop-eggs/
|
| 13 |
+
dist/
|
| 14 |
+
downloads/
|
| 15 |
+
eggs/
|
| 16 |
+
.eggs/
|
| 17 |
+
lib/
|
| 18 |
+
lib64/
|
| 19 |
+
parts/
|
| 20 |
+
sdist/
|
| 21 |
+
var/
|
| 22 |
+
wheels/
|
| 23 |
+
share/python-wheels/
|
| 24 |
+
*.egg-info/
|
| 25 |
+
.installed.cfg
|
| 26 |
+
*.egg
|
| 27 |
+
MANIFEST
|
| 28 |
+
|
| 29 |
+
# PyInstaller
|
| 30 |
+
# Usually these files are written by a python script from a template
|
| 31 |
+
# before PyInstaller builds the exe, so as to inject date/other infos into it.
|
| 32 |
+
*.manifest
|
| 33 |
+
*.spec
|
| 34 |
+
|
| 35 |
+
# Installer logs
|
| 36 |
+
pip-log.txt
|
| 37 |
+
pip-delete-this-directory.txt
|
| 38 |
+
|
| 39 |
+
# Unit test / coverage reports
|
| 40 |
+
htmlcov/
|
| 41 |
+
.tox/
|
| 42 |
+
.nox/
|
| 43 |
+
.coverage
|
| 44 |
+
.coverage.*
|
| 45 |
+
.cache
|
| 46 |
+
nosetests.xml
|
| 47 |
+
coverage.xml
|
| 48 |
+
*.cover
|
| 49 |
+
*.py.cover
|
| 50 |
+
.hypothesis/
|
| 51 |
+
.pytest_cache/
|
| 52 |
+
cover/
|
| 53 |
+
|
| 54 |
+
# Translations
|
| 55 |
+
*.mo
|
| 56 |
+
*.pot
|
| 57 |
+
|
| 58 |
+
# Django stuff:
|
| 59 |
+
*.log
|
| 60 |
+
local_settings.py
|
| 61 |
+
db.sqlite3
|
| 62 |
+
db.sqlite3-journal
|
| 63 |
+
|
| 64 |
+
# Flask stuff:
|
| 65 |
+
instance/
|
| 66 |
+
.webassets-cache
|
| 67 |
+
|
| 68 |
+
# Scrapy stuff:
|
| 69 |
+
.scrapy
|
| 70 |
+
|
| 71 |
+
# Sphinx documentation
|
| 72 |
+
docs/_build/
|
| 73 |
+
|
| 74 |
+
# PyBuilder
|
| 75 |
+
.pybuilder/
|
| 76 |
+
target/
|
| 77 |
+
|
| 78 |
+
# Jupyter Notebook
|
| 79 |
+
.ipynb_checkpoints
|
| 80 |
+
|
| 81 |
+
# IPython
|
| 82 |
+
profile_default/
|
| 83 |
+
ipython_config.py
|
| 84 |
+
|
| 85 |
+
# pyenv
|
| 86 |
+
# For a library or package, you might want to ignore these files since the code is
|
| 87 |
+
# intended to run in multiple environments; otherwise, check them in:
|
| 88 |
+
# .python-version
|
| 89 |
+
|
| 90 |
+
# pipenv
|
| 91 |
+
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
|
| 92 |
+
# However, in case of collaboration, if having platform-specific dependencies or dependencies
|
| 93 |
+
# having no cross-platform support, pipenv may install dependencies that don't work, or not
|
| 94 |
+
# install all needed dependencies.
|
| 95 |
+
#Pipfile.lock
|
| 96 |
+
|
| 97 |
+
# UV
|
| 98 |
+
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
|
| 99 |
+
# This is especially recommended for binary packages to ensure reproducibility, and is more
|
| 100 |
+
# commonly ignored for libraries.
|
| 101 |
+
#uv.lock
|
| 102 |
+
|
| 103 |
+
# poetry
|
| 104 |
+
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
|
| 105 |
+
# This is especially recommended for binary packages to ensure reproducibility, and is more
|
| 106 |
+
# commonly ignored for libraries.
|
| 107 |
+
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
|
| 108 |
+
#poetry.lock
|
| 109 |
+
#poetry.toml
|
| 110 |
+
|
| 111 |
+
# pdm
|
| 112 |
+
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
|
| 113 |
+
# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
|
| 114 |
+
# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
|
| 115 |
+
#pdm.lock
|
| 116 |
+
#pdm.toml
|
| 117 |
+
.pdm-python
|
| 118 |
+
.pdm-build/
|
| 119 |
+
|
| 120 |
+
# pixi
|
| 121 |
+
# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
|
| 122 |
+
#pixi.lock
|
| 123 |
+
# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
|
| 124 |
+
# in the .venv directory. It is recommended not to include this directory in version control.
|
| 125 |
+
.pixi
|
| 126 |
+
|
| 127 |
+
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
|
| 128 |
+
__pypackages__/
|
| 129 |
+
|
| 130 |
+
# Celery stuff
|
| 131 |
+
celerybeat-schedule
|
| 132 |
+
celerybeat.pid
|
| 133 |
+
|
| 134 |
+
# SageMath parsed files
|
| 135 |
+
*.sage.py
|
| 136 |
+
|
| 137 |
+
# Environments
|
| 138 |
+
.env
|
| 139 |
+
.envrc
|
| 140 |
+
.venv
|
| 141 |
+
env/
|
| 142 |
+
venv/
|
| 143 |
+
ENV/
|
| 144 |
+
env.bak/
|
| 145 |
+
venv.bak/
|
| 146 |
+
.sii
|
| 147 |
+
.cursor
|
| 148 |
+
.sisyphus
|
| 149 |
+
|
| 150 |
+
# Spyder project settings
|
| 151 |
+
.spyderproject
|
| 152 |
+
.spyproject
|
| 153 |
+
|
| 154 |
+
# Rope project settings
|
| 155 |
+
.ropeproject
|
| 156 |
+
|
| 157 |
+
# mkdocs documentation
|
| 158 |
+
/site
|
| 159 |
+
|
| 160 |
+
# mypy
|
| 161 |
+
.mypy_cache/
|
| 162 |
+
.dmypy.json
|
| 163 |
+
dmypy.json
|
| 164 |
+
|
| 165 |
+
# Pyre type checker
|
| 166 |
+
.pyre/
|
| 167 |
+
|
| 168 |
+
# pytype static type analyzer
|
| 169 |
+
.pytype/
|
| 170 |
+
|
| 171 |
+
# Cython debug symbols
|
| 172 |
+
cython_debug/
|
| 173 |
+
|
| 174 |
+
# PyCharm
|
| 175 |
+
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
|
| 176 |
+
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
|
| 177 |
+
# and can be added to the global gitignore or merged into this file. For a more nuclear
|
| 178 |
+
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
|
| 179 |
+
#.idea/
|
| 180 |
+
|
| 181 |
+
# Abstra
|
| 182 |
+
# Abstra is an AI-powered process automation framework.
|
| 183 |
+
# Ignore directories containing user credentials, local state, and settings.
|
| 184 |
+
# Learn more at https://abstra.io/docs
|
| 185 |
+
.abstra/
|
| 186 |
+
|
| 187 |
+
# Visual Studio Code
|
| 188 |
+
# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
|
| 189 |
+
# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
|
| 190 |
+
# and can be added to the global gitignore or merged into this file. However, if you prefer,
|
| 191 |
+
# you could uncomment the following to ignore the entire vscode folder
|
| 192 |
+
# .vscode/
|
| 193 |
+
|
| 194 |
+
# Ruff stuff:
|
| 195 |
+
.ruff_cache/
|
| 196 |
+
|
| 197 |
+
# PyPI configuration file
|
| 198 |
+
.pypirc
|
| 199 |
+
|
| 200 |
+
# Cursor
|
| 201 |
+
# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
|
| 202 |
+
# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
|
| 203 |
+
# refer to https://docs.cursor.com/context/ignore-files
|
| 204 |
+
.cursorignore
|
| 205 |
+
.cursorindexingignore
|
| 206 |
+
|
| 207 |
+
# Marimo
|
| 208 |
+
marimo/_static/
|
| 209 |
+
marimo/_lsp/
|
| 210 |
+
__marimo__/
|
| 211 |
+
dev/*
|
| 212 |
+
|
| 213 |
+
# Weights
|
| 214 |
+
weights
|
.gitmodules
ADDED
@@ -0,0 +1,3 @@
+[submodule "moss_audio_tokenizer"]
+path = moss_audio_tokenizer
+url = https://github.com/OpenMOSS/MOSS-Audio-Tokenizer
AMPERE_OPTIMIZATION_PLAN.md
ADDED
|
@@ -0,0 +1,181 @@
| 1 |
+
# Ampere Optimization Plan for MOSS-TTS Realtime
|
| 2 |
+
|
| 3 |
+
## Current baseline
|
| 4 |
+
|
| 5 |
+
- Current service runs on an RTX 3060 (12 GB) and holds about 8.5 GB of VRAM at idle with the model loaded.
|
| 6 |
+
- The backbone already uses SDPA plus `torch.compile(mode="default", dynamic=True)`.
|
| 7 |
+
- The local transformer already uses `StaticCache` plus `torch.compile(..., fullgraph=True)`.
|
| 8 |
+
- The OpenAI-compatible API is single-request by default via `MOSS_TTS_MAX_CONCURRENT=1`.
|
| 9 |
+
- The realtime decode bridge still supports only `batch_size=1`, so this is not a batched serving system yet.
|
| 10 |
+
- Multiple RTX 3090 GPUs are currently idle, which makes a GPU move the highest-confidence near-term optimization.
|
| 11 |
+
|
| 12 |
+
## Goals
|
| 13 |
+
|
| 14 |
+
- Reduce time-to-first-byte (TTFB) and real-time factor (RTF) on Ampere.
|
| 15 |
+
- Increase throughput without breaking streaming stability or audio quality.
|
| 16 |
+
- Spend engineering time in the order of highest confidence and lowest risk.
|
| 17 |
+
|
| 18 |
+
## Ranked plan
|
| 19 |
+
|
| 20 |
+
### 1. Move the service to an RTX 3090 and re-benchmark
|
| 21 |
+
|
| 22 |
+
Why this comes first:
|
| 23 |
+
|
| 24 |
+
- It is the easiest change.
|
| 25 |
+
- It has the highest confidence of producing a real improvement.
|
| 26 |
+
- It adds memory headroom for concurrency experiments without changing the model path.
|
| 27 |
+
|
| 28 |
+
Actions:
|
| 29 |
+
|
| 30 |
+
- Launch the service on one idle RTX 3090 by overriding `MOSS_TTS_GPU` in `start_moss_tts.sh` or at launch time.
|
| 31 |
+
- Keep the current SDPA plus compile configuration unchanged for the first comparison.
|
| 32 |
+
- Re-run the same short, medium, and long prompt benchmark set used in the optimization report.
|
| 33 |
+
|
| 34 |
+
Success criteria:
|
| 35 |
+
|
| 36 |
+
- Record TTFB, RTF, startup time, peak VRAM, and steady-state VRAM on the 3090.
|
| 37 |
+
- Keep this move if it gives a measurable latency win or enables higher safe concurrency.
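
For reference, the two latency metrics recorded above can be computed from four timestamps per request. The sketch below is illustrative only: `RequestTiming` and its field names are hypothetical and do not come from the repository, and it assumes RTF is defined as total generation time divided by the duration of the generated audio (values below 1.0 mean faster than real time).

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request record; all values are in seconds."""
    request_start_s: float   # when the request was accepted
    first_byte_s: float      # when the first audio byte was sent
    request_end_s: float     # when the last audio byte was sent
    audio_duration_s: float  # length of the generated audio


def ttfb(timing: RequestTiming) -> float:
    """Time-to-first-byte: delay before any audio reaches the client."""
    return timing.first_byte_s - timing.request_start_s


def rtf(timing: RequestTiming) -> float:
    """Real-time factor: wall-clock generation time per second of audio."""
    if timing.audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    return (timing.request_end_s - timing.request_start_s) / timing.audio_duration_s
```

Keeping both functions on one record makes it easy to log the same fields on the 3060 and the 3090 and compare like for like.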
|
| 38 |
+
|
| 39 |
+
### 2. Profile where time is actually going
|
| 40 |
+
|
| 41 |
+
Why this comes second:
|
| 42 |
+
|
| 43 |
+
- The next bottleneck might be the backbone, the local transformer, codec decode, or Python-side orchestration.
|
| 44 |
+
- Future work should be driven by timing data, not guesses.
|
| 45 |
+
|
| 46 |
+
Actions:
|
| 47 |
+
|
| 48 |
+
- Add timing or profiler markers around:
|
| 49 |
+
- model prefill
|
| 50 |
+
- backbone decode step loop
|
| 51 |
+
- local transformer generation
|
| 52 |
+
- codec decode
|
| 53 |
+
- audio packaging and response streaming
|
| 54 |
+
- Capture timing breakdowns for at least three prompt sizes on both the 3060 and one 3090.
|
| 55 |
+
|
| 56 |
+
Success criteria:
|
| 57 |
+
|
| 58 |
+
- Produce a table showing percentage of end-to-end latency spent in each stage.
|
| 59 |
+
- Use that table to decide whether the next optimization target is the backbone or the codec.
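
A minimal sketch of how the per-stage timing table could be collected. `StageTimer` and the stage names are hypothetical, not existing code in this repository. Note that CUDA kernels launch asynchronously, so for GPU stages the caller would need to synchronize (for example with `torch.cuda.synchronize()`) before each clock read, or use a GPU-aware profiler instead; this sketch measures host wall-clock time only.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock               # injectable so tests can fake time
        self.totals = defaultdict(float)  # stage name -> accumulated seconds

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised.
            self.totals[name] += self._clock() - start

    def percentages(self):
        """Share of total measured time spent in each stage, in percent."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: 100.0 * secs / total for name, secs in self.totals.items()}
```

Usage would wrap each stage, for example `with timer.stage("codec_decode"): ...`, and print `timer.percentages()` once per benchmark prompt.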
|
| 60 |
+
|
| 61 |
+
### 3. Increase throughput with controlled concurrency on the 3090
|
| 62 |
+
|
| 63 |
+
Why this is next:
|
| 64 |
+
|
| 65 |
+
- The code already gates requests with a semaphore.
|
| 66 |
+
- The 3090 has enough memory headroom to test more than one request in flight.
|
| 67 |
+
- This is a simpler win than trying to force true batching into the realtime decode path.
|
| 68 |
+
|
| 69 |
+
Actions:
|
| 70 |
+
|
| 71 |
+
- Keep decode `batch_size=1`.
|
| 72 |
+
- Experiment with `MOSS_TTS_MAX_CONCURRENT=2`, then `3`, while monitoring latency and memory.
|
| 73 |
+
- Measure p50 and p95 TTFB and end-to-end latency under concurrent load.
|
| 74 |
+
|
| 75 |
+
Success criteria:
|
| 76 |
+
|
| 77 |
+
- Increase completed requests per minute without unacceptable p95 latency growth.
|
| 78 |
+
- Do not keep a higher concurrency setting if it causes GPU memory pressure, queue thrash, or unstable streaming.
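
The concurrency experiment can be sketched as a semaphore gate plus percentile reporting. This is an illustration under assumptions: `run_with_limit` and `p50_p95` are hypothetical helpers, and the actual gating in the service may differ. The only library behavior relied on is `asyncio.Semaphore` and `statistics.quantiles` from the Python standard library.

```python
import asyncio
import statistics


async def run_with_limit(jobs, max_concurrent):
    """Run zero-argument coroutine functions with at most max_concurrent in flight.

    Returns (results, peak_in_flight) so a load test can confirm the gate held.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return list(results), peak


def p50_p95(samples):
    """Median and 95th percentile of latency samples (needs at least two)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

Running the same job list with `max_concurrent` set to 1, 2, and 3 and comparing the `p50_p95` output mirrors the experiment described above.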
|
| 79 |
+
|
| 80 |
+
### 4. Wire the codec to ONNX or TensorRT only if profiling shows it matters
|
| 81 |
+
|
| 82 |
+
Why this is conditional:
|
| 83 |
+
|
| 84 |
+
- The repo already contains ONNX and TensorRT codec backends.
|
| 85 |
+
- The realtime path does not appear to use them today.
|
| 86 |
+
- This is worthwhile only if codec decode is a meaningful share of total latency.
|
| 87 |
+
|
| 88 |
+
Actions:
|
| 89 |
+
|
| 90 |
+
- If profiling shows codec decode is a major bottleneck, add backend selection to the realtime serving path.
|
| 91 |
+
- Export or build the codec backend for the target Ampere GPU.
|
| 92 |
+
- Compare PyTorch, ONNX, and TensorRT codec latency and audio parity.
|
| 93 |
+
|
| 94 |
+
Success criteria:
|
| 95 |
+
|
| 96 |
+
- Keep the integration only if it provides a meaningful end-to-end win or raises safe concurrency.
|
| 97 |
+
- Validate that streaming chunk boundaries and audio quality remain acceptable.
|
| 98 |
+
|
| 99 |
+
### 5. Treat CUDA graphs as a selective experiment, not a blanket recommendation
|
| 100 |
+
|
| 101 |
+
Why this is not earlier:
|
| 102 |
+
|
| 103 |
+
- `torch.compile(mode="reduce-overhead")` relies on CUDA graphs.
|
| 104 |
+
- CUDA graphs prefer stable shapes and stable memory addresses.
|
| 105 |
+
- The realtime backbone has growing KV cache and variable request lengths, so a global switch is not automatically a win.
|
| 106 |
+
|
| 107 |
+
Actions:
|
| 108 |
+
|
| 109 |
+
- If needed, add shape bucketing for common request sizes before trying broader CUDA-graph capture.
|
| 110 |
+
- Test `reduce-overhead` only on stable-shape regions or after bucketing proves shape churn is limited.
|
| 111 |
+
- Record recompilation behavior, graph re-recording frequency, and memory overhead before adopting it.
|
| 112 |
+
|
| 113 |
+
Success criteria:
|
| 114 |
+
|
| 115 |
+
- Keep it only if it improves real end-to-end latency under realistic traffic, not just microbenchmarks.
|
| 116 |
+
- Reject it if dynamic-shape behavior causes graph churn, memory bloat, or operational fragility.
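
Shape bucketing, as mentioned in the actions above, means rounding each request length up to one of a few fixed sizes so that only a small set of shapes is ever captured. The sketch below is an assumption-laden illustration: the bucket sizes, `bucket_length`, and `pad_to_bucket` are hypothetical, and a real integration would also need matching attention masks and a decision about padding side, neither of which is shown.

```python
import bisect

# Hypothetical bucket sizes; real values should come from observed traffic.
DEFAULT_BUCKETS = (64, 128, 256, 512, 1024)


def bucket_length(length, buckets=DEFAULT_BUCKETS):
    """Return the smallest bucket >= length, or None if no bucket fits."""
    if length <= 0:
        raise ValueError("length must be positive")
    idx = bisect.bisect_left(buckets, length)
    if idx == len(buckets):
        return None  # too long: caller falls back to the dynamic-shape path
    return buckets[idx]


def pad_to_bucket(token_ids, pad_id, buckets=DEFAULT_BUCKETS):
    """Right-pad token_ids to its bucket size; leave oversize inputs unchanged."""
    target = bucket_length(len(token_ids), buckets)
    if target is None:
        return list(token_ids)
    return list(token_ids) + [pad_id] * (target - len(token_ids))
```

Logging how often `bucket_length` returns each size (or `None`) gives a direct measure of the shape churn that the plan asks to quantify before enabling `reduce-overhead`.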
|
| 117 |
+
|
| 118 |
+
### 6. Treat quantization as a VRAM and throughput project, not an assumed speed win
|
| 119 |
+
|
| 120 |
+
Why this is later:
|
| 121 |
+
|
| 122 |
+
- The current repo does not appear to have backbone quantized serving wired into the realtime path.
|
| 123 |
+
- Quantization can help memory a lot, but speedups for autoregressive decode depend heavily on kernels and runtime support.
|
| 124 |
+
- It also adds model-quality risk.
|
| 125 |
+
|
| 126 |
+
Actions:
|
| 127 |
+
|
| 128 |
+
- Consider INT8 or INT4 only after phases 1 through 5 are measured.
|
| 129 |
+
- Frame the objective as one of:
|
| 130 |
+
- fitting more concurrent requests on one GPU
|
| 131 |
+
- reducing VRAM enough to colocate more services
|
| 132 |
+
- improving throughput if the chosen backend actually accelerates decode
|
| 133 |
+
- Add listening tests and latency benchmarks before adopting any quantized path.
|
| 134 |
+
|
| 135 |
+
Success criteria:
|
| 136 |
+
|
| 137 |
+
- Keep it only if audio quality remains acceptable and measured throughput improves in practice.
|
| 138 |
+
|
| 139 |
+
### 7. Defer high-risk research work until the simpler wins are exhausted
|
| 140 |
+
|
| 141 |
+
This includes:
|
| 142 |
+
|
| 143 |
+
- speculative decoding
|
| 144 |
+
- overlapping codec work on separate CUDA streams
|
| 145 |
+
- major architecture changes to the streaming pipeline
|
| 146 |
+
|
| 147 |
+
Why this is last:
|
| 148 |
+
|
| 149 |
+
- These ideas may help, but they are not low-risk or quick-return changes.
|
| 150 |
+
- They should only start after the system has a stable benchmark harness and clear bottleneck data.
|
| 151 |
+
|
| 152 |
+
## Decision gates
|
| 153 |
+
|
| 154 |
+
At each phase, record:
|
| 155 |
+
|
| 156 |
+
- p50 TTFB
|
| 157 |
+
- p95 TTFB
|
| 158 |
+
- end-to-end latency
|
| 159 |
+
- RTF
|
| 160 |
+
- peak VRAM
|
| 161 |
+
- steady-state VRAM
|
| 162 |
+
- startup and warmup time
|
| 163 |
+
- requests per minute at a fixed latency target
|
| 164 |
+
- basic audio quality checks
|
| 165 |
+
|
| 166 |
+
Do not move to the next invasive phase unless the previous phase is measured and documented.
|
| 167 |
+
|
| 168 |
+
## Immediate next steps
|
| 169 |
+
|
| 170 |
+
1. Move the service to one idle RTX 3090 and rerun the existing benchmark set.
|
| 171 |
+
2. Add timing breakdowns to the realtime path.
|
| 172 |
+
3. Decide whether the next engineering target is concurrency tuning or codec acceleration based on those numbers.
|
| 173 |
+
|
| 174 |
+
## Relevant code paths
|
| 175 |
+
|
| 176 |
+
- `start_moss_tts.sh`
|
| 177 |
+
- `moss_tts_realtime/app.py`
|
| 178 |
+
- `moss_tts_realtime/openai_api.py`
|
| 179 |
+
- `moss_tts_realtime/mossttsrealtime/streaming_mossttsrealtime.py`
|
| 180 |
+
- `moss_audio_tokenizer/onnx/inference.py`
|
| 181 |
+
- `moss_audio_tokenizer/trt/inference.py`
|
LICENSE
ADDED
|
@@ -0,0 +1,201 @@
| 1 |
+
Apache License
|
| 2 |
+
Version 2.0, January 2004
|
| 3 |
+
http://www.apache.org/licenses/
|
| 4 |
+
|
| 5 |
+
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
| 6 |
+
|
| 7 |
+
1. Definitions.
|
| 8 |
+
|
| 9 |
+
"License" shall mean the terms and conditions for use, reproduction,
|
| 10 |
+
and distribution as defined by Sections 1 through 9 of this document.
|
| 11 |
+
|
| 12 |
+
"Licensor" shall mean the copyright owner or entity authorized by
|
| 13 |
+
the copyright owner that is granting the License.
|
| 14 |
+
|
| 15 |
+
"Legal Entity" shall mean the union of the acting entity and all
|
| 16 |
+
other entities that control, are controlled by, or are under common
|
| 17 |
+
control with that entity. For the purposes of this definition,
|
| 18 |
+
"control" means (i) the power, direct or indirect, to cause the
|
| 19 |
+
direction or management of such entity, whether by contract or
|
| 20 |
+
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
| 21 |
+
outstanding shares, or (iii) beneficial ownership of such entity.
|
| 22 |
+
|
| 23 |
+
"You" (or "Your") shall mean an individual or Legal Entity
|
| 24 |
+
exercising permissions granted by this License.
|
| 25 |
+
|
| 26 |
+
"Source" form shall mean the preferred form for making modifications,
|
| 27 |
+
including but not limited to software source code, documentation
|
| 28 |
+
source, and configuration files.
|
| 29 |
+
|
| 30 |
+
"Object" form shall mean any form resulting from mechanical
|
| 31 |
+
transformation or translation of a Source form, including but
|
| 32 |
+
not limited to compiled object code, generated documentation,
|
| 33 |
+
and conversions to other media types.
|
| 34 |
+
|
| 35 |
+
"Work" shall mean the work of authorship, whether in Source or
|
| 36 |
+
Object form, made available under the License, as indicated by a
|
| 37 |
+
copyright notice that is included in or attached to the work
|
| 38 |
+
(an example is provided in the Appendix below).
|
| 39 |
+
|
| 40 |
+
"Derivative Works" shall mean any work, whether in Source or Object
|
| 41 |
+
form, that is based on (or derived from) the Work and for which the
|
| 42 |
+
editorial revisions, annotations, elaborations, or other modifications
|
| 43 |
+
represent, as a whole, an original work of authorship. For the purposes
|
| 44 |
+
of this License, Derivative Works shall not include works that remain
|
| 45 |
+
separable from, or merely link (or bind by name) to the interfaces of,
|
| 46 |
+
the Work and Derivative Works thereof.
|
| 47 |
+
|
| 48 |
+
"Contribution" shall mean any work of authorship, including
|
| 49 |
+
the original version of the Work and any modifications or additions
|
| 50 |
+
to that Work or Derivative Works thereof, that is intentionally
|
| 51 |
+
submitted to Licensor for inclusion in the Work by the copyright owner
|
| 52 |
+
or by an individual or Legal Entity authorized to submit on behalf of
|
| 53 |
+
the copyright owner. For the purposes of this definition, "submitted"
|
| 54 |
+
means any form of electronic, verbal, or written communication sent
|
| 55 |
+
to the Licensor or its representatives, including but not limited to
|
| 56 |
+
communication on electronic mailing lists, source code control systems,
|
| 57 |
+
and issue tracking systems that are managed by, or on behalf of, the
|
| 58 |
+
Licensor for the purpose of discussing and improving the Work, but
|
| 59 |
+
excluding communication that is conspicuously marked or otherwise
|
| 60 |
+
designated in writing by the copyright owner as "Not a Contribution."
|
| 61 |
+
|
| 62 |
+
"Contributor" shall mean Licensor and any individual or Legal Entity
|
| 63 |
+
on behalf of whom a Contribution has been received by Licensor and
|
| 64 |
+
subsequently incorporated within the Work.
|
| 65 |
+
|
| 66 |
+
2. Grant of Copyright License. Subject to the terms and conditions of
|
| 67 |
+
this License, each Contributor hereby grants to You a perpetual,
|
| 68 |
+
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
| 69 |
+
copyright license to reproduce, prepare Derivative Works of,
|
| 70 |
+
publicly display, publicly perform, sublicense, and distribute the
|
| 71 |
+
Work and such Derivative Works in Source or Object form.
|
| 72 |
+
|
| 73 |
+
3. Grant of Patent License. Subject to the terms and conditions of
|
| 74 |
+
this License, each Contributor hereby grants to You a perpetual,
|
| 75 |
+
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
| 76 |
+
(except as stated in this section) patent license to make, have made,
|
| 77 |
+
use, offer to sell, sell, import, and otherwise transfer the Work,
|
| 78 |
+
where such license applies only to those patent claims licensable
|
| 79 |
+
by such Contributor that are necessarily infringed by their
|
| 80 |
+
Contribution(s) alone or by combination of their Contribution(s)
|
| 81 |
+
with the Work to which such Contribution(s) was submitted. If You
|
| 82 |
+
institute patent litigation against any entity (including a
|
| 83 |
+
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
| 84 |
+
or a Contribution incorporated within the Work constitutes direct
|
| 85 |
+
or contributory patent infringement, then any patent licenses
|
| 86 |
+
granted to You under this License for that Work shall terminate
|
| 87 |
+
as of the date such litigation is filed.
|
| 88 |
+
|
| 89 |
+
4. Redistribution. You may reproduce and distribute copies of the
|
| 90 |
+
Work or Derivative Works thereof in any medium, with or without
|
| 91 |
+
modifications, and in Source or Object form, provided that You
|
| 92 |
+
meet the following conditions:
|
| 93 |
+
|
| 94 |
+
(a) You must give any other recipients of the Work or
|
| 95 |
+
Derivative Works a copy of this License; and
|
| 96 |
+
|
| 97 |
+
(b) You must cause any modified files to carry prominent notices
|
| 98 |
+
stating that You changed the files; and
|
| 99 |
+
|
| 100 |
+
(c) You must retain, in the Source form of any Derivative Works
|
| 101 |
+
that You distribute, all copyright, patent, trademark, and
|
| 102 |
+
attribution notices from the Source form of the Work,
|
| 103 |
+
excluding those notices that do not pertain to any part of
|
| 104 |
+
the Derivative Works; and
|
| 105 |
+
|
| 106 |
+
(d) If the Work includes a "NOTICE" text file as part of its
|
| 107 |
+
distribution, then any Derivative Works that You distribute must
|
| 108 |
+
include a readable copy of the attribution notices contained
|
| 109 |
+
within such NOTICE file, excluding those notices that do not
|
| 110 |
+
pertain to any part of the Derivative Works, in at least one
|
| 111 |
+
of the following places: within a NOTICE text file distributed
|
| 112 |
+
as part of the Derivative Works; within the Source form or
|
| 113 |
+
documentation, if provided along with the Derivative Works; or,
|
| 114 |
+
within a display generated by the Derivative Works, if and
|
| 115 |
+
wherever such third-party notices normally appear. The contents
|
| 116 |
+
of the NOTICE file are for informational purposes only and
|
| 117 |
+
do not modify the License. You may add Your own attribution
|
| 118 |
+
notices within Derivative Works that You distribute, alongside
|
| 119 |
+
or as an addendum to the NOTICE text from the Work, provided
|
| 120 |
+
that such additional attribution notices cannot be construed
|
| 121 |
+
as modifying the License.
|
| 122 |
+
|
| 123 |
+
You may add Your own copyright statement to Your modifications and
|
| 124 |
+
may provide additional or different license terms and conditions
|
| 125 |
+
for use, reproduction, or distribution of Your modifications, or
|
| 126 |
+
for any such Derivative Works as a whole, provided Your use,
|
| 127 |
+
reproduction, and distribution of the Work otherwise complies with
|
| 128 |
+
the conditions stated in this License.
|
| 129 |
+
|
| 130 |
+
5. Submission of Contributions. Unless You explicitly state otherwise,
|
| 131 |
+
any Contribution intentionally submitted for inclusion in the Work
|
| 132 |
+
by You to the Licensor shall be under the terms and conditions of
|
| 133 |
+
this License, without any additional terms or conditions.
|
| 134 |
+
Notwithstanding the above, nothing herein shall supersede or modify
|
| 135 |
+
the terms of any separate license agreement you may have executed
|
| 136 |
+
with Licensor regarding such Contributions.
|
| 137 |
+
|
| 138 |
+
6. Trademarks. This License does not grant permission to use the trade
|
| 139 |
+
names, trademarks, service marks, or product names of the Licensor,
|
| 140 |
+
except as required for reasonable and customary use in describing the
|
| 141 |
+
origin of the Work and reproducing the content of the NOTICE file.
|
| 142 |
+
|
| 143 |
+
7. Disclaimer of Warranty. Unless required by applicable law or
|
| 144 |
+
agreed to in writing, Licensor provides the Work (and each
|
| 145 |
+
Contributor provides its Contributions) on an "AS IS" BASIS,
|
| 146 |
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
| 147 |
+
implied, including, without limitation, any warranties or conditions
|
| 148 |
+
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
| 149 |
+
PARTICULAR PURPOSE. You are solely responsible for determining the
|
| 150 |
+
appropriateness of using or redistributing the Work and assume any
|
| 151 |
+
risks associated with Your exercise of permissions under this License.
|
| 152 |
+
|
| 153 |
+
8. Limitation of Liability. In no event and under no legal theory,
|
| 154 |
+
whether in tort (including negligence), contract, or otherwise,
|
| 155 |
+
unless required by applicable law (such as deliberate and grossly
|
| 156 |
+
negligent acts) or agreed to in writing, shall any Contributor be
|
| 157 |
+
liable to You for damages, including any direct, indirect, special,
|
| 158 |
+
incidental, or consequential damages of any character arising as a
|
| 159 |
+
result of this License or out of the use or inability to use the
|
| 160 |
+
Work (including but not limited to damages for loss of goodwill,
|
| 161 |
+
work stoppage, computer failure or malfunction, or any and all
|
| 162 |
+
other commercial damages or losses), even if such Contributor
|
| 163 |
+
has been advised of the possibility of such damages.
|
| 164 |
+
|
| 165 |
+
9. Accepting Warranty or Additional Liability. While redistributing
|
| 166 |
+
the Work or Derivative Works thereof, You may choose to offer,
|
| 167 |
+
and charge a fee for, acceptance of support, warranty, indemnity,
|
| 168 |
+
or other liability obligations and/or rights consistent with this
|
| 169 |
+
License. However, in accepting such obligations, You may act only
|
| 170 |
+
on Your own behalf and on Your sole responsibility, not on behalf
|
| 171 |
+
of any other Contributor, and only if You agree to indemnify,
|
| 172 |
+
defend, and hold each Contributor harmless for any liability
|
| 173 |
+
incurred by, or claims asserted against, such Contributor by reason
|
| 174 |
+
of your accepting any such warranty or additional liability.
|
| 175 |
+
|
| 176 |
+
END OF TERMS AND CONDITIONS
|
| 177 |
+
|
| 178 |
+
APPENDIX: How to apply the Apache License to your work.
|
| 179 |
+
|
| 180 |
+
To apply the Apache License to your work, attach the following
|
| 181 |
+
boilerplate notice, with the fields enclosed by brackets "[]"
|
| 182 |
+
replaced with your own identifying information. (Don't include
|
| 183 |
+
the brackets!) The text should be enclosed in the appropriate
|
| 184 |
+
comment syntax for the file format. We also recommend that a
|
| 185 |
+
file or class name and description of purpose be included on the
|
| 186 |
+
same "printed page" as the copyright notice for easier
|
| 187 |
+
identification within third-party archives.
|
| 188 |
+
|
| 189 |
+
Copyright 2026 OpenMOSS Team, Fudan University, SII and MOSI
|
| 190 |
+
|
| 191 |
+
Licensed under the Apache License, Version 2.0 (the "License");
|
| 192 |
+
you may not use this file except in compliance with the License.
|
| 193 |
+
You may obtain a copy of the License at
|
| 194 |
+
|
| 195 |
+
http://www.apache.org/licenses/LICENSE-2.0
|
| 196 |
+
|
| 197 |
+
Unless required by applicable law or agreed to in writing, software
|
| 198 |
+
distributed under the License is distributed on an "AS IS" BASIS,
|
| 199 |
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 200 |
+
See the License for the specific language governing permissions and
|
| 201 |
+
limitations under the License.
|
MANIFEST.in
ADDED
|
@@ -0,0 +1,22 @@
|
| 1 |
+
include README.md
|
| 2 |
+
include README_zh.md
|
| 3 |
+
include LICENSE
|
| 4 |
+
include pyproject.toml
|
| 5 |
+
|
| 6 |
+
graft assets
|
| 7 |
+
graft docs
|
| 8 |
+
graft moss_tts_delay
|
| 9 |
+
graft moss_tts_local
|
| 10 |
+
graft moss_tts_realtime
|
| 11 |
+
graft moss_audio_tokenizer
|
| 12 |
+
|
| 13 |
+
prune .git
|
| 14 |
+
prune .github
|
| 15 |
+
prune .vscode
|
| 16 |
+
prune moss_tts.egg-info
|
| 17 |
+
|
| 18 |
+
global-exclude __pycache__
|
| 19 |
+
global-exclude __pycache__/*
|
| 20 |
+
global-exclude *.py[cod]
|
| 21 |
+
global-exclude .DS_Store
|
| 22 |
+
global-exclude *.so
|
OPTIMIZATION_REPORT.md
ADDED
|
@@ -0,0 +1,296 @@
|
| 1 |
+
# MOSS-TTS Realtime — RTX 3090 Performance Optimization Report
|
| 2 |
+
|
| 3 |
+
**Date**: 2026-03-13
|
| 4 |
+
**GPU**: NVIDIA GeForce RTX 3090 (24 GB, Ampere SM 8.6)
|
| 5 |
+
**Stack**: PyTorch 2.9.1+cu128, transformers 5.0.0, flash-attn 2.8.3, triton 3.5.1
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## Executive Summary
|
| 10 |
+
|
| 11 |
+
Optimized the MOSS-TTS-Realtime OpenAI-compatible server for maximum throughput
|
| 12 |
+
and minimum latency on the RTX 3090. Through four targeted changes to the
|
| 13 |
+
inference pipeline, achieved:
|
| 14 |
+
|
| 15 |
+
| Metric | Before | After | Improvement |
|
| 16 |
+
|------------|---------|---------|-------------|
|
| 17 |
+
| **RTF** | 0.8044 | 0.3352 | **−58%** (2.4× throughput) |
|
| 18 |
+
| **TTFB** | 1557 ms | 586 ms | **−62%** (2.7× faster first chunk) |
|
| 19 |
+
| **Min RTF**| 0.7858 | 0.3073 | Best-case 3.3× realtime |
|
| 20 |
+
|
| 21 |
+
40.5 seconds of audio generated in 13.4 seconds (wall-clock) across 7 test
|
| 22 |
+
sentences of varying length (12–202 characters).
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
## Architecture Overview
|
| 27 |
+
|
| 28 |
+
```
|
| 29 |
+
┌─────────────────────────────────────────────────────────────┐
|
| 30 |
+
│ OpenAI API (FastAPI + uvicorn) :8012 │
|
| 31 |
+
│ POST /v1/audio/speech │
|
| 32 |
+
├─────────────────────────────────────────────────────────────┤
|
| 33 |
+
│ Streaming Pipeline │
|
| 34 |
+
│ TokenChunkStream → StreamingSession → AudioStreamDecoder │
|
| 35 |
+
├─────────────────────────────────────────────────────────────┤
|
| 36 |
+
│ Backbone Transformer (Qwen3-1.7B, 28 layers) │
|
| 37 |
+
│ 2048 hidden, 16 heads, 8 KV heads, bfloat16 │
|
| 38 |
+
│ → Produces hidden states per token step │
|
| 39 |
+
├─────────────────────────────────────────────────────────────┤
|
| 40 |
+
│ Local Transformer (4 layers, 2048 hidden) │
|
| 41 |
+
│ 16 sequential RVQ codebook steps per backbone step │
|
| 42 |
+
│ → Produces 16-channel audio tokens │
|
| 43 |
+
├─────────────────────────────────────────────────────────────┤
|
| 44 |
+
│ MOSS Audio Tokenizer (codec) │
|
| 45 |
+
│ Decodes RVQ tokens → 24 kHz waveform │
|
| 46 |
+
│ Streaming mode with crossfade overlap │
|
| 47 |
+
└─────────────────────────────────────────────────────────────┘
|
| 48 |
+
```
|
| 49 |
+
|
| 50 |
+
---
|
| 51 |
+
|
| 52 |
+
## Bottlenecks Identified & Fixes Applied
|
| 53 |
+
|
| 54 |
+
### 1. Local Transformer: DynamicCache + No Compilation (HIGH IMPACT)
|
| 55 |
+
|
| 56 |
+
**Problem**: When the backbone used `flash_attention_2`, the local transformer
|
| 57 |
+
inherited the same attention implementation. This forced `DynamicCache` and
|
| 58 |
+
**disabled `torch.compile`** for the 16-step RVQ decode loop — the innermost
|
| 59 |
+
hot loop of the entire pipeline.
|
| 60 |
+
|
| 61 |
+
The local transformer processes sequences of only 16 tokens (one per RVQ
|
| 62 |
+
codebook). Flash attention provides negligible benefit at this length, while
|
| 63 |
+
`StaticCache` + `torch.compile(fullgraph=True)` can fuse the entire loop into
|
| 64 |
+
optimized Triton kernels.
|
| 65 |
+
|
| 66 |
+
**Fix** (`streaming_mossttsrealtime.py`, `inferencer.py`):
|
| 67 |
+
- Override local transformer config `_attn_implementation` → `"sdpa"`
|
| 68 |
+
- Force `StaticCache(max_cache_len=16)` regardless of backbone attention
|
| 69 |
+
- Enable `torch.compile(fullgraph=True)` for the local transformer always
|
| 70 |
+
|
| 71 |
+
```python
|
| 72 |
+
# Before: cache/compile depended on backbone attn
|
| 73 |
+
self._use_dynamic_local_cache = attn_impl == "flash_attention_2" # True → no compile
|
| 74 |
+
self._should_compile_local_transformer = not self._use_dynamic_local_cache # False
|
| 75 |
+
|
| 76 |
+
# After: always StaticCache + compile for local transformer
|
| 77 |
+
local_cfg._attn_implementation = "sdpa"
|
| 78 |
+
self._use_dynamic_local_cache = False
|
| 79 |
+
self._should_compile_local_transformer = True
|
| 80 |
+
```
|
| 81 |
+
|
| 82 |
+
### 2. Backbone: SDPA + torch.compile (HIGH IMPACT)
|
| 83 |
+
|
| 84 |
+
**Problem**: The backbone Qwen3 language model ran uncompiled. Additionally,
|
| 85 |
+
`flash_attention_2` prevented `torch.compile` from generating fused Triton
|
| 86 |
+
kernels for the attention + MLP blocks.
|
| 87 |
+
|
| 88 |
+
**Finding**: Benchmarking revealed that on the RTX 3090 (Ampere),
|
| 89 |
+
`SDPA + torch.compile(dynamic=True)` dramatically outperforms
|
| 90 |
+
`flash_attention_2` for typical TTS sequence lengths:
|
| 91 |
+
|
| 92 |
+
| Config | RTF | TTFB |
|
| 93 |
+
|-------------------------------|-------|---------|
|
| 94 |
+
| flash_attn2, no compile | 0.805 | 1557 ms |
|
| 95 |
+
| flash_attn2 + compile | 0.490 | 950 ms |
|
| 96 |
+
| **SDPA + compile (dynamic)** | **0.335** | **586 ms** |
|
| 97 |
+
|
| 98 |
+
**Fix** (`app.py`):
|
| 99 |
+
- Switch default attention to `"sdpa"` (auto-detected)
|
| 100 |
+
- Add `torch.compile(mode="default", dynamic=True)` for the backbone
|
| 101 |
+
- `dynamic=True` prevents shape-triggered recompilation as KV cache grows
|
| 102 |
+
|
| 103 |
+
```python
|
| 104 |
+
model.language_model = torch.compile(
|
| 105 |
+
model.language_model,
|
| 106 |
+
mode="default",
|
| 107 |
+
fullgraph=False,
|
| 108 |
+
dynamic=True,
|
| 109 |
+
)
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
### 3. No Startup Warmup (HIGH TTFB IMPACT)
|
| 113 |
+
|
| 114 |
+
**Problem**: `WARMUP_ON_START` defaulted to `false`. The first user request
|
| 115 |
+
suffered a 30–70 second cold-start penalty while `torch.compile` generated
|
| 116 |
+
and cached Triton kernels.
|
| 117 |
+
|
| 118 |
+
**Fix** (`openai_api.py`):
|
| 119 |
+
- Default `WARMUP_ON_START` to `true`
|
| 120 |
+
- Run 2 warmup generations with different text lengths during `lifespan()`
|
| 121 |
+
- Server starts accepting traffic only after caches are hot (~2.5 min startup)
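
The warmup-before-ready pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `openai_api.py`: `fake_generate`, `WARMUP_TEXTS`, and the `state` dict are hypothetical stand-ins, and a real FastAPI `lifespan` hook receives the app object rather than a dict.

```python
import asyncio
from contextlib import asynccontextmanager

# Hypothetical warmup prompts of different lengths, so that more than one
# input shape is exercised before real traffic arrives.
WARMUP_TEXTS = ["Hello.", "This is a longer warmup sentence for the compiled model."]


async def fake_generate(text: str) -> int:
    """Stand-in for the real TTS call; returns a dummy sample count."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    return len(text)


@asynccontextmanager
async def lifespan(state: dict, generate=fake_generate, warmup: bool = True):
    """Run the warmup generations, then mark the server as ready."""
    if warmup:
        for text in WARMUP_TEXTS:
            await generate(text)
            state["warmups"] = state.get("warmups", 0) + 1
    state["ready"] = True
    try:
        yield state
    finally:
        state["ready"] = False  # cleared again on shutdown


async def main() -> dict:
    state: dict = {}
    async with lifespan(state) as s:
        snapshot = dict(s)  # what the state looks like while serving
    return snapshot


result = asyncio.run(main())
```

Running the warmups inside the startup hook means the compile cost is paid before the first request rather than during it, at the price of a longer startup.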
|
| 122 |
+
|
| 123 |
+
### 4. CUDA Runtime Enhancements (LOW-MEDIUM IMPACT)
|
| 124 |
+
|
| 125 |
+
**Fix** (`app.py`):
|
| 126 |
+
- Increased `torch._dynamo.config.cache_size_limit` from 64 → 128
|
| 127 |
+
- Added `torch._dynamo.config.suppress_errors = True` for graceful fallback
|
| 128 |
+
- Added CUDA memory pool pre-allocation (`set_per_process_memory_fraction`)
|
| 129 |
+
|
| 130 |
+
---
|
| 131 |
+
|
| 132 |
+
## Why SDPA Beats Flash Attention on 3090
|
| 133 |
+
|
| 134 |
+
Flash Attention 2 is a hand-written CUDA kernel optimized for long sequences
|
| 135 |
+
(>512 tokens). It bypasses PyTorch's operator fusion.
|
| 136 |
+
|
| 137 |
+
SDPA (`torch.nn.functional.scaled_dot_product_attention`) is a PyTorch-native
|
| 138 |
+
op that `torch.compile` can fuse with surrounding operations (layernorm, MLP,
|
| 139 |
+
residual connections) into a single Triton kernel graph.
|
| 140 |
+
|
| 141 |
+
For MOSS-TTS-Realtime:
|
| 142 |
+
- **Backbone decode step**: 1 token input, KV cache grows to ~100–400 tokens
|
| 143 |
+
- **Local transformer**: 16 tokens total
|
| 144 |
+
|
| 145 |
+
At these lengths, the kernel launch overhead of flash_attn's separate CUDA
|
| 146 |
+
kernel dominates over its memory-access advantage. `torch.compile` + SDPA
|
| 147 |
+
generates fused Triton kernels that remove most of this overhead.
|
| 148 |
+
|
| 149 |
+
---
|
| 150 |
+
|
| 151 |
+
## Benchmark Details
|
| 152 |
+
|
| 153 |
+
### Test Configuration
|
| 154 |
+
- 7 sentences, 12–202 characters
|
| 155 |
+
- Voice preset: alloy (prompt audio cloning)
|
| 156 |
+
- Response format: WAV (no codec encoding overhead)
|
| 157 |
+
- All requests sequential (no concurrency)
|
| 158 |
+
|
| 159 |
+
### Before (Baseline)
|
| 160 |
+
```
|
| 161 |
+
flash_attention_2, no torch.compile, no warmup
|
| 162 |
+
[1] RTF=0.8213 TTFB=1579ms audio=5.56s chars=79
|
| 163 |
+
[2] RTF=0.7858 TTFB=1539ms audio=6.35s chars=97
|
| 164 |
+
[3] RTF=0.8061 TTFB=1553ms audio=7.41s chars=113
|
| 165 |
+
Avg: RTF=0.8044 TTFB=1557ms
|
| 166 |
+
```
|
| 167 |
+
|
| 168 |
+
### After (Optimized)
|
| 169 |
+
```
|
| 170 |
+
SDPA + torch.compile(dynamic=True), StaticCache local transformer, warmup
|
| 171 |
+
[1] RTF=0.3578 TTFB=455ms audio=1.28s chars=12
|
| 172 |
+
[2] RTF=0.3182 TTFB=623ms audio=2.71s chars=32
|
| 173 |
+
[3] RTF=0.3073 TTFB=596ms audio=7.28s chars=79
|
| 174 |
+
[4] RTF=0.4287 TTFB=590ms audio=6.15s chars=97
|
| 175 |
+
[5] RTF=0.3091 TTFB=623ms audio=6.15s chars=113
|
| 176 |
+
[6] RTF=0.3128 TTFB=602ms audio=7.55s chars=144
|
| 177 |
+
[7] RTF=0.3126 TTFB=616ms audio=9.40s chars=202
|
| 178 |
+
Avg: RTF=0.3352 TTFB=586ms
|
| 179 |
+
```
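
The summary figures can be recomputed from the logs above. The sketch below copies the per-request numbers from the "After (Optimized)" block and the baseline averages from the "Before" block; it assumes the usual definition of RTF as generation time divided by audio duration, which matches the "3.3× realtime" reading of the minimum RTF.

```python
from statistics import mean

# (RTF, TTFB in ms, audio seconds) copied from the "After (Optimized)" log.
after = [
    (0.3578, 455, 1.28),
    (0.3182, 623, 2.71),
    (0.3073, 596, 7.28),
    (0.4287, 590, 6.15),
    (0.3091, 623, 6.15),
    (0.3128, 602, 7.55),
    (0.3126, 616, 9.40),
]
baseline_rtf, baseline_ttfb = 0.8044, 1557  # averages from the baseline log

avg_rtf = mean(r for r, _, _ in after)
avg_ttfb = mean(t for _, t, _ in after)
total_audio = sum(a for _, _, a in after)

# Lower RTF is better, so baseline / optimized gives the throughput multiple.
speedup = baseline_rtf / avg_rtf
ttfb_reduction = 1 - avg_ttfb / baseline_ttfb

print(f"avg RTF={avg_rtf:.4f}  avg TTFB={avg_ttfb:.0f} ms")
print(f"total audio={total_audio:.2f} s  speedup={speedup:.2f}x  TTFB cut={ttfb_reduction:.0%}")
```

Note that the baseline log lists three requests while the optimized log lists seven, so the comparison is between averages over different sample sets.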
|
| 180 |
+
|
| 181 |
+
---
|
| 182 |
+
|
| 183 |
+
## Files Modified
|
| 184 |
+
|
| 185 |
+
| File | Changes |
|
| 186 |
+
|------|---------|
|
| 187 |
+
| `moss_tts_realtime/mossttsrealtime/streaming_mossttsrealtime.py` | Force local transformer → SDPA + StaticCache + torch.compile |
|
| 188 |
+
| `moss_tts_realtime/inferencer.py` | Same local transformer fix (standalone inference path) |
|
| 189 |
+
| `moss_tts_realtime/app.py` | Backbone torch.compile, CUDA runtime enhancements, dynamo config |
|
| 190 |
+
| `moss_tts_realtime/openai_api.py` | Default SDPA, warmup on start, multi-shape warmup |
|
| 191 |
+
|
| 192 |
+
---
|
| 193 |
+
|
| 194 |
+
## Server Configuration
|
| 195 |
+
|
| 196 |
+
The optimized server runs at `http://127.0.0.1:8012` with:
|
| 197 |
+
- `MOSS_TTS_ATTN_IMPLEMENTATION=auto` (resolves to `sdpa`)
|
| 198 |
+
- `MOSS_TTS_WARMUP_ON_START=true` (default)
|
| 199 |
+
- `MOSS_TTS_COMPILE_BACKBONE=true` (default)
|
| 200 |
+
- Startup time: ~2.5 minutes (model load + warmup compilation)
|
| 201 |
+
|
| 202 |
+
To force flash_attention_2 (if needed for other hardware):
|
| 203 |
+
```bash
|
| 204 |
+
export MOSS_TTS_ATTN_IMPLEMENTATION=flash_attention_2
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
To disable backbone compilation (if stability issues arise):
|
| 208 |
+
```bash
|
| 209 |
+
export MOSS_TTS_COMPILE_BACKBONE=false
|
| 210 |
+
```
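
For illustration, the three environment variables above could be resolved as follows. This is a hedged sketch, not the server's actual parsing code: `env_flag`, `load_settings`, and the set of accepted truthy spellings are assumptions; only the variable names, their defaults, and the `auto` → `sdpa` mapping come from this report.

```python
import os
from typing import Mapping

# Assumed truthy spellings; the real server may accept a different set.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean variable; an unset variable yields `default`."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve the settings listed in this section from an environment mapping."""
    requested = env.get("MOSS_TTS_ATTN_IMPLEMENTATION", "auto")
    return {
        # "auto" resolves to SDPA, as described in this report.
        "attn_implementation": "sdpa" if requested == "auto" else requested,
        "warmup_on_start": env_flag(env, "MOSS_TTS_WARMUP_ON_START", True),
        "compile_backbone": env_flag(env, "MOSS_TTS_COMPILE_BACKBONE", True),
    }


defaults = load_settings({})
overridden = load_settings(
    {
        "MOSS_TTS_ATTN_IMPLEMENTATION": "flash_attention_2",
        "MOSS_TTS_COMPILE_BACKBONE": "false",
    }
)
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.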
|
| 211 |
+
|
| 212 |
+
---
|
| 213 |
+
|
| 214 |
+
## Remaining Optimization Opportunities
|
| 215 |
+
|
| 216 |
+
1. **ONNX/TensorRT codec decode**: The audio tokenizer codec runs as PyTorch.
|
| 217 |
+
ONNX or TRT backends exist in `moss_audio_tokenizer/` but aren't wired
|
| 218 |
+
into the streaming path. Could reduce decode latency.
|
| 219 |
+
|
| 220 |
+
2. **CUDA graphs for fixed-shape decode**: If inputs were padded to discrete
|
| 221 |
+
bucket sizes, `reduce-overhead` mode could capture full CUDA graphs for
|
| 222 |
+
near-zero kernel launch overhead.
|
| 223 |
+
|
| 224 |
+
3. **Quantized backbone**: The backbone runs in bfloat16. INT8/INT4
|
| 225 |
+
quantization (via GPTQ, AWQ, or torch.ao) could further reduce compute.
|
| 226 |
+
|
| 227 |
+
4. **Speculative decoding**: Pre-generate multiple RVQ frames speculatively
|
| 228 |
+
and verify — could reduce effective RTF further.
|
| 229 |
+
|
| 230 |
+
5. **Multi-stream codec decode**: Overlap codec decoding with backbone
|
| 231 |
+
generation using separate CUDA streams.
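
As an illustration of item 2, padding inputs to discrete bucket sizes might look like the sketch below. The bucket sizes, `bucket_length`, and `pad_to_bucket` are hypothetical; nothing like this exists in the repository yet, and real buckets would be tuned to the observed sequence lengths.

```python
from bisect import bisect_left

# Hypothetical bucket sizes, sorted ascending.
BUCKETS = (64, 128, 256, 512)


def bucket_length(n: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket >= n, or raise if n exceeds the largest."""
    i = bisect_left(buckets, n)
    if i == len(buckets):
        raise ValueError(f"length {n} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(tokens: list[int], pad_id: int = 0, buckets=BUCKETS) -> list[int]:
    """Right-pad `tokens` with `pad_id` up to the next bucket size."""
    target = bucket_length(len(tokens), buckets)
    return tokens + [pad_id] * (target - len(tokens))


padded = pad_to_bucket(list(range(100)))
```

With only a handful of distinct shapes, each shape could be captured once and replayed, which is the precondition for CUDA graph capture.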
|
| 232 |
+
|
| 233 |
+
---
|
| 234 |
+
|
| 235 |
+
## RTX 3060 Deployment (2026-03-13 Update)
|
| 236 |
+
|
| 237 |
+
### Goal
|
| 238 |
+
Run the optimized server on an RTX 3060 (12 GB VRAM) alongside existing 3090 workloads.
|
| 239 |
+
|
| 240 |
+
### VRAM Challenge & Fix
|
| 241 |
+
|
| 242 |
+
| Component | float32 | bfloat16 | float16 |
|
| 243 |
+
|-----------|---------|----------|---------|
|
| 244 |
+
| MOSS-TTS-Realtime backbone | — | 4,400 MiB | — |
|
| 245 |
+
| MOSS-Audio-Tokenizer codec | 6,860 MiB | ❌ unsupported¹ | 3,430 MiB |
|
| 246 |
+
| **Total on 3060** | > 12 GB ❌ | — | **~9,200 MiB ✅** |
|
| 247 |
+
|
| 248 |
+
¹ The codec's custom CUDA kernels raise "Got unsupported ScalarType BFloat16" — float16 is the only viable half-precision dtype.
|
| 249 |
+
|
| 250 |
+
**Fix**: `_load_codec()` in `app.py` now explicitly loads the codec with `torch_dtype=torch.float16` and wraps it in `_BF16CodecWrapper` (autocast float16 context). The backbone remains in bfloat16.
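
A quick check of the VRAM budget. The per-component sizes come from the table above and the 9,174 MiB figure is the usage reported for the RTX 3060; treating the remainder as fixed runtime overhead (CUDA context, activations, compile caches) is an assumption made here for illustration, since the report does not break it down.

```python
# All values in MiB.
BACKBONE_BF16 = 4400
CODEC_FP32 = 6860
CODEC_FP16 = 3430
MEASURED_TOTAL = 9174   # reported usage on the RTX 3060 with the float16 codec
CARD_TOTAL = 12288      # 12 GB card

weights_fp16 = BACKBONE_BF16 + CODEC_FP16
# Assumption: everything beyond the weights is runtime overhead that would
# stay roughly the same if the codec were loaded in float32 instead.
overhead = MEASURED_TOTAL - weights_fp16
est_fp32_total = BACKBONE_BF16 + CODEC_FP32 + overhead
headroom_fp16 = CARD_TOTAL - MEASURED_TOTAL

print(f"weights (bf16 + fp16 codec): {weights_fp16} MiB")
print(f"implied overhead:            {overhead} MiB")
print(f"estimated fp32-codec total:  {est_fp32_total} MiB (card: {CARD_TOTAL} MiB)")
print(f"headroom with fp16 codec:    {headroom_fp16} MiB")
```

Under that assumption the float32 codec would not fit even though the weights alone sum to less than 12 GB, which is consistent with the "> 12 GB" entry in the table.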
|
| 251 |
+
|
| 252 |
+
### Results on RTX 3060
|
| 253 |
+
|
| 254 |
+
- **VRAM used**: 9,174 MiB / 12,288 MiB
|
| 255 |
+
- **Server startup**: ~90 seconds (model load + torch.compile + 2× warmup)
|
| 256 |
+
- **Inference latency** (warm, SDPA + torch.compile):
|
| 257 |
+
- Short sentence (20 chars): ~1.8 s total
|
| 258 |
+
- Medium sentence (87 chars): ~3.0–4.7 s total
|
| 259 |
+
- **Health check**: `GET /health` → `{"status":"ok","device":"cuda:0","attn_implementation":"sdpa"}`
|
| 260 |
+
|
| 261 |
+
### Deployment Infrastructure
|
| 262 |
+
|
| 263 |
+
**Shell launcher** (`start_moss_tts.sh` — default launch method):
|
| 264 |
+
```bash
|
| 265 |
+
./start_moss_tts.sh
|
| 266 |
+
# or with overrides:
|
| 267 |
+
MOSS_TTS_GPU=GPU-other MOSS_TTS_PORT=8013 ./start_moss_tts.sh
|
| 268 |
+
```
|
| 269 |
+
|
| 270 |
+
**Systemd service** (auto-start on boot):
|
| 271 |
+
```bash
|
| 272 |
+
sudo systemctl enable --now moss-tts.service
|
| 273 |
+
sudo journalctl -u moss-tts -f
|
| 274 |
+
```
|
| 275 |
+
|
| 276 |
+
Service configuration (`/etc/systemd/system/moss-tts.service`):
|
| 277 |
+
- GPU: `CUDA_VISIBLE_DEVICES=GPU-cbfc8a5f-0df1-ca71-f704-0d09a707d2ac` (RTX 3060)
|
| 278 |
+
- All optimizations enabled: `MOSS_TTS_COMPILE_BACKBONE=true`, `MOSS_TTS_WARMUP_ON_START=true`
|
| 279 |
+
- Restart on failure with 10s backoff, 300s startup timeout
|
| 280 |
+
|
| 281 |
+
**API endpoint** (OpenAI-compatible):
|
| 282 |
+
```
|
| 283 |
+
POST http://0.0.0.0:8012/v1/audio/speech
|
| 284 |
+
GET http://0.0.0.0:8012/v1/audio/models
|
| 285 |
+
GET http://0.0.0.0:8012/v1/audio/voices
|
| 286 |
+
GET http://0.0.0.0:8012/health
|
| 287 |
+
```
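
A request to the speech endpoint can be built with the standard library, as in the sketch below. The payload fields follow the OpenAI speech API (`model`, `input`, `voice`, `response_format`); the model name `moss-tts-realtime` and the helper `build_speech_request` are hypothetical, and the exact set of fields this server accepts is an assumption.

```python
import json
from urllib import request

BASE_URL = "http://127.0.0.1:8012"


def build_speech_request(
    text: str,
    voice: str = "alloy",
    response_format: str = "wav",
    model: str = "moss-tts-realtime",  # hypothetical model name
) -> request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request("Hello from MOSS-TTS.")

# To send it against a running server and save the audio:
# with request.urlopen(req) as resp, open("out.wav", "wb") as f:
#     f.write(resp.read())
```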
|
| 288 |
+
|
| 289 |
+
### Git History
|
| 290 |
+
|
| 291 |
+
| Commit | Description |
|
| 292 |
+
|--------|-------------|
|
| 293 |
+
| `6a89aff` | feat: RTX 3090 performance optimizations (flash_attn, torch.compile, SDPA, warmup) |
|
| 294 |
+
| `2f0b604` | fix: use float16 for codec to support RTX 3060, add launcher script |
|
| 295 |
+
|
| 296 |
+
Repository: https://github.com/groxaxo/MOSS-TTS
|
README.md
ADDED
|
@@ -0,0 +1,548 @@
|
| 1 |
+
# MOSS-TTS Family
|
| 2 |
+
|
| 3 |
+
<br>
|
| 4 |
+
|
| 5 |
+
<p align="center">
|
| 6 |
+
<img src="./assets/OpenMOSS_Logo.png" height="70" align="middle" />
|
| 7 |
+
|
| 8 |
+
<img src="./assets/mosi-logo.png" height="50" align="middle" />
|
| 9 |
+
</p>
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
<div align="center">
|
| 15 |
+
<a href="https://clawhub.ai/luogao2333/moss-tts-voice"><img src="https://img.shields.io/badge/🦞_OpenClaw-Skills-8A2BE2" alt="OpenClaw"></a>
|
| 16 |
+
<a href="https://huggingface.co/collections/OpenMOSS-Team/moss-tts"><img src="https://img.shields.io/badge/Huggingface-Models-orange?logo=huggingface&"></a>
|
| 17 |
+
<a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&"></a>
|
| 18 |
+
<a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&"></a>
|
| 19 |
+
<a href="https://github.com/OpenMOSS/MOSS-TTS"><img src="https://img.shields.io/badge/Arxiv-Coming%20soon-red?logo=arxiv&"></a>
|
| 20 |
+
|
| 21 |
+
<a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&"></a>
|
| 22 |
+
<a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&"></a>
|
| 23 |
+
<a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&"></a>
|
| 24 |
+
<a href="https://discord.gg/fvm5TaWjU3"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&"></a>
|
| 25 |
+
<a href="./assets/wechat.jpg"><img src="https://img.shields.io/badge/WeChat-Join-07C160?logo=wechat&logoColor=white" alt="WeChat"></a>
|
| 26 |
+
</div>
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
[English](README.md) | [简体中文](README_zh.md)
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
MOSS‑TTS Family is an open‑source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is designed for **high‑fidelity**, **high‑expressiveness**, and **complex real‑world scenarios**, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.
|
| 33 |
+
|
| 34 |
+
## News
|
| 35 |
+
* 2026.3.10: ⚡️ Significantly reduced the VRAM usage of the llama.cpp inference pipeline. The 8B model now fits on 8 GB GPUs!
|
| 36 |
+
* 2026.3.4: 🚀 Added **PyTorch-free inference support** — enabling lightweight on-device deployment via **llama.cpp + ONNX Runtime**. Quantized **GGUF weights** are released at [OpenMOSS-Team/MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF), and the **ONNX audio tokenizer** is available at [OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX). See the [llama.cpp backend](#llamacpp-backend-torch-free-inference) for details.
|
| 37 |
+
* 2026.3.4: 🎉 We added MOSS-TTS skills to [ClawHub](https://clawhub.ai) for 🦞 OpenClaw: [feishu-voice-tts](https://clawhub.ai/helloeveryworlds/feishu-voice-tts) and [moss-tts-voice](https://clawhub.ai/luogao2333/moss-tts-voice).
|
| 38 |
+
* 2026.2.10: 🎉🎉🎉 We have released the [MOSS-TTS Family](https://huggingface.co/collections/OpenMOSS-Team/moss-tts). Check our [Blog](https://mosi.cn/#models) for more details! Our **Hugging Face Spaces** are here: [MOSS-TTS](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS), [MOSS-TTSD-v1.0](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTSD-v1.0), [MOSS-VoiceGenerator](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-VoiceGenerator).
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
## Demo
|
| 42 |
+
|
| 43 |
+
<div align="center">
|
| 44 |
+
<video src="https://gist.github.com/user-attachments/assets/fdce9f66-20ec-45e8-9615-89606ae2fbe8" width="70%" poster=""> </video>
|
| 45 |
+
</div>
|
| 46 |
+
|
| 47 |
+
## Contents
|
| 48 |
+
|
| 49 |
+
- [Introduction](#introduction)
|
| 50 |
+
- [Model Architecture](#model-architecture)
|
| 51 |
+
- [Released Models](#released-models)
|
| 52 |
+
- [Supported Languages](#supported-languages)
|
| 53 |
+
- [Quickstart](#quickstart)
|
| 54 |
+
- [OpenClaw API Skills](#openclaw-api-skills)
|
| 55 |
+
- [Environment Setup](#environment-setup)
|
| 56 |
+
- [(Optional) Install FlashAttention 2](#optional-install-flashattention-2)
|
| 57 |
+
- [MOSS-TTS Basic Usage](#moss-tts-basic-usage)
|
| 58 |
+
- [Fine-Tuning](#fine-tuning)
|
| 59 |
+
- [llama.cpp Backend (Torch-Free Inference)](#llamacpp-backend-torch-free-inference)
|
| 60 |
+
- [Evaluation](#evaluation)
|
| 61 |
+
- [MOSS-TTS](#moss-tts-seed-tts-eval)
|
| 62 |
+
- [MOSS-TTSD](#moss-ttsd-subjective--ttsd-eval)
|
| 63 |
+
- [MOSS-VoiceGenerator](#moss-voicegenerator-subjective)
|
| 64 |
+
- [MOSS-Audio-Tokenizer](#moss-audio-tokenizer)
|
| 65 |
+
- [Introduction](#mat-intro)
|
| 66 |
+
- [Model Weights](#model-weights)
|
| 67 |
+
- [Objective Reconstruction Evaluation](#objective-reconstruction-evaluation)
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
## Introduction
|
| 71 |
+
|
| 72 |
+
<p align="center">
|
| 73 |
+
<img src="./assets/moss_tts_family.jpeg" width="85%" />
|
| 74 |
+
</p>
|
| 75 |
+
|
| 76 |
+
When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.
|
| 77 |
+
|
| 78 |
+
- **MOSS‑TTS**: The flagship production model featuring high fidelity and optimal zero-shot voice cloning. It supports **long-speech generation**, **fine-grained control over Pinyin, phonemes, and duration**, as well as **multilingual/code-switched synthesis**.
|
| 79 |
+
- **MOSS‑TTSD**: A spoken dialogue generation model for expressive, multi-speaker, and ultra-long dialogues. The new **v1.0 version** achieves **industry-leading performance on objective metrics** and **outperforms top closed-source models like Doubao and Gemini 2.5-pro** in subjective evaluations. See the [MOSS-TTSD repository](https://github.com/OpenMOSS/MOSS-TTSD) for details.
|
| 80 |
+
- **MOSS‑VoiceGenerator**: An open-source voice design model capable of generating diverse voices and styles directly from text prompts, **without any reference speech**. It unifies voice design, style control, and synthesis, functioning independently or as a design layer for downstream TTS. Its performance **surpasses other top-tier voice design models in arena ratings**.
|
| 81 |
+
- **MOSS‑TTS‑Realtime**: A multi-turn context-aware model for real-time voice agents. It uses incremental synthesis to ensure natural and coherent replies, making it **ideal for building low-latency voice agents when paired with text models**.
|
| 82 |
+
- **MOSS‑SoundEffect**: A content creation model specialized in **sound effect generation** with wide category coverage and controllable duration. It generates audio for natural environments, urban scenes, biological sounds, human actions, and musical fragments, suitable for film, games, and interactive experiences.
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
## Model Architecture
|
| 86 |
+
|
| 87 |
+
We train **MossTTSDelay** and **MossTTSLocal** as complementary baselines under one training/evaluation setup: **Delay** emphasizes long-context stability, inference speed, and production readiness, while **Local** emphasizes lightweight flexibility and strong objective performance for streaming-oriented systems. Together they provide reproducible references for deployment and research.
|
| 88 |
+
|
| 89 |
+
**MossTTSRealtime** is not a third comparison baseline; it is a capability-driven design for voice agents. By modeling multi-turn context from both prior text and user acoustics, it delivers low-latency streaming speech that stays coherent and voice-consistent across turns.
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
| Architecture | Core Mechanism | Arch Details |
|---|---|---|
| `MossTTSDelay` | Multi‑head parallel RVQ prediction with delay‑pattern scheduling | [moss_tts_delay/README.md](moss_tts_delay/README.md) |
| `MossTTSLocal` | Time‑synchronous RVQ blocks with a depth transformer | [moss_tts_local/README.md](moss_tts_local/README.md) |
| `MossTTSRealtime` | Hierarchical text–audio inputs for realtime synthesis | [moss_tts_realtime/README.md](moss_tts_realtime/README.md) |
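
As a rough illustration of what delay‑pattern scheduling means for RVQ tokens, the sketch below shifts codebook `k` to the right by `k` steps, so that one token per codebook can be predicted in parallel at every step. The one‑step‑per‑codebook offset, the `PAD` value, and the helper names `apply_delay` / `undo_delay` are assumptions made for this example (the generic delay pattern used by codec language models), not the actual `MossTTSDelay` implementation; see `moss_tts_delay/README.md` for the real design.

```python
PAD = -1  # hypothetical padding id, chosen only for this illustration


def apply_delay(codes: list[list[int]], pad: int = PAD) -> list[list[int]]:
    """Shift codebook k right by k steps; codes[k][t] is codebook k at frame t."""
    n_codebooks = len(codes)
    n_frames = len(codes[0])
    total_steps = n_frames + n_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        left = [pad] * k
        right = [pad] * (total_steps - n_frames - k)
        delayed.append(left + list(row) + right)
    return delayed


def undo_delay(delayed: list[list[int]], n_frames: int) -> list[list[int]]:
    """Invert apply_delay by slicing each codebook back to its own frames."""
    return [row[k : k + n_frames] for k, row in enumerate(delayed)]


codes = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
delayed = apply_delay(codes)
for row in delayed:
    print(row)
```

Each column of `delayed` is what a model with parallel heads would emit in one step; `undo_delay` recovers the time‑aligned codes before they are passed to the audio decoder.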
|
| 97 |
+
|
| 98 |
+
## Released Models
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
| Model | Architecture | Size | Model Card | Hugging Face | ModelScope |
|---|---|---:|---|---|---|
| **MOSS-TTS** | `MossTTSDelay` | 8B | [moss_tts_model_card.md](docs/moss_tts_model_card.md) | [OpenMOSS-Team/MOSS-TTS](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) | [openmoss/MOSS-TTS](https://modelscope.cn/models/openmoss/MOSS-TTS) |
| | `MossTTSLocal` | 1.7B | [moss_tts_model_card.md](docs/moss_tts_model_card.md) | [OpenMOSS-Team/MOSS-TTS-Local-Transformer](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) | [openmoss/MOSS-TTS-Local-Transformer](https://modelscope.cn/models/openmoss/MOSS-TTS-Local-Transformer) |
| **MOSS‑TTSD‑V1.0** | `MossTTSDelay` | 8B | [moss_ttsd_model_card.md](docs/moss_ttsd_model_card.md) | [OpenMOSS-Team/MOSS-TTSD-v1.0](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) | [openmoss/MOSS-TTSD-v1.0](https://modelscope.cn/models/openmoss/MOSS-TTSD-v1.0) |
| **MOSS‑VoiceGenerator** | `MossTTSDelay` | 1.7B | [moss_voice_generator_model_card.md](docs/moss_voice_generator_model_card.md) | [OpenMOSS-Team/MOSS-VoiceGenerator](https://huggingface.co/OpenMOSS-Team/MOSS-VoiceGenerator) | [openmoss/MOSS-VoiceGenerator](https://modelscope.cn/models/openmoss/MOSS-VoiceGenerator) |
| **MOSS‑SoundEffect** | `MossTTSDelay` | 8B | [moss_sound_effect_model_card.md](docs/moss_sound_effect_model_card.md) | [OpenMOSS-Team/MOSS-SoundEffect](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) | [openmoss/MOSS-SoundEffect](https://modelscope.cn/models/openmoss/MOSS-SoundEffect) |
| **MOSS‑TTS‑Realtime** | `MossTTSRealtime` | 1.7B | [moss_tts_realtime_model_card.md](docs/moss_tts_realtime_model_card.md) | [OpenMOSS-Team/MOSS-TTS-Realtime](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) | [openmoss/MOSS-TTS-Realtime](https://modelscope.cn/models/openmoss/MOSS-TTS-Realtime) |
|
| 109 |
+
|
| 110 |
+
## Supported Languages
|
| 111 |
+
|
| 112 |
+
MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime currently support **20 languages**:
|
| 113 |
+
|
| 114 |
+
| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
| Italian | it | 🇮🇹 | Hungarian | hu | 🇭🇺 | Korean | ko | 🇰🇷 |
| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | | | |
| Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 | | | |
|
| 123 |
+
|
| 124 |
+
|
| 125 |
+
## Quickstart
|
| 126 |
+
|
| 127 |
+
### OpenClaw API Skills
|
| 128 |
+
|
| 129 |
+
We have published MOSS-TTS skills on [ClawHub](https://clawhub.ai) for 🦞 OpenClaw. You can get an API key from [MOSI AI Studio](https://studio.mosi.cn).
|
| 130 |
+
|
| 131 |
+
| Skill | Description | Install |
|---|---|---|
| [`feishu-voice-tts`](https://clawhub.ai/helloeveryworlds/feishu-voice-tts) | Send voice messages in Feishu | `clawhub install feishu-voice-tts` |
| [`moss-tts-voice`](https://clawhub.ai/luogao2333/moss-tts-voice) | Call MOSS-TTS API to generate speech | `clawhub install moss-tts-voice` |
|
| 135 |
+
|
| 136 |
+
### Environment Setup
|
| 137 |
+
|
| 138 |
+
We recommend a clean, isolated Python environment with **Transformers 5.0.0** to avoid dependency conflicts.
|
| 139 |
+
|
| 140 |
+
#### Using Conda
|
| 141 |
+
|
| 142 |
+
```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```
|
| 146 |
+
|
| 147 |
+
Install all required dependencies:
|
| 148 |
+
|
| 149 |
+
```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```
|
| 154 |
+
|
| 155 |
+
#### Using `uv`
|
| 156 |
+
|
| 157 |
+
```bash
# Install uv first: https://docs.astral.sh/uv/getting-started/installation/
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install --torch-backend cu128 -e ".[torch-runtime]"
```
|
| 165 |
+
|
| 166 |
+
#### (Optional) Install FlashAttention 2
|
| 167 |
+
|
| 168 |
+
For better speed and lower GPU memory usage, you can install FlashAttention 2 if your hardware supports it.
|
| 169 |
+
|
| 170 |
+
If you use Conda/pip:
|
| 171 |
+
|
| 172 |
+
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
```
|
| 175 |
+
|
| 176 |
+
If your machine has limited RAM and many CPU cores, you can cap build parallelism:
|
| 177 |
+
|
| 178 |
+
```bash
MAX_JOBS=4 pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
```
|
| 181 |
+
|
| 182 |
+
If you use `uv`:
|
| 183 |
+
|
| 184 |
+
```bash
uv pip install --torch-backend cu128 -e ".[torch-runtime,flash-attn]"
```
|
| 187 |
+
|
| 188 |
+
If your machine has limited RAM and many CPU cores, you can cap build parallelism:
|
| 189 |
+
|
| 190 |
+
```bash
MAX_JOBS=4 uv pip install --torch-backend cu128 -e ".[torch-runtime,flash-attn]"
```
|
| 193 |
+
|
| 194 |
+
Notes:
|
| 195 |
+
- Dependencies are managed in `pyproject.toml`, which currently pins `torch==2.9.1+cu128` and `torchaudio==2.9.1+cu128`.
|
| 196 |
+
- In `uv`, `--torch-backend cu128` lets uv fetch compatible PyTorch CUDA wheels and resolve the rest from PyPI with the default safe index strategy.
|
| 197 |
+
- If you need another backend, replace `cu128` with your target (for example, `cpu`, `cu126`).
|
| 198 |
+
- If FlashAttention 2 fails to build on your machine, you can skip it and use the default attention backend.
|
| 199 |
+
- FlashAttention 2 is only available on supported GPUs and is typically used with `torch.float16` or `torch.bfloat16`.
|
| 200 |
+
|
| 201 |
+
#### OpenAI-compatible FastAPI server for Open WebUI
|
| 202 |
+
|
| 203 |
+
The realtime model now includes a FastAPI server with OpenAI-style TTS endpoints:
|
| 204 |
+
|
| 205 |
+
```bash
conda activate moss-tts
cd MOSS-TTS
MOSS_TTS_DEVICE=cuda:0 moss-tts-realtime-openai
```
|
| 210 |
+
|
| 211 |
+
Available routes:
|
| 212 |
+
|
| 213 |
+
- `POST /v1/audio/speech`
|
| 214 |
+
- `GET /v1/audio/models`
|
| 215 |
+
- `GET /v1/audio/voices`
|
| 216 |
+
|
| 217 |
+
Compatibility aliases are also exposed at `/audio/speech`, `/audio/models`, and `/audio/voices`.
|
| 218 |
+
|
| 219 |
+
For Open WebUI, point the custom TTS OpenAI base URL to either:
|
| 220 |
+
|
| 221 |
+
- `http://<host>:8012/v1`
|
| 222 |
+
- `http://<host>:8012`
|
| 223 |
+
|
| 224 |
+
The server reuses the MOSS-TTS-Realtime streaming backend, caches bundled prompt voices, and auto-selects the best available attention backend (`flash_attention_2` on supported Ampere GPUs when `flash-attn` is installed, otherwise `sdpa`).
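
For clients other than Open WebUI, a request to the `POST /v1/audio/speech` route listed above might be assembled as in the following sketch. The JSON field names (`model`, `input`, `voice`) follow the OpenAI speech API convention and are assumed, not confirmed, to be what this server expects; the `<model-id>` and `<voice-id>` placeholders stand for values you would look up via `GET /v1/audio/models` and `GET /v1/audio/voices`, and `build_speech_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str, text: str, voice: str, model: str
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style speech request."""
    # Field names are an assumption based on the OpenAI speech API convention.
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request(
    "http://localhost:8012/v1",
    "Hello from MOSS-TTS-Realtime.",
    voice="<voice-id>",
    model="<model-id>",
)
print(req.get_method(), req.full_url)

# To actually send it (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```

Building the request separately from sending it makes it easy to inspect the URL and body before pointing the client at a live server.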
|
| 225 |
+
|
| 226 |
+
|
| 227 |
+
<a id="moss-tts-basic-usage"></a>
|
| 228 |
+
### MOSS‑TTS Basic Usage
|
| 229 |
+
|
| 230 |
+
If you prefer Gradio demos, we provide 4 scripts for the main models:
|
| 231 |
+
|
| 232 |
+
| Model | Script | Run |
|---|---|---|
| MOSS-TTS | [clis/moss_tts_app.py](clis/moss_tts_app.py) |
| MOSS-TTSD | [clis/moss_ttsd_app.py](clis/moss_ttsd_app.py) |
| MOSS-VoiceGenerator | [clis/moss_voice_generator_app.py](clis/moss_voice_generator_app.py) |
| MOSS-SoundEffect | [clis/moss_sound_effect_app.py](clis/moss_sound_effect_app.py) |
|
| 238 |
+
|
| 239 |
+
For the MOSS-TTS-Realtime Gradio demo, please refer to the [MOSS-TTS-Realtime Model Card](docs/moss_tts_realtime_model_card.md).
|
| 240 |
+
|
| 241 |
+
```python
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

# Sample inputs: plain text (text_1, text_2), tone-numbered Pinyin (text_3, text_4),
# mixed Chinese characters and Pinyin (text_5), and IPA (text_6).
text_1 = "亲爱的你,\n你好呀。\n\n今天,我想用最认真、最温柔的声音,对你说一些重要的话。\n这些话,像一颗小小的星星,希望能在你的心里慢慢发光。\n\n首先,我想祝你——\n每天都能平平安安、快快乐乐。\n\n希望你早上醒来的时候,\n窗外有光,屋子里很安静,\n你的心是轻轻的,没有着急,也没有害怕。\n\n希望你吃饭的时候胃口很好,\n走路的时候脚步稳稳,\n晚上睡觉的时候,能做一个又一个甜甜的梦。\n\n我希望你能一直保持好奇心。\n对世界充满问题,\n对天空、星星、花草、书本和故事感兴趣。\n当你问“为什么”的时候,\n希望总有人愿意认真地听你说话。\n\n我也希望你学会温柔。\n温柔地对待朋友,\n温柔地对待小动物,\n也温柔地对待自己。\n\n如果有一天你犯了错,\n请不要太快责怪自己,\n因为每一个认真成长的人,\n都会在路上慢慢学会更好的方法。\n\n愿你拥有勇气。\n当你站在陌生的地方时,\n当你第一次举手发言时,\n当你遇到困难、感到害怕的时候,\n希望你能轻轻地告诉自己:\n“我可以试一试。”\n\n就算没有一次成功,也没有关系。\n失败不是坏事,\n它只是告诉你,你正在努力。\n\n我希望你学会分享快乐。\n把开心的事情告诉别人,\n把笑声送给身边的人,\n因为快乐被分享的时候,\n会变得更大、更亮。\n\n如果有一天你感到难过,\n我希望你知道——\n难过并不丢脸,\n哭泣也不是软弱。\n\n愿你能找到一个安全的地方,\n慢慢把心里的话说出来,\n然后再一次抬起头,看见希望。\n\n我还希望你能拥有梦想。\n这个梦想也许很大,\n也许很小,\n也许现在还说不清楚。\n\n没关系。\n梦想会和你一起长大,\n在时间里慢慢变得清楚。\n\n最后,我想送你一个最最重要的祝福:\n\n愿你被世界温柔对待,\n也愿你成为一个温柔的人。\n\n愿你的每一天,\n都值得被记住,\n都值得被珍惜。\n\n亲爱的你,\n请记住,\n你是独一无二的,\n你已经很棒了,\n而你的未来,\n一定会慢慢变得闪闪发光。\n\n祝你健康、勇敢、幸福,\n祝你永远带着笑容向前走。"
text_2 = "We stand on the threshold of the AI era.\nArtificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision. It has learned to see, hear, speak, and think, and is beginning to become an extension of human capabilities. AI is not about replacing humans, but about amplifying human creativity, making knowledge more equitable, more efficient, and allowing imagination to reach further. A new era, jointly shaped by humans and intelligent systems, has arrived."
text_3 = "nin2 hao3,qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4?"
text_4 = "nin2 hao3,qing4 wen3 nin2 lai2 zi4 na4 zuo3 cheng4 shi3?"
text_5 = "您好,请问您来自哪 zuo4 cheng2 shi4?"
text_6 = "/həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/"

# To avoid downloading from the cloud, point these at local files under ./assets/audio instead.
ref_audio_1 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_2 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference)
    [processor.build_user_message(text=text_1)],
    [processor.build_user_message(text=text_2)],
    # Pinyin or IPA input
    [processor.build_user_message(text=text_3)],
    [processor.build_user_message(text=text_4)],
    [processor.build_user_message(text=text_5)],
    [processor.build_user_message(text=text_6)],
    # Voice cloning (with reference)
    [processor.build_user_message(text=text_1, reference=[ref_audio_1])],
    [processor.build_user_message(text=text_2, reference=[ref_audio_2])],
    # Duration control
    [processor.build_user_message(text=text_2, tokens=325)],
    [processor.build_user_message(text=text_2, tokens=600)],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)

```
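
The duration-control examples above pass `tokens=325` and `tokens=600`. The sketch below shows how such values could be chosen from a target length in seconds. It assumes that `tokens` counts audio-tokenizer frames at the 12.5 Hz frame rate quoted in the MOSS-Audio-Tokenizer section; the README does not state this mapping explicitly, and the helper names are hypothetical.

```python
FRAME_RATE_HZ = 12.5  # tokenizer frame rate quoted in the MOSS-Audio-Tokenizer section


def tokens_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Approximate `tokens` value for a target duration (assumes 1 token per frame)."""
    return round(seconds * frame_rate_hz)


def duration_for_tokens(tokens: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Approximate duration in seconds for a given `tokens` value."""
    return tokens / frame_rate_hz


print(duration_for_tokens(325))
print(duration_for_tokens(600))
print(tokens_for_duration(26))
```

Under this assumption, the two examples request roughly 26 and 48 seconds of audio for the same text; actual output length may differ, so treat the numbers as targets, not guarantees.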
|
| 348 |
+
|
| 349 |
+
For each model’s full usage, please refer to its corresponding model card.
|
| 350 |
+
|
| 351 |
+
<a id="fine-tuning"></a>
|
| 352 |
+
## Fine-Tuning
|
| 353 |
+
|
| 354 |
+
Fine-tuning tutorials are organized by architecture.
|
| 355 |
+
|
| 356 |
+
Currently available:
|
| 357 |
+
|
| 358 |
+
- `MossTTSDelay` / `OpenMOSS-Team/MOSS-TTS`: [moss_tts_delay/finetuning/README.md](moss_tts_delay/finetuning/README.md)
|
| 359 |
+
|
| 360 |
+
Additional architecture-specific fine-tuning tutorials will be added under their corresponding directories.
|
| 361 |
+
|
| 362 |
+
## llama.cpp Backend (Torch-Free Inference)
|
| 363 |
+
|
| 364 |
+
For lightweight or edge deployment, MOSS-TTS supports a **torch-free** inference path using [llama.cpp](https://github.com/ggerganov/llama.cpp) for the Qwen3 backbone and ONNX Runtime / TensorRT for the audio tokenizer. No PyTorch installation is required.
|
| 365 |
+
|
| 366 |
+
### Quick Start
|
| 367 |
+
|
| 368 |
+
```bash
# 1. Install (torch-free)
pip install -e ".[llama-cpp-onnx]"

# 2. Download pre-quantized backbone + embedding/lm_head weights
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF

# 3. Download ONNX audio tokenizer
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX

# 4. Build the C bridge (one-time, requires llama.cpp compiled from source)
cd moss_tts_delay/llama_cpp && bash build_bridge.sh /path/to/llama.cpp && cd ../..

# 5. Run inference
python -m moss_tts_delay.llama_cpp \
  --config configs/llama_cpp/default.yaml \
  --text "Hello, world!" --output output.wav

# 6. (Optional) Low-memory mode for 8 GB GPUs (loads/unloads components per stage)
python -m moss_tts_delay.llama_cpp \
  --config configs/llama_cpp/trt-8gb.yaml \
  --text "Hello, world!" --output output.wav
```
|
| 391 |
+
|
| 392 |
+
### Installation Profiles
|
| 393 |
+
|
| 394 |
+
| Profile | Install Command | Dependencies | Use Case |
|---------|----------------|--------------|----------|
| **Torch-free (ONNX)** | `pip install -e ".[llama-cpp-onnx]"` | numpy, onnxruntime-gpu, tokenizers | Recommended starting point |
| **Torch-free (TRT)** | `pip install -e ".[llama-cpp-trt]"` | numpy, tensorrt, cuda-python | Maximum audio tokenizer speed (build engines yourself) |
| **Torch-accelerated** | `pip install -e ".[llama-cpp-onnx,llama-cpp-torch]"` | + torch | GPU-accelerated LM heads (~30x faster) |
|
| 399 |
+
|
| 400 |
+
> **Want to convert weights yourself?** See the [conversion guide](moss_tts_delay/llama_cpp/conversion/README.md) for step-by-step instructions on extracting, converting, and quantizing MOSS-TTS weights with llama.cpp.
|
| 401 |
+
|
| 402 |
+
### Model Weights
|
| 403 |
+
|
| 404 |
+
| Repository | Contents | Download |
|-----------|----------|----------|
| [`OpenMOSS-Team/MOSS-TTS-GGUF`](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) | Q4_K_M backbone `.gguf`, `embeddings/` (`.npy`), `lm_heads/` (`.npy`), tokenizer | `huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF` |
| [`OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX`](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX) | Encoder & decoder ONNX models | `huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX` |
|
| 408 |
+
|
| 409 |
+
> **Note:** We do **not** provide pre-built TensorRT engines, as they are tied to your specific GPU and TensorRT version. To use TRT, build engines from the ONNX models yourself — see `moss_audio_tokenizer/trt/build_engine.sh`.
|
| 410 |
+
|
| 411 |
+
### Configuration
|
| 412 |
+
|
| 413 |
+
Four pre-built configs are provided in `configs/llama_cpp/`:
|
| 414 |
+
|
| 415 |
+
- `default.yaml` — ONNX audio + GGUF backbone (recommended start)
|
| 416 |
+
- `trt.yaml` — TensorRT audio + GGUF backbone (max throughput, user-built engines)
|
| 417 |
+
- `trt-8gb.yaml` — Low-memory mode for 8 GB GPUs (staged loading, TRT audio)
|
| 418 |
+
- `cpu-only.yaml` — fully CPU-based (no GPU required)
|
| 419 |
+
|
| 420 |
+
Key config options:
|
| 421 |
+
- `heads_backend: auto | numpy | torch` — LM heads computation backend
|
| 422 |
+
- `audio_backend: onnx | trt | torch` — audio tokenizer backend
|
| 423 |
+
- `low_memory: true | false` — staged loading for limited VRAM (loads/unloads encoder, backbone, decoder per stage)
|
| 424 |
+
- `kv_cache_type_k / kv_cache_type_v` — KV cache quantization (e.g. `q8_0`, `q4_0`) to reduce VRAM
|
| 425 |
+
- `flash_attn: auto | enabled | disabled` — flash attention for lower peak VRAM during prefill
|
| 426 |
+
|
| 427 |
+
For full documentation, see [moss_tts_delay/llama_cpp/README.md](moss_tts_delay/llama_cpp/README.md).
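
Because several of the options above accept only a fixed set of values, a small pre-flight check can catch typos before a long model load. The sketch below is illustrative only: it assumes the options appear as flat top-level keys once the YAML file has been parsed into a dict, which may not match the real layout of the files in `configs/llama_cpp/`, and `check_config` is a hypothetical helper, not part of the package.

```python
# Allowed values copied from the option list above.
ALLOWED = {
    "heads_backend": {"auto", "numpy", "torch"},
    "audio_backend": {"onnx", "trt", "torch"},
    "low_memory": {True, False},
    "flash_attn": {"auto", "enabled", "disabled"},
}


def check_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key in cfg and cfg[key] not in allowed:
            options = sorted(allowed, key=str)
            problems.append(f"{key}={cfg[key]!r} is not one of {options}")
    return problems


good = {"heads_backend": "numpy", "audio_backend": "onnx", "low_memory": True, "flash_attn": "auto"}
bad = {"audio_backend": "cuda"}
print(check_config(good))
print(check_config(bad))
```

Keys that are absent are not reported, so the check stays compatible with configs that rely on defaults.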
|
| 428 |
+
|
| 429 |
+
## Evaluation
|
| 430 |
+
|
| 431 |
+
This section summarizes the **family‑level evaluation highlights** for MOSS‑TTS, MOSS‑TTSD, and MOSS‑VoiceGenerator. For full details, see each model’s model card.
|
| 432 |
+
|
| 433 |
+
### MOSS‑TTS
|
| 434 |
+
MOSS‑TTS achieves state‑of‑the‑art speaker similarity among open‑source models on the zero‑shot TTS benchmark `Seed‑TTS‑eval` (`MossTTSLocal` has the highest EN and ZH SIM of any open‑source entry below), while its WER/CER remains competitive with leading closed‑source systems.
|
| 435 |
+
|
| 436 |
+
| Model | Params | Open‑source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---:|:---:|---:|---:|---:|---:|
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 |
| FishAudio‑S1 | 4B | ❌ | 1.72 | 62.57 | 1.22 | 72.1 |
| Seed‑TTS | | ❌ | 2.25 | 76.2 | 1.12 | 79.6 |
| MiniMax‑Speech | | ❌ | 1.65 | 69.2 | 0.83 | 78.3 |
| | | | | | | |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 |
| CosyVoice3 | 0.5B | ✅ | 2.02 | 71.8 | 1.16 | 78 |
| CosyVoice3 | 1.5B | ✅ | 2.22 | 72 | 1.12 | 78.1 |
| F5‑TTS | 0.3B | ✅ | 2 | 67 | 1.53 | 76 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66 |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46 | 1.51 | 63.5 |
| FireRedTTS‑2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 |
| Qwen2.5‑Omni | 7B | ✅ | 2.72 | 63.2 | 1.7 | 75.2 |
| FishAudio‑S1‑mini | 0.5B | ✅ | 1.94 | 55 | 1.18 | 68.5 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 |
| HiggsAudio‑v2 | 3B | ✅ | 2.44 | 67.7 | 1.5 | 74 |
| VoxCPM | 0.5B | ✅ | 1.85 | 72.9 | **0.93** | 77.2 |
| Qwen3‑TTS | 0.6B | ✅ | 1.68 | 70.39 | 1.23 | 76.4 |
| Qwen3‑TTS | 1.7B | ✅ | **1.5** | 71.45 | 1.33 | 76.72 |
| GLM-TTS | 1.5B | ✅ | 2.23 | 67.2 | 1.03 | 76.1 |
| GLM-TTS-RL | 1.5B | ✅ | 1.91 | 68.1 | 0.89 | 76.4 |
| | | | | | | |
| **MossTTSDelay** | **8B** | ✅ | 1.79 | 71.46 | 1.32 | 77.05 |
| **MossTTSLocal** | **1.7B** | ✅ | 1.85 | **73.42** | 1.2 | **78.82** |
|
| 464 |
+
|
| 465 |
+
### MOSS‑TTSD
|
| 466 |
+
|
| 467 |
+
#### Objective Evaluation
|
| 468 |
+
We evaluate MOSS‑TTSD-v1.0 using three objective metrics: Speaker Attribution Accuracy (ACC), Speaker Similarity (SIM), and Word Error Rate (WER). Benchmarked against multiple open-source and closed-source models, the results show that MOSS‑TTSD-v1.0 consistently achieves either the best or second-best performance.
|
| 469 |
+
|
| 470 |
+
| Model | ZH - SIM | ZH - ACC | ZH - WER | EN - SIM | EN - ACC | EN - WER |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Comparison with Open-Source Models** | | | | | | |
| **MOSS-TTSD-v1.0** | **0.7949** | **0.9587** | **0.0485** | **0.7326** | **0.9626** | 0.0988 |
| MOSS-TTSD-v0.7 | 0.7423 | 0.9391 | 0.0517 | 0.6743 | 0.9266 | 0.1612 |
| Vibevoice 7B | 0.7590 | 0.9222 | 0.0570 | 0.7140 | 0.9554 | **0.0946** |
| Vibevoice 1.5B | 0.7415 | 0.8798 | 0.0818 | 0.6961 | 0.9353 | 0.1133 |
| FireRedTTS2 | 0.7383 | 0.9022 | 0.0768 | - | - | - |
| Higgs Audio V2 | - | - | - | 0.6860 | 0.9025 | 0.2131 |
| **Comparison with Proprietary Models** | | | | | | |
| **MOSS-TTSD-v1.0 (elevenlabs_voice)** | **0.8165** | **0.9736** | 0.0391 | **0.7304** | **0.9565** | 0.1005 |
| Eleven V3 | 0.6970 | 0.9653 | **0.0363** | 0.6730 | 0.9498 | **0.0824** |
| | | | | | | |
| **MOSS-TTSD-v1.0 (gemini_voice)** | - | - | - | **0.7893** | **0.9655** | 0.0984 |
| gemini-2.5-pro-preview-tts | - | - | - | 0.6786 | 0.9537 | **0.0859** |
| gemini-2.5-flash-preview-tts | - | - | - | 0.7194 | 0.9511 | 0.0871 |
| | | | | | | |
| **MOSS-TTSD-v1.0 (doubao_voice)** | **0.8226** | **0.9630** | 0.0571 | - | - | - |
| Doubao_Podcast | 0.8034 | 0.9606 | **0.0472** | - | - | - |
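
For readers unfamiliar with the WER columns above, the following sketch shows the standard definition: word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. It is only the textbook metric; the ASR system, text normalization, and scoring scripts used to produce the numbers in the table are not shown here, and `word_error_rate` is a name chosen for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A value such as 0.0485 in the table therefore corresponds to about 4.85 word errors per 100 reference words.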
|
| 489 |
+
|
| 490 |
+
#### Subjective Evaluation
|
| 491 |
+
For open-source models, annotators are asked to score each sample pair in terms of speaker attribution accuracy, voice similarity, prosody, and overall quality. Following the methodology of the LMSYS Chatbot Arena, we compute Elo ratings and confidence intervals for each dimension.
|
| 492 |
+

|
| 493 |
+
|
| 494 |
+
For closed-source models, annotators are only asked to choose the overall preferred one in each pair, and we compute the win rate accordingly.
|
| 495 |
+

|
| 496 |
+
|
| 497 |
+
|
| 498 |
+
### MOSS‑VoiceGenerator
|
| 499 |
+
MOSS‑VoiceGenerator demonstrates strong subjective preference across **overall preference**, **instruction following**, and **naturalness**.
|
| 500 |
+
|
| 501 |
+
<p align="center">
|
| 502 |
+
<img src="./assets/moss_voice_generator_winrate.png" width="70%" />
|
| 503 |
+
</p>
|
| 504 |
+
|
| 505 |
+
## MOSS-Audio-Tokenizer
|
| 506 |
+
|
| 507 |
+
<a id="mat-intro"></a>
|
| 508 |
+
### Introduction
|
| 509 |
+
**MOSS-Audio-Tokenizer** serves as the unified discrete audio interface for the entire MOSS-TTS Family. It is based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture—a 1.6-billion-parameter, "CNN-free" homogeneous audio tokenizer built entirely from Causal Transformer blocks.
|
| 510 |
+
|
| 511 |
+
- **Unified Discrete Bridge**: It acts as the shared backbone for MOSS-TTS, MOSS-TTSD, MOSS-VoiceGenerator, MOSS-SoundEffect, and MOSS-TTS-Realtime, providing a consistent audio representation across the family.
|
| 512 |
+
- **Extreme Compression & High Fidelity**: It compresses 24 kHz raw audio into a remarkably low frame rate of 12.5 Hz. Utilizing a 32-layer Residual Vector Quantizer (RVQ), it supports high-fidelity reconstruction across variable bitrates from 0.125 kbps to 4 kbps.
|
| 513 |
+
- **Massive-Scale General Audio Training**: Trained from scratch on 3 million hours of diverse data (speech, sound effects, and music), the model achieves state-of-the-art reconstruction among open-source audio tokenizers.
|
| 514 |
+
- **Native Streaming Design**: The pure Causal Transformer architecture is specifically designed for scalability and low-latency streaming inference, enabling real-time production workflows.
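
The figures in the list above fit together arithmetically, as the sketch below shows. The codebook size of 1024 entries (10 bits per code) is not stated in this document; it is inferred here from the quoted minimum bitrate (0.125 kbps at 12.5 Hz with a single codebook), so treat it as an assumption.

```python
import math

SAMPLE_RATE_HZ = 24_000   # input audio sample rate quoted above
FRAME_RATE_HZ = 12.5      # token frame rate quoted above
CODEBOOK_SIZE = 1024      # assumption inferred from 0.125 kbps / 12.5 Hz = 10 bits


def bitrate_kbps(
    n_codebooks: int,
    frame_rate_hz: float = FRAME_RATE_HZ,
    codebook_size: int = CODEBOOK_SIZE,
) -> float:
    """Bitrate when decoding with the first n_codebooks RVQ layers."""
    bits_per_code = math.log2(codebook_size)
    return n_codebooks * frame_rate_hz * bits_per_code / 1000


samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
print(samples_per_frame)   # audio samples represented by one token frame
print(bitrate_kbps(1))     # single codebook
print(bitrate_kbps(32))    # all 32 RVQ layers
```

Under that assumption, one frame covers 1920 audio samples, and choosing between 1 and 32 codebooks at decode time spans the quoted 0.125 kbps to 4 kbps range.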
|
| 515 |
+
|
| 516 |
+
To learn more about setup, advanced usage, and evaluation metrics, please visit the [MOSS-Audio-Tokenizer Repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer).
|
| 517 |
+
|
| 518 |
+
<p align="center">
|
| 519 |
+
<img src="./assets/arch_moss_audio_tokenizer.png" alt="MOSS Audio Tokenizer architecture" width="100%" />
|
| 520 |
+
Architecture of MOSS Audio Tokenizer
|
| 521 |
+
</p>
|
| 522 |
+
|
| 523 |
+
### Model Weights
|
| 524 |
+
|
| 525 |
+
| Model | Hugging Face | ModelScope |
|:-----:|:------------:|:----------:|
| **MOSS-Audio-Tokenizer** | [OpenMOSS-Team/MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) | [openmoss/MOSS-Audio-Tokenizer](https://modelscope.cn/models/openmoss/MOSS-Audio-Tokenizer) |
|
| 528 |
+
|
| 529 |
+
### Objective Reconstruction Evaluation
|
| 530 |
+
|
| 531 |
+
We compare **MOSS Audio Tokenizer** with open-source audio tokenizers on the LibriSpeech test-clean subset using SIM, STOI, PESQ-NB, and PESQ-WB. Bitrate is controlled by varying the number of RVQ codebooks during decoding, and MOSS Audio Tokenizer leads reconstruction quality among open-source audio tokenizers at comparable 0–4 kbps bitrates.
|
| 532 |
+
|
| 533 |
+
<p align="center">
|
| 534 |
+
<img src="./assets/evaluation_moss_audio_tokenizer.png" alt="LibriSpeech objective metrics for audio tokenizers" width="90%" />
|
| 535 |
+
</p>
|
| 536 |
+
|
| 537 |
+
## LICENSE
|
| 538 |
+
|
| 539 |
+
Models in the MOSS-TTS Family are licensed under the Apache License 2.0.
|
| 540 |
+
|
| 541 |
+
## Citation
|
| 542 |
+
|
| 543 |
+
```bibtex
|
| 544 |
+
```
|
| 545 |
+
|
| 546 |
+
## Star History
|
| 547 |
+
|
| 548 |
+
[Star History Chart](https://www.star-history.com/#OpenMOSS/MOSS-TTS&type=date&legend=top-left)
|
README_zh.md
ADDED
|
@@ -0,0 +1,534 @@
|
| 1 |
+
# MOSS-TTS Family
|
| 2 |
+
|
| 3 |
+
|
| 4 |
+
|
| 5 |
+
<br>
|
| 6 |
+
|
| 7 |
+
<p align="center" style="display:flex; justify-content:center; align-items:center; gap:24px;">
|
| 8 |
+
<img src="./assets/OpenMOSS_Logo.png" height="80" style="display:block; transform: translateY(0px);" />
|
| 9 |
+
<img src="./assets/mosi-logo.png" height="50" style="display:block; transform: translateY(-8px);" />
|
| 10 |
+
</p>
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
<div align="center">
|
| 15 |
+
<a href="https://clawhub.ai/luogao2333/moss-tts-voice"><img src="https://img.shields.io/badge/🦞_OpenClaw-Skills-8A2BE2" alt="OpenClaw"></a>
|
| 16 |
+
<a href="https://huggingface.co/collections/OpenMOSS-Team/moss-tts"><img src="https://img.shields.io/badge/Huggingface-Models-orange?logo=huggingface&"></a>
|
| 17 |
+
<a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&"></a>
|
| 18 |
+
<a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&"></a>
|
| 19 |
+
<a href="https://github.com/OpenMOSS/MOSS-TTS"><img src="https://img.shields.io/badge/Arxiv-Coming%20soon-red?logo=arxiv&"></a>
|
| 20 |
+
|
| 21 |
+
<a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&"></a>
|
| 22 |
+
<a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&"></a>
|
| 23 |
+
<a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&"></a>
|
| 24 |
+
<a href="https://discord.gg/fvm5TaWjU3"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&"></a>
|
| 25 |
+
<a href="./assets/wechat.jpg"><img src="https://img.shields.io/badge/WeChat-Join-07C160?logo=wechat&logoColor=white" alt="WeChat"></a>
|
| 26 |
+
|
| 27 |
+
</div>
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
[English](README.md) | [简体中文](README_zh.md)
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
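The MOSS‑TTS Family is an open-source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is designed for **high fidelity**, **high expressiveness**, and **complex real-world scenarios**, covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS.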
|
| 34 |
+
|
| 35 |
+
<a id="news"></a>
|
| 36 |
+
## News
* 2026.3.10: ⚡️ Greatly reduced the VRAM footprint of the llama.cpp inference pipeline. The 8B model can now run on a GPU with 8 GB of VRAM!
* 2026.3.4: Added **PyTorch-free inference** support, enabling lightweight on-device deployment via [llama.cpp](https://github.com/ggerganov/llama.cpp) + ONNX Runtime. Quantized GGUF weights are published at [`OpenMOSS-Team/MOSS-TTS-GGUF`](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF), and the ONNX audio tokenizer at [`OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX`](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX). See [llama.cpp backend](#llamacpp-后端无-pytorch-推理).
* 2026.3.4: 🎉 We published MOSS-TTS skills on the 🦞 OpenClaw [ClawHub](https://clawhub.ai) platform: [feishu-voice-tts](https://clawhub.ai/helloeveryworlds/feishu-voice-tts) and [moss-tts-voice](https://clawhub.ai/luogao2333/moss-tts-voice).
* 2026.2.10: 🎉🎉🎉 We released the [MOSS-TTS Family](https://huggingface.co/collections/OpenMOSS-Team/moss-tts). See our [Blog](https://mosi.cn/#models) for details. Our Hugging Face Spaces: [MOSS-TTS](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS), [MOSS-TTSD-v1.0](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTSD-v1.0), [MOSS-VoiceGenerator](https://huggingface.co/spaces/OpenMOSS-Team/MOSS-VoiceGenerator).
|
| 41 |
+
|
| 42 |
+
## Demo
|
| 43 |
+
|
| 44 |
+
<div align="center">
|
| 45 |
+
<video src="https://gist.github.com/user-attachments/assets/fdce9f66-20ec-45e8-9615-89606ae2fbe8" width="70%" poster=""> </video>
|
| 46 |
+
</div>
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
## Contents

- [Introduction](#introduction)
- [Model Architecture](#architecture)
- [Released Models](#released-models)
- [Supported Languages](#supported-languages)
- [Quickstart](#quickstart)
  - [OpenClaw API Skills](#openclaw-api-skills)
  - [Environment Setup](#environment-setup)
  - [(Optional) Install FlashAttention 2](#optional-install-flashattention-2)
  - [Basic Usage](#moss-tts-basic-usage)
  - [Fine-tuning](#fine-tuning)
- [llama.cpp Backend (PyTorch-free Inference)](#llamacpp-后端无-pytorch-推理)
- [Evaluation](#evaluation)
  - [MOSS-TTS Evaluation](#eval-moss-tts)
  - [MOSS-TTSD Evaluation](#eval-moss-ttsd)
  - [MOSS-VoiceGenerator Evaluation](#eval-moss-voicegenerator)
- [Audio Tokenizer](#audio-tokenizer)
  - [Introduction](#audio-tokenizer-intro)
  - [Model Weights](#model-weights)
  - [Objective Reconstruction Evaluation](#重建质量客观评测)
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
<a id="introduction"></a>
|
| 73 |
+
## Introduction
|
| 74 |
+
|
| 75 |
+
<p align="center">
|
| 76 |
+
<img src="./assets/moss_tts_family.jpeg" width="85%" />
|
| 77 |
+
</p>
|
| 78 |
+
|
| 79 |
+
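When audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **stay stable for tens of minutes**, and **support dialogue, role-play, and real-time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** splits the workflow into 5 production-grade models that can be used independently or combined into a complete pipeline.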
|
| 80 |
+
|
| 81 |
+
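- **MOSS‑TTS**: The flagship production-grade TTS foundation model of the family. **Its core strengths are high fidelity and state-of-the-art zero-shot voice cloning**. It supports **long-text, long-form speech generation**, **fine-grained control over Pinyin, phonemes, and duration**, and **multilingual and Chinese-English mixed synthesis**. It can serve as the backbone for large-scale narration, dubbing, and voice products.
- **MOSS‑TTSD**: A spoken-dialogue generation model for expressive, multi-speaker, ultra-long continuous dialogue audio. This release introduces the new **v1.0**. Compared with v0.7, it achieves **industry-leading results on objective metrics** such as voice similarity, speaker-switch accuracy, and word error rate, and in arena-style subjective evaluation it **outperformed top closed-source models such as Doubao and Gemini 2.5 Pro**. See the [MOSS-TTSD repository](https://github.com/OpenMOSS/MOSS-TTSD) for details.
- **MOSS‑VoiceGenerator**: An open-source voice design model that generates diverse speaker voices or styles directly from textual style instructions, **with no reference audio required**. It unifies voice design, style control, and content synthesis, and can be used on its own or as the voice design layer for downstream TTS. **In arena-style ratings it outperforms other leading voice design models**.
- **MOSS‑TTS‑Realtime**: A multi-turn, context-aware real-time TTS model for voice agents. It combines the text and prior speech signals of a multi-turn conversation for low-latency incremental synthesis, keeping replies coherent, natural, and consistent in voice. **It pairs well with text models for building low-latency voice agents**.
- **MOSS‑SoundEffect**: A **sound effect generation** model for content production, with broad category coverage and controllable duration. From text instructions it generates audio such as natural environments, urban scenes, animals, human actions, and music-like clips, suitable for film, games, interactive experiences, and data synthesis.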
|
| 86 |
+
|
| 87 |
+
<a id="architecture"></a>
|
| 88 |
+
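## Model Architecture

Within a unified training/evaluation framework, we use **MossTTSDelay** and **MossTTSLocal** as complementary baselines: **Delay** emphasizes long-context stability, inference speed, and production readiness, while **Local** emphasizes a lightweight, flexible design and strong objective metrics in streaming-oriented settings. Together they provide reproducible, comparable references for deployment and research.

**MossTTSRealtime** is not a third comparison baseline but a capability-driven design for voice agents. It models multi-turn context from both the text history and the acoustics of the user's speech, and uses low-latency streaming synthesis to keep replies coherent and consistent in voice.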
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
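| Architecture | Core Mechanism | Architecture Details |
|---|---|---|
| `MossTTSDelay` | Multi-head parallel RVQ prediction with delay-pattern scheduling | [moss_tts_delay/README.md](moss_tts_delay/README.md) |
| `MossTTSLocal` | Time-synchronous RVQ module based on a depth Transformer | [moss_tts_local/README.md](moss_tts_local/README.md) |
| `MossTTSRealtime` | Hierarchical text-audio inputs for real-time synthesis | [moss_tts_realtime/README.md](moss_tts_realtime/README.md) |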
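
To make "delay-pattern scheduling" concrete, here is a minimal sketch in plain Python. It assumes the common scheme in which RVQ codebook `k` is shifted right by `k` steps so that all codebooks can be predicted in parallel at each step; the exact schedule used by `MossTTSDelay` is described in its own README and may differ. The `PAD` value and both function names are hypothetical, chosen only for this illustration.

```python
from typing import List

PAD = -1  # hypothetical padding id, used only in this sketch


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps; codes[k][t] is codebook k at frame t."""
    n_codebooks = len(codes)
    n_frames = len(codes[0])
    total = n_frames + n_codebooks - 1  # length after the largest shift
    delayed = []
    for k, row in enumerate(codes):
        delayed.append([pad] * k + list(row) + [pad] * (total - n_frames - k))
    return delayed


def undo_delay_pattern(delayed: List[List[int]], n_frames: int) -> List[List[int]]:
    """Invert apply_delay_pattern by slicing each row at its own offset."""
    return [row[k:k + n_frames] for k, row in enumerate(delayed)]


codes = [[1, 2, 3], [4, 5, 6]]          # 2 codebooks, 3 frames
delayed = apply_delay_pattern(codes)    # codebook 1 starts one step later
restored = undo_delay_pattern(delayed, n_frames=3)
```

With this layout, step `t` of the sequence carries codebook 0 of frame `t` next to codebook 1 of frame `t - 1`, which is what lets a single forward pass emit one token per codebook.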
|
| 100 |
+
|
| 101 |
+
<a id="released-models"></a>
|
| 102 |
+
## Model Overview

| Model | Architecture | Size | Model Card | Hugging Face | ModelScope |
|---|---|---:|---|---|---|
| **MOSS-TTS** | `MossTTSDelay` | 8B | [Model Card](docs/moss_tts_model_card.md) | [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) | [ModelScope](https://modelscope.cn/models/openmoss/MOSS-TTS) |
| | `MossTTSLocal` | 1.7B | [Model Card](docs/moss_tts_model_card.md) | [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) | [ModelScope](https://modelscope.cn/models/openmoss/MOSS-TTS-Local-Transformer) |
| **MOSS‑TTSD‑V1.0** | `MossTTSDelay` | 8B | [Model Card](docs/moss_ttsd_model_card.md) | [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) | [ModelScope](https://modelscope.cn/models/openmoss/MOSS-TTSD-v1.0) |
| **MOSS‑VoiceGenerator** | `MossTTSDelay` | 1.7B | [Model Card](docs/moss_voice_generator_model_card.md) | [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-VoiceGenerator) | [ModelScope](https://modelscope.cn/models/openmoss/MOSS-VoiceGenerator) |
| **MOSS‑SoundEffect** | `MossTTSDelay` | 8B | [Model Card](docs/moss_sound_effect_model_card.md) | [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) | [ModelScope](https://modelscope.cn/models/openmoss/MOSS-SoundEffect) |
| **MOSS‑TTS‑Realtime** | `MossTTSRealtime` | 1.7B | [Model Card](docs/moss_tts_realtime_model_card.md) | [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) | [ModelScope](https://modelscope.cn/models/openmoss/MOSS-TTS-Realtime) |
|
| 112 |
+
|
| 113 |
+
<a id="supported-languages"></a>
|
| 114 |
+
|
| 115 |
+
## Supported Languages

MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime currently support **20 languages**:

| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
| Italian | it | 🇮🇹 | Hungarian | hu | 🇭🇺 | Korean | ko | 🇰🇷 |
| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | | | |
| Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 | | | |
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
|
| 131 |
+
<a id="quickstart"></a>
|
| 132 |
+
## Quickstart
|
| 133 |
+
|
| 134 |
+
### OpenClaw API Skills
|
| 135 |
+
|
| 136 |
+
MOSS-TTS skills are available on the 🦞 OpenClaw [ClawHub](https://clawhub.ai) platform. An API key can be obtained from [MOSI AI Studio](https://studio.mosi.cn).

| Skill | Description | Install Command |
|---|---|---|
| [`feishu-voice-tts`](https://clawhub.ai/helloeveryworlds/feishu-voice-tts) | Send voice messages in Feishu | `clawhub install feishu-voice-tts` |
| [`moss-tts-voice`](https://clawhub.ai/luogao2333/moss-tts-voice) | Generate speech via the MOSS-TTS API | `clawhub install moss-tts-voice` |
|
| 142 |
+
|
| 143 |
+
<a id="environment-setup"></a>
|
| 144 |
+
### Environment Setup

We recommend a clean Python environment.

#### Using Conda
|
| 149 |
+
|
| 150 |
+
```bash
|
| 151 |
+
conda create -n moss-tts python=3.12 -y
|
| 152 |
+
conda activate moss-tts
|
| 153 |
+
```
|
| 154 |
+
|
| 155 |
+
Install all dependencies:
|
| 156 |
+
|
| 157 |
+
```bash
|
| 158 |
+
git clone https://github.com/OpenMOSS/MOSS-TTS.git
|
| 159 |
+
cd MOSS-TTS
|
| 160 |
+
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
|
| 161 |
+
```
|
| 162 |
+
|
| 163 |
+
#### Using `uv`
|
| 164 |
+
|
| 165 |
+
```bash
|
| 166 |
+
# Install uv first: https://docs.astral.sh/uv/getting-started/installation/
|
| 167 |
+
git clone https://github.com/OpenMOSS/MOSS-TTS.git
|
| 168 |
+
cd MOSS-TTS
|
| 169 |
+
uv venv --python 3.12 .venv
|
| 170 |
+
source .venv/bin/activate
|
| 171 |
+
uv pip install --torch-backend cu128 -e ".[torch-runtime]"
|
| 172 |
+
```
|
| 173 |
+
<a id="optional-install-flashattention-2"></a>
|
| 174 |
+
#### (Optional) Install FlashAttention 2

If your hardware supports it, you can install FlashAttention 2 to improve speed and reduce VRAM usage.

If you use Conda/pip:
|
| 179 |
+
|
| 180 |
+
```bash
|
| 181 |
+
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
|
| 182 |
+
```
|
| 183 |
+
|
| 184 |
+
On machines with limited RAM and many CPU cores, cap the number of parallel compile jobs:
|
| 185 |
+
|
| 186 |
+
```bash
|
| 187 |
+
MAX_JOBS=4 pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
|
| 188 |
+
```
|
| 189 |
+
|
| 190 |
+
If you use `uv`:
|
| 191 |
+
|
| 192 |
+
```bash
|
| 193 |
+
uv pip install --torch-backend cu128 -e ".[torch-runtime,flash-attn]"
|
| 194 |
+
```
|
| 195 |
+
|
| 196 |
+
On machines with limited RAM and many CPU cores, cap the number of parallel compile jobs:
|
| 197 |
+
|
| 198 |
+
```bash
|
| 199 |
+
MAX_JOBS=4 uv pip install --torch-backend cu128 -e ".[torch-runtime,flash-attn]"
|
| 200 |
+
```
|
| 201 |
+
|
| 202 |
+
Notes:
- Dependencies are managed in `pyproject.toml`, which currently pins `torch==2.9.1+cu128` and `torchaudio==2.9.1+cu128`.
- With `uv`, `--torch-backend cu128` lets uv choose the source of the PyTorch CUDA wheels, while the remaining dependencies are still resolved with the default, safe index strategy.
- For a different backend, replace `cu128` with the target backend (for example `cpu` or `cu126`).
- If FlashAttention 2 fails to build, skip it and use the default attention backend.
- FlashAttention 2 supports only some GPUs and is typically used with `torch.float16` or `torch.bfloat16`.
|
| 208 |
+
|
| 209 |
+
|
| 210 |
+
<a id="moss-tts-basic-usage"></a>
|
| 211 |
+
### MOSS‑TTS Basic Usage

If you prefer a Gradio interface, we provide a script for each of the 4 main models:

| Model | Script |
|---|---|
| MOSS-TTS | [clis/moss_tts_app.py](clis/moss_tts_app.py) |
| MOSS-TTSD | [clis/moss_ttsd_app.py](clis/moss_ttsd_app.py) |
| MOSS-VoiceGenerator | [clis/moss_voice_generator_app.py](clis/moss_voice_generator_app.py) |
| MOSS-SoundEffect | [clis/moss_sound_effect_app.py](clis/moss_sound_effect_app.py) |

For the MOSS-TTS-Realtime Gradio demo, see the [MOSS-TTS-Realtime Model Card](docs/moss_tts_realtime_model_card.md).
|
| 223 |
+
|
| 224 |
+
```python
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text_1 = "亲爱的你,\n你好呀。\n\n今天,我想用最认真、最温柔的声音,对你说一些重要的话。\n这些话,像一颗小小的星星,希望能在你的心里慢慢发光。\n\n首先,我想祝你——\n每天都能平平安安、快快乐乐。\n\n希望你早上醒来的时候,\n窗外有光,屋子里很安静,\n你的心是轻轻的,没有着急,也没有害怕。\n\n希望你吃饭的时候胃口很好,\n走路的时候脚步稳稳,\n晚上睡觉的时候,能做一个又一个甜甜的梦。\n\n我希望你能一直保持好奇心。\n对世界充满问题,\n对天空、星星、花草、书本和故事感兴趣。\n当你问“为什么”的时候,\n希望总有人愿意认真地听你说话。\n\n我也希望你学会温柔。\n温柔地对待朋友,\n温柔地对待小动物,\n也温柔地对待自己。\n\n如果有一天你犯了错,\n请不要太快责怪自己,\n因为每一个认真成长的人,\n都会在路上慢慢学会更好的方法。\n\n愿你拥有勇气。\n当你站在陌生的地方时,\n当你第一次举手发言时,\n当你遇到困难、感到害怕的时候,\n希望你能轻轻地告诉自己:\n“我可以试一试。”\n\n就算没有一次成功,也没有关系。\n失败不是坏事,\n它只是告诉你,你正在努力。\n\n我希望你学会分享快乐。\n把开心的事情告诉别人,\n把笑声送给身边的人,\n因为快乐被分享的时候,\n会变得更大、更亮。\n\n如果有一天你感到难过,\n我希望你知道——\n难过并不丢脸,\n哭泣也不是软弱。\n\n愿你能找到一个安全的地方,\n慢慢把心里的话说出来,\n然后再一次抬起头,看见希望。\n\n我还希望你能拥有梦想。\n这个梦想也许很大,\n也许很小,\n也许现在还说不清楚。\n\n没关系。\n梦想会和你一起长大,\n在时间里慢慢变得清楚。\n\n最后,我想送你一个最最重要的祝福:\n\n愿你被世界温柔对待,\n也愿你成为一个温柔的人。\n\n愿你的每一天,\n都值得被记住,\n都值得被珍惜。\n\n亲爱的你,\n请记住,\n你是独一无二的,\n你已经很棒了,\n而你的未来,\n一定会慢慢变得闪闪发光。\n\n祝你健康、勇敢、幸福,\n祝你永远带着笑容向前走。"
text_2 = "We stand on the threshold of the AI era.\nArtificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision. It has learned to see, hear, speak, and think, and is beginning to become an extension of human capabilities. AI is not about replacing humans, but about amplifying human creativity, making knowledge more equitable, more efficient, and allowing imagination to reach further. A new era, jointly shaped by humans and intelligent systems, has arrived."
text_3 = "nin2 hao3,qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4?"
text_4 = "nin2 hao3,qing4 wen3 nin2 lai2 zi4 na4 zuo3 cheng4 shi3?"
text_5 = "您好,请问您来自哪 zuo4 cheng2 shi4?"
text_6 = "/həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/"

# Reference audio is fetched from remote URLs here. To avoid downloading,
# point these at the local copies under ./assets/audio instead.
ref_audio_1 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_2 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference)
    [processor.build_user_message(text=text_1)],
    [processor.build_user_message(text=text_2)],
    # Pinyin or IPA input
    [processor.build_user_message(text=text_3)],
    [processor.build_user_message(text=text_4)],
    [processor.build_user_message(text=text_5)],
    [processor.build_user_message(text=text_6)],
    # Voice cloning (with reference)
    [processor.build_user_message(text=text_1, reference=[ref_audio_1])],
    [processor.build_user_message(text=text_2, reference=[ref_audio_2])],
    # Duration control
    [processor.build_user_message(text=text_2, tokens=325)],
    [processor.build_user_message(text=text_2, tokens=600)],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        # Decode each generated message and write its waveform to disk.
        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)

```
|
| 331 |
+
|
| 332 |
+
For complete usage of each model, see its model card.
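
The duration-control examples pass a `tokens` budget to `build_user_message`. As a rough guide for choosing that value, the sketch below converts between seconds and tokens. It assumes that `tokens` counts audio frames at the 12.5 Hz frame rate quoted for MOSS-Audio-Tokenizer; that mapping is an assumption of this sketch rather than something this README states, and both helper names are hypothetical.

```python
FRAME_RATE_HZ = 12.5  # frame rate quoted for MOSS-Audio-Tokenizer


def seconds_to_tokens(seconds: float) -> int:
    """Approximate token budget for a target duration (assumes 1 token = 1 frame)."""
    return round(seconds * FRAME_RATE_HZ)


def tokens_to_seconds(tokens: int) -> float:
    """Approximate duration implied by a token budget (same assumption)."""
    return tokens / FRAME_RATE_HZ


short_clip = tokens_to_seconds(325)   # the tokens=325 example
long_clip = tokens_to_seconds(600)    # the tokens=600 example
budget = seconds_to_tokens(30.0)      # budget for a roughly 30-second target
```

Under that assumption the two budgets in the example correspond to roughly 26 and 48 seconds of audio for the same text.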
|
| 333 |
+
|
| 334 |
+
<a id="fine-tuning"></a>
|
| 335 |
+
### Fine-tuning

Fine-tuning tutorials are organized by architecture.

Currently available:

- `MossTTSDelay` / `OpenMOSS-Team/MOSS-TTS`: see [moss_tts_delay/finetuning/README_zh.md](moss_tts_delay/finetuning/README_zh.md)

Tutorials for the remaining architectures will be added to their respective directories.
|
| 344 |
+
|
| 345 |
+
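<a id="llamacpp-后端无-pytorch-推理"></a>
## llama.cpp Backend (PyTorch-free Inference)

MOSS-TTS can run its Qwen3 backbone with [llama.cpp](https://github.com/ggerganov/llama.cpp) and its audio tokenizer with ONNX Runtime / TensorRT, enabling lightweight on-device inference with **no PyTorch dependency at all**.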
|
| 348 |
+
|
| 349 |
+
### Quickstart

```bash
# 1. Install (no PyTorch)
pip install -e ".[llama-cpp-onnx]"

# 2. Download the pre-quantized backbone + embedding/lm_head weights
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF

# 3. Download the ONNX audio tokenizer
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX

# 4. Build the C bridge (one-time; requires a llama.cpp source build)
cd moss_tts_delay/llama_cpp && bash build_bridge.sh /path/to/llama.cpp && cd ../..

# 5. Run inference
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "你好世界!" --output output.wav

# 6. (Optional) Low-memory mode for GPUs with 8 GB of VRAM: components are loaded/unloaded stage by stage
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/trt-8gb.yaml \
    --text "你好世界!" --output output.wav
```
|
| 374 |
+
|
| 375 |
+
### Installation Options

| Option | Install Command | Dependencies | Use Case |
|------|---------|------|---------|
| **Torch-free (ONNX)** | `pip install -e ".[llama-cpp-onnx]"` | numpy, onnxruntime-gpu, tokenizers | Recommended starting point |
| **Torch-free (TRT)** | `pip install -e ".[llama-cpp-trt]"` | numpy, tensorrt, cuda-python | Highest audio tokenizer performance (you must build the engine yourself) |
| **Torch-accelerated** | `pip install -e ".[llama-cpp-onnx,llama-cpp-torch]"` | + torch | GPU-accelerated LM heads (about 30× faster) |

> **Want to convert the weights yourself?** See the [conversion guide](moss_tts_delay/llama_cpp/conversion/README_zh.md) to learn how to extract, convert, and quantize the MOSS-TTS weights with llama.cpp.
|
| 384 |
+
|
| 385 |
+
### Model Weights

| Repository | Contents | Download Command |
|------|------|---------|
| [`OpenMOSS-Team/MOSS-TTS-GGUF`](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) | Q4_K_M backbone `.gguf`, `embeddings/` (`.npy`), `lm_heads/` (`.npy`), tokenizer | `huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF` |
| [`OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX`](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX) | Encoder & decoder ONNX models | `huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX` |

> **Note:** We do **not** provide prebuilt TensorRT engines, because a TRT engine is tightly bound to the GPU architecture and the TensorRT version. To use TRT, build the engine yourself from the ONNX models; see `moss_audio_tokenizer/trt/build_engine.sh`.
|
| 393 |
+
|
| 394 |
+
### Configuration

Four preset configurations are provided in `configs/llama_cpp/`:

- `default.yaml`: ONNX audio tokenizer + GGUF backbone (recommended starting point)
- `trt.yaml`: TensorRT audio tokenizer + GGUF backbone (maximum throughput; you must build the engine yourself)
- `trt-8gb.yaml`: low-memory mode for GPUs with 8 GB of VRAM (staged loading, TRT audio)
- `cpu-only.yaml`: CPU-only execution (no GPU required)
|
| 402 |
+
|
| 403 |
+
Key configuration options:
- `heads_backend: auto | numpy | torch`: compute backend for the LM heads
- `audio_backend: onnx | trt | torch`: audio tokenizer backend
- `low_memory: true | false`: staged loading for limited VRAM (the encoder, backbone, and decoder are loaded and unloaded stage by stage)
- `kv_cache_type_k / kv_cache_type_v`: KV cache quantization (for example `q8_0`, `q4_0`) to reduce VRAM usage
- `flash_attn: auto | enabled | disabled`: flash attention, used to lower peak VRAM during prefill
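
As an illustration of the option values listed above, the following sketch checks a config dictionary before launching inference. It is not part of the repository: `validate_config` and `ALLOWED_VALUES` are hypothetical names, and the sketch assumes the options appear as flat top-level keys, which may not match the actual layout of the YAML presets.

```python
from typing import Any, Dict, List

# Allowed values copied from the option list above.
ALLOWED_VALUES = {
    "heads_backend": {"auto", "numpy", "torch"},
    "audio_backend": {"onnx", "trt", "torch"},
    "flash_attn": {"auto", "enabled", "disabled"},
}


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []
    for key, allowed in ALLOWED_VALUES.items():
        if key in cfg and cfg[key] not in allowed:
            errors.append(f"{key}={cfg[key]!r} is not one of {sorted(allowed)}")
    if "low_memory" in cfg and not isinstance(cfg["low_memory"], bool):
        errors.append("low_memory must be true or false")
    return errors


good = validate_config(
    {"heads_backend": "auto", "audio_backend": "onnx", "low_memory": True, "flash_attn": "auto"}
)
bad = validate_config({"audio_backend": "cuda", "low_memory": "yes"})
```

A check like this catches a mistyped backend name before any model weights are loaded, which matters most in the low-memory setup where each stage is expensive to reload.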
|
| 409 |
+
|
| 410 |
+
For full documentation, see [moss_tts_delay/llama_cpp/README.md](moss_tts_delay/llama_cpp/README.md).
|
| 411 |
+
|
| 412 |
+
<a id="evaluation"></a>
|
| 413 |
+
## Evaluation

This section summarizes **family-level evaluation highlights** for MOSS‑TTS, MOSS‑TTSD, and MOSS‑VoiceGenerator. For full details, see each model's model card.
|
| 416 |
+
|
| 417 |
+
<a id="eval-moss-tts"></a>
|
| 418 |
+
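### MOSS‑TTS Evaluation
MOSS‑TTS achieves state-of-the-art results on the open-source zero-shot TTS benchmark `Seed‑TTS‑eval`: among open-source models, MossTTSLocal has the highest speaker similarity (SIM) in both English and Chinese, and overall results are comparable to leading closed-source systems.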
|
| 420 |
+
|
| 421 |
+
| Model | Params | Open‑source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---:|:---:|---:|---:|---:|---:|
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 |
| FishAudio‑S1 | 4B | ❌ | 1.72 | 62.57 | 1.22 | 72.1 |
| Seed‑TTS | | ❌ | 2.25 | 76.2 | 1.12 | 79.6 |
| MiniMax‑Speech | | ❌ | 1.65 | 69.2 | 0.83 | 78.3 |
| | | | | | | |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 |
| CosyVoice3 | 0.5B | ✅ | 2.02 | 71.8 | 1.16 | 78 |
| CosyVoice3 | 1.5B | ✅ | 2.22 | 72 | 1.12 | 78.1 |
| F5‑TTS | 0.3B | ✅ | 2 | 67 | 1.53 | 76 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66 |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46 | 1.51 | 63.5 |
| FireRedTTS‑2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 |
| Qwen2.5‑Omni | 7B | ✅ | 2.72 | 63.2 | 1.7 | 75.2 |
| FishAudio‑S1‑mini | 0.5B | ✅ | 1.94 | 55 | 1.18 | 68.5 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 |
| HiggsAudio‑v2 | 3B | ✅ | 2.44 | 67.7 | 1.5 | 74 |
| VoxCPM | 0.5B | ✅ | 1.85 | 72.9 | **0.93** | 77.2 |
| Qwen3‑TTS | 0.6B | ✅ | 1.68 | 70.39 | 1.23 | 76.4 |
| Qwen3‑TTS | 1.7B | ✅ | **1.5** | 71.45 | 1.33 | 76.72 |
| | | | | | | |
| **MossTTSDelay** | **8B** | ✅ | 1.79 | 71.46 | 1.32 | 77.05 |
| **MossTTSLocal** | **1.7B** | ✅ | 1.85 | **73.42** | 1.2 | **78.82** |
|
| 447 |
+
|
| 448 |
+
<a id="eval-moss-ttsd"></a>
|
| 449 |
+
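### MOSS‑TTSD Evaluation
#### Objective Evaluation
We evaluate MOSS‑TTSD-v1.0 with three objective metrics: speaker attribution accuracy (ACC), speaker similarity (SIM), and word error rate (WER). We compare MOSS‑TTSD-v1.0 against several open-source and closed-source models. The results are shown below; MOSS-TTSD-v1.0 achieves the best or second-best performance throughout.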
|
| 452 |
+
|
| 453 |
+
| Model | ZH - SIM | ZH - ACC | ZH - WER | EN - SIM | EN - ACC | EN - WER |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Comparison with Open-Source Models** | | | | | | |
| **MOSS-TTSD-v1.0** | **0.7949** | **0.9587** | **0.0485** | **0.7326** | **0.9626** | 0.0988 |
| MOSS-TTSD-v0.7 | 0.7423 | 0.9391 | 0.0517 | 0.6743 | 0.9266 | 0.1612 |
| Vibevoice 7B | 0.7590 | 0.9222 | 0.0570 | 0.7140 | 0.9554 | **0.0946** |
| Vibevoice 1.5 B | 0.7415 | 0.8798 | 0.0818 | 0.6961 | 0.9353 | 0.1133 |
| FireRedTTS2 | 0.7383 | 0.9022 | 0.0768 | - | - | - |
| Higgs Audio V2 | - | - | - | 0.6860 | 0.9025 | 0.2131 |
| **Comparison with Proprietary Models** | | | | | | |
| **MOSS-TTSD-v1.0 (elevenlabs_voice)** | **0.8165** | **0.9736** | 0.0391 | **0.7304** | **0.9565** | 0.1005 |
| Eleven V3 | 0.6970 | 0.9653 | **0.0363** | 0.6730 | 0.9498 | **0.0824** |
| | | | | | | |
| **MOSS-TTSD-v1.0 (gemini_voice)** | - | - | - | **0.7893** | **0.9655** | 0.0984 |
| gemini-2.5-pro-preview-tts | - | - | - | 0.6786 | 0.9537 | **0.0859** |
| gemini-2.5-flash-preview-tts | - | - | - | 0.7194 | 0.9511 | 0.0871 |
| | | | | | | |
| **MOSS-TTSD-v1.0 (doubao_voice)** | **0.8226** | **0.9630** | 0.0571 | - | - | - |
| Doubao_Podcast | 0.8034 | 0.9606 | **0.0472** | - | - | - |
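
For readers unfamiliar with the WER columns, the sketch below shows the standard definition: word-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. It is a generic illustration only. The ASR system, text normalization, and tokenization behind the numbers above are not specified here, and `word_error_rate` is a hypothetical helper that simply splits on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over whitespace-separated words, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


exact = word_error_rate("the cat sat", "the cat sat")
one_sub = word_error_rate("the cat sat", "the dog sat")
one_del = word_error_rate("a b c", "a b")
```

For Chinese, the same computation is usually applied per character rather than per word, which is why the MOSS-TTS table reports CER for Chinese.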
|
| 472 |
+
|
| 473 |
+
#### Subjective Evaluation
|
| 474 |
+
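For open-source models, annotators score each sample pair along dimensions such as speaker attribution accuracy, voice similarity, prosody, and overall quality. Following the LMSYS Chatbot Arena methodology, we compute Elo ratings and confidence intervals for each dimension.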
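
As background on how pairwise preferences turn into Elo ratings, here is a minimal sketch of the classic online Elo update. It is illustrative only: the starting rating of 1000, `K = 32`, and the function names are assumptions of this sketch, the actual evaluation pipeline may differ in these details, and confidence intervals (often obtained by bootstrapping over the comparisons) are not shown.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise comparison in place; rating points are conserved."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Hypothetical systems "A" and "B", both starting at 1000.
ratings = {"A": 1000.0, "B": 1000.0}
update_elo(ratings, winner="A", loser="B")
```

Because the expected score depends on the current rating gap, a win against a much stronger system moves the ratings more than a win against a weaker one.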
|
| 475 |
+

|
| 476 |
+
|
| 477 |
+
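For closed-source models, annotators simply choose the overall preferred sample in each pair, and win rates are computed from those choices.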
|
| 478 |
+

|
| 479 |
+
|
| 480 |
+
|
| 481 |
+
<a id="eval-moss-voicegenerator"></a>
|
| 482 |
+
### MOSS‑VoiceGenerator Subjective Evaluation
MOSS‑VoiceGenerator shows strong subjective preference in **overall preference**, **instruction following**, and **naturalness**.
|
| 484 |
+
|
| 485 |
+
<p align="center">
|
| 486 |
+
<img src="./assets/moss_voice_generator_winrate.png" width="70%" />
|
| 487 |
+
</p>
|
| 488 |
+
|
| 489 |
+
<a id="audio-tokenizer"></a>
|
| 490 |
+
## Audio Tokenizer
|
| 491 |
+
|
| 492 |
+
<a id="audio-tokenizer-intro"></a>
|
| 493 |
+
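### Introduction
**MOSS-Audio-Tokenizer** is the unified discrete audio interface of the MOSS‑TTS Family. It is based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture: a 1.6-billion-parameter, "CNN-free" homogeneous audio tokenizer built entirely from causal Transformer blocks.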
|
| 495 |
+
|
| 496 |
+
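- **Unified discrete bridge**: Provides a shared backbone for MOSS‑TTS, MOSS‑TTSD, MOSS‑VoiceGenerator, MOSS‑SoundEffect, and MOSS‑TTS‑Realtime, keeping audio representations consistent across the family.
- **Extreme compression with high fidelity**: Compresses 24 kHz raw audio to a very low frame rate of 12.5 Hz. With 32-layer residual vector quantization (RVQ), it supports high-fidelity reconstruction at variable bitrates from 0.125 kbps to 4 kbps.
- **Large-scale general audio training**: Trained from scratch on 3 million hours of diverse data (speech, sound effects, and music), reaching SOTA-level reconstruction among open-source audio tokenizers.
- **Native streaming design**: The pure causal Transformer architecture is built for scalability and low-latency streaming inference, supporting real-time production pipelines.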
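
The quoted bitrate range follows from the frame rate and the number of RVQ layers. The sketch below reproduces the arithmetic under one assumption that is inferred rather than stated here: each RVQ layer emits a 10-bit code (a 1024-entry codebook), which is the value that makes a single layer at 12.5 Hz come out to 0.125 kbps. The constant and function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
BITS_PER_CODE = 10  # assumption: 1024-entry codebooks, inferred from 0.125 kbps at one layer


def bitrate_kbps(n_codebooks: int) -> float:
    """Bitrate when the first n_codebooks RVQ layers are kept."""
    return FRAME_RATE_HZ * BITS_PER_CODE * n_codebooks / 1000.0


samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # audio samples summarized by one frame
lowest = bitrate_kbps(1)    # a single RVQ layer
highest = bitrate_kbps(32)  # all 32 RVQ layers
```

Dropping RVQ layers therefore trades bitrate for fidelity in steps of 0.125 kbps, which matches the way the reconstruction evaluation varies the number of codebooks to control bitrate.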
|
| 500 |
+
|
| 501 |
+
For more configuration options, advanced usage, and evaluation metrics, visit the [MOSS-Audio-Tokenizer repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer).
|
| 502 |
+
|
| 503 |
+
<p align="center">
|
| 504 |
+
<img src="./assets/arch_moss_audio_tokenizer.png" alt="MOSS Audio Tokenizer architecture diagram" width="100%" />
MOSS Audio Tokenizer architecture
|
| 506 |
+
</p>
|
| 507 |
+
|
| 508 |
+
<a id="model-weights"></a>
|
| 509 |
+
### Model Weights
|
| 510 |
+
|
| 511 |
+
| Model | Hugging Face | ModelScope |
|:-----:|:------------:|:----------:|
| **MOSS-Audio-Tokenizer** | [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) | [ModelScope](https://modelscope.cn/models/openmoss/MOSS-Audio-Tokenizer) |
|
| 514 |
+
|
| 515 |
+
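<a id="重建质量客观评测"></a>
### Objective Reconstruction Evaluation

On the LibriSpeech test-clean subset, we compare **MOSS Audio Tokenizer** with several open-source audio tokenizers on SIM, STOI, PESQ-NB, and PESQ-WB, controlling the bitrate by varying the number of RVQ codebooks. Across the 0–4 kbps range, MOSS Audio Tokenizer leads other open-source audio tokenizers in reconstruction quality.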
|
| 518 |
+
|
| 519 |
+
<p align="center">
|
| 520 |
+
<img src="./assets/evaluation_moss_audio_tokenizer.png" alt="LibriSpeech objective metrics for audio tokenizers" width="90%" />
|
| 521 |
+
</p>
|
| 522 |
+
|
| 523 |
+
|
| 524 |
+
## License

Models in the MOSS-TTS Family are released under the Apache License 2.0.
|
| 527 |
+
|
| 528 |
+
## Citation
|
| 529 |
+
|
| 530 |
+
```bibtex
|
| 531 |
+
```
|
| 532 |
+
## Star History
|
| 533 |
+
|
| 534 |
+
[](https://www.star-history.com/#OpenMOSS/MOSS-TTS&type=date&legend=top-left)
|
assets/OpenMOSS_Logo.png
ADDED
|
assets/VS_Open-Source_Models.jpg
ADDED
|
Git LFS Details
|
assets/VS_Proprietary_Models.png
ADDED
|
Git LFS Details
|
assets/arch_moss_audio_tokenizer.png
ADDED
|
Git LFS Details
|
assets/archi_delay.png
ADDED
|
Git LFS Details
|
assets/archi_local.png
ADDED
|
Git LFS Details
|
assets/audio/reference_02_s1.wav
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1816ab428334ba2de49dcb8b0a10e17eb1835f7f1f7bcda13504e88f46bed1e8
|
| 3 |
+
size 249284
|
assets/audio/reference_02_s2.wav
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3bfaef0dd23b0f382630e516a7a8512bed4a16fb07aa9b9c435d2e6e7b0b9215
|
| 3 |
+
size 1134296
|
assets/audio/reference_en.m4a
ADDED
|
Binary file (83.9 kB). View file
|
|
|
assets/audio/reference_en_0.mp3
ADDED
|
Binary file (90.6 kB). View file
|
|
|
assets/audio/reference_en_1.mp3
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:82e4fad862ccb12a4ac609623fe6c275d167f7dcfc7866ca740f38ab169935c6
|
| 3 |
+
size 213836
|
assets/audio/reference_en_2.mp3
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:582f8e1e3f3792d7b495e159c29a55bb95c4c46e90725c62807b4b12bf341603
|
| 3 |
+
size 322923
|
assets/audio/reference_en_3.mp3
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:4bd9c6ffb765fda23297fae21725bc174a3092d9687c3606f11d00ae0df9fc1e
|
| 3 |
+
size 107943
|
assets/audio/reference_zh.wav
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:e5112b5e2bef2a727534af85da1e56048a5ab5552de7aa7cbb5f48b0fa4f5eec
|
| 3 |
+
size 448172
|
assets/audio/reference_zh_0.wav
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:e5112b5e2bef2a727534af85da1e56048a5ab5552de7aa7cbb5f48b0fa4f5eec
|
| 3 |
+
size 448172
|
assets/audio/reference_zh_1.wav
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f4ff19c55d55a37dbbd550e6624a2faf6cfa7fd56a9594456b17fbe3838b2245
|
| 3 |
+
size 1480128
|
assets/audio/reference_zh_2.wav
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:e1686d3e2b1fe2f6b079cf6a41a9cd9ba31c8f9d3cfe03ff411dd0359641c0c8
|
| 3 |
+
size 505586
|
assets/audio/reference_zh_3.mp3
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:cffa7c5d91c28895caf51c38418af9651c82a4e16a8e4c04e10991bf80cc04cc
|
| 3 |
+
size 347949
|
assets/evaluation_moss_audio_tokenizer.png
ADDED
|
Git LFS Details
|
assets/mosi-logo.png
ADDED
|
assets/moss_tts_family.jpeg
ADDED
|
Git LFS Details
|
assets/moss_tts_realtime.jpeg
ADDED
|
Git LFS Details
|
assets/moss_voice_generator_winrate.png
ADDED
|
Git LFS Details
|
assets/text/moss_tts_example_texts.jsonl
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{"id":"zh/0","language":"zh","role":"可爱的小女孩","text":"亲爱的你,\n你好呀。\n今天,我想用最认真、最温柔的声音,对你说一些重要的话。\n这些话,像一颗小小的星星,希望能在你的心里慢慢发光。"}
{"id":"zh/1","language":"zh","role":"吴俊全老师","text":"从1948年9月12日至1949年1月31日,连续组织了震惊世界的辽沈、淮海、平津三个大战役,这一百四十二个昼夜中,双方统帅部和各级指挥部所拍发的电码讯号错综交汇,织成一面无形的网,从大气层覆盖下来,于是便注定了中国的山川将会怎样排列,流云又当如何变幻。"}
{"id":"zh/2","language":"zh","role":"原神胡桃","text":"嘿——你在听吗?\n嗯,不说话也没关系啦,反正我已经习惯自言自语了。\n我是胡桃,往生堂第七十七代堂主。\n别紧张别紧张,我今天不是来“请你喝茶”的——至少现在还不是。\n很多人一听到“往生堂”,就皱起眉头,好像我一开口,空气都要凉三分。\n可你看啊,太阳每天都会落下,可谁会因此不喜欢黄昏呢?\n生与死也是一样的道理嘛——\n不是终点,而是换一条路走走。"}
{"id":"zh/3","language":"zh","role":"明星杨幂","text":"有些人喜欢被照顾,\n而我更习惯照亮自己。\n\n不是不需要依靠,\n只是明白——\n真正能陪你走到最后的,\n从来都不是运气。\n\n我见过凌晨四点的城市,\n也见过掌声散去后的安静。\n那些看起来毫不费力的从容,\n其实都藏着一次次咬牙坚持。"}
{"id":"en/0","language":"en","role":"Taylor Swift","text":"Tonight, I just want to take a second and breathe this in with you.\nBecause moments like this don’t happen by accident. They’re built—one lyric at a time, one late night at a time, one brave decision at a time. They’re built by people who keep showing up, even when life is loud, even when the world is heavy, even when they’re not sure anyone sees the effort they’re making."}
{"id":"en/1","language":"en","role":"Iron Man","text":"Look, I know what you’re thinking. Here he goes again. The guy in the metal suit, the walking ego with a repulsor problem, about to make a speech like it’s a press conference and I’m getting paid by the syllable. Relax. This one isn’t for the cameras. No sponsors, no applause, no clever angle that makes me look taller than I already am."}
{"id":"en/2","language":"en","role":"David Attenborough","text":"In the quiet hours before dawn, the world looks unfinished. Streets are empty, windows are dark, and the air holds its breath as if waiting for a cue. But beneath the stillness, everything is moving. Water is traveling through pipes. Electricity is humming along invisible lines. Seeds are pushing against soil. Somewhere, a hand reaches for a switch, and a day begins."}
{"id":"en/3","language":"en","role":"Rick Sanchez","text":"Look, you keep staring at the sky like it’s a customer service desk, waiting for the universe to hand you a receipt that says your pain was “worth it.” Newsflash: the cosmos doesn’t do refunds, it does entropy. It does random collisions of atoms that occasionally arrange themselves into a biped with anxiety and a subscription to self-importance."}
|
assets/text/moss_voice_generator_example_texts.jsonl
ADDED
|
@@ -0,0 +1,8 @@
{"id":"zh/0","language":"zh","instruction":"撕心裂肺,声泪俱下的中年女性","text":"皇上,臣妾做不到啊!皇上,您就杀了臣妾吧!"}
{"id":"zh/1","language":"zh","instruction":"年轻女性,开头傲慢不屑,发现对方身份后秒怂,疯狂道歉,惊慌失措","text":"你谁啊,关你什么事?啊…王总,您好您好,我不知道是您……"}
{"id":"zh/2","language":"zh","instruction":"疲惫沙哑的老年声音缓慢抱怨,带有轻微呻吟。","text":"哎呀,我的老腰啊,这年纪大了就是不行了。"}
{"id":"zh/3","language":"zh","instruction":"粗犷急躁的海盗船长,语速快,语调低沉而充满命令,带着一股不容置疑的霸道。","text":"快点!把那箱金币搬过来!速度快点!别磨磨蹭蹭的!我们必须在涨潮之前离开这里,否则就来不及了!"}
{"id":"en/0","language":"en","instruction":"Mom scolding kid for breaking a vase, then seeing he cut himself, shifting to concern","text":"How many times have I told you not to run in the house?! You could have…… oh honey, you're bleeding! Let me see your hand…… It's okay, baby."}
{"id":"en/1","language":"en","instruction":"An elderly female voice, slightly nasal and soft, speaking in a frail, polite British tone, conveying subtle discomfort with gentle hesitation.","text":"Achoo! Oh dear, I do believe I'm catching a cold. This dreadful weather is just too much."}
{"id":"en/2","language":"en","instruction":"Little girl, innocent and curious, high-pitched and adorable","text":"Mommy, why is the sky blue? And why do birds fly? And why-"}
{"id":"en/3","language":"en","instruction":"Emotional pop ballad with smooth, melodic delivery, slow tempo with gentle vibrato on sustained notes, conveying hope and vulnerability.","text":"Walking down this empty street tonight, searching for a guiding light, stars above shine oh so bright, everything will be alright"}
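Both example files above are JSON Lines: one JSON object per line, with `id`, `language`, and `text`, plus either a `role` field (the MOSS-TTS examples) or an `instruction` field (the voice generator examples). A minimal sketch of reading such records and summarizing them follows. The helpers `load_jsonl` and `describe` are hypothetical names, and the two sample lines are copied from the voice generator file.

```python
import json
from collections import Counter


def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def describe(records):
    """Count records per language and report which prompt field they use."""
    per_language = Counter(r["language"] for r in records)
    prompt_fields = {k for r in records for k in ("role", "instruction") if k in r}
    return per_language, prompt_fields


SAMPLE = [
    '{"id":"zh/2","language":"zh","instruction":"疲惫沙哑的老年声音缓慢抱怨,带有轻微呻吟。","text":"哎呀,我的老腰啊,这年纪大了就是不行了。"}',
    '{"id":"en/2","language":"en","instruction":"Little girl, innocent and curious, high-pitched and adorable","text":"Mommy, why is the sky blue? And why do birds fly? And why-"}',
]

records = load_jsonl(SAMPLE)
per_language, prompt_fields = describe(records)
print(per_language, prompt_fields)
```

To read a file from disk instead of the inline sample, pass an open file handle (opened with `encoding="utf-8"`) to `load_jsonl`, since iterating a text file yields its lines.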
|
assets/wechat.jpg
ADDED
|
benchmark_harness.py
ADDED
|
@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""
Comprehensive benchmark harness for MOSS-TTS realtime optimization.

Captures:
- TTFB (time to first byte)
- Real-time factor (RTF)
- End-to-end latency
- Peak VRAM and steady-state VRAM
- Per-stage profiling metrics
- Audio validation via Whisper API
"""

import os
import sys
import json
import time
import subprocess
import argparse
from pathlib import Path
from typing import Optional, Dict, Any, List, Tuple
import statistics

import requests
import numpy as np

WHISPER_ENDPOINT = os.getenv("WHISPER_ENDPOINT", "http://100.85.200.54:5092/v1")
TTS_SERVER_HOST = os.getenv("TTS_SERVER_HOST", "http://localhost:8012")

# Test prompts: short, medium, long
TEST_PROMPTS = {
    "short": "The quick brown fox jumps over the lazy dog.",
    "medium": "Artificial intelligence is transforming how we live, work, and interact. From healthcare to education, AI systems are becoming integral to modern society. Yet, as these systems grow more powerful, the need for safety, transparency, and responsible deployment becomes increasingly critical.",
    "long": "The vast majority of computational models today operate through neural networks that attempt to approximate the behavior of biological brains. These systems are trained on enormous datasets, learning patterns through iterative optimization. However, the mechanisms by which they achieve their remarkable performance remain largely opaque—a phenomenon often described as the 'black box problem'. This opacity creates challenges in critical applications like healthcare and autonomous systems, where understanding model decisions is essential for safety and liability. Recent advances in explainable AI, mechanistic interpretability, and formal verification aim to address these concerns, but significant open questions remain about whether we can ever fully understand the decision-making processes of deep neural networks.",
}

class BenchmarkResult:
    """Encapsulates a single benchmark run."""

    def __init__(self, prompt_type: str, prompt_text: str):
        self.prompt_type = prompt_type
        self.prompt_text = prompt_text
        self.ttfb_sec = None
        self.total_latency_sec = None
        self.audio_duration_sec = None
        self.rtf = None
        self.audio_bytes = None
        self.stage_timings = {}
        self.response_headers = {}
        self.whisper_transcript = None
        self.similarity_score = None
        self.error = None
        self.timestamp = time.time()

    def to_dict(self) -> Dict[str, Any]:
        return {
            "prompt_type": self.prompt_type,
            "prompt_length": len(self.prompt_text),
            "ttfb_sec": self.ttfb_sec,
            "total_latency_sec": self.total_latency_sec,
            "audio_duration_sec": self.audio_duration_sec,
            "rtf": self.rtf,
            "audio_bytes": self.audio_bytes,
            "stage_timings": self.stage_timings,
            "whisper_transcript": self.whisper_transcript,
            "similarity_score": self.similarity_score,
            "error": self.error,
        }

def extract_stage_timings(headers: Dict[str, str]) -> Dict[str, float]:
    """Extract profiling timings from response headers."""
    timings = {}
    for key, value in headers.items():
        if key.lower().startswith("x-moss-stage-"):
            stage_name = key.lower().replace("x-moss-stage-", "").replace("-time-ms", "")
            try:
                timings[stage_name] = float(value) / 1000.0  # Convert ms to sec
            except (ValueError, TypeError):
                pass
        elif key.lower().startswith("x-moss-infer-"):
            infer_name = key.lower().replace("x-moss-infer-", "").replace("-ms", "")
            try:
                timings[infer_name] = float(value) / 1000.0
            except (ValueError, TypeError):
                pass
    return timings

def estimate_audio_duration(audio_bytes: int, sample_rate: int = 24000, channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Estimate audio duration from byte count."""
    if bytes_per_sample <= 0:
        return 0.0
    samples = audio_bytes / (channels * bytes_per_sample)
    return samples / sample_rate

def transcribe_audio_with_whisper(audio_data: bytes, format_type: str = "pcm") -> Optional[str]:
    """Send audio to remote Whisper endpoint for transcription."""
    try:
        # PCM data needs to be wrapped in WAV format for Whisper
        if format_type == "pcm":
            import struct
            sample_rate = 24000
            channels = 1
            wav_data = create_wav_from_pcm(audio_data, sample_rate, channels)
        else:
            wav_data = audio_data

        files = {"file": ("audio.wav", wav_data, "audio/wav")}
        response = requests.post(
            f"{WHISPER_ENDPOINT}/audio/transcriptions",
            files=files,
            data={"model": "whisper-1"},
            timeout=60
        )
        response.raise_for_status()
        result = response.json()
        return result.get("text", "").strip()
    except Exception as e:
        print(f"Whisper transcription failed: {e}")
        return None

def create_wav_from_pcm(pcm_data: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap PCM audio in WAV container."""
    import struct

    bytes_per_sample = 2  # 16-bit
    num_samples = len(pcm_data) // (channels * bytes_per_sample)
    byte_rate = sample_rate * channels * bytes_per_sample
    block_align = channels * bytes_per_sample

    wav_header = b"RIFF"
    wav_header += struct.pack("<I", 36 + len(pcm_data))
    wav_header += b"WAVE"
    wav_header += b"fmt "
    wav_header += struct.pack("<I", 16)
    wav_header += struct.pack("<H", 1)  # PCM format
    wav_header += struct.pack("<H", channels)
    wav_header += struct.pack("<I", sample_rate)
    wav_header += struct.pack("<I", byte_rate)
    wav_header += struct.pack("<H", block_align)
    wav_header += struct.pack("<H", 16)  # Bits per sample
    wav_header += b"data"
    wav_header += struct.pack("<I", len(pcm_data))

    return wav_header + pcm_data

def similarity_score(text1: str, text2: str) -> float:
    """Simple word overlap similarity (0-1)."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    if not words1 or not words2:
        return 0.0
    intersection = len(words1 & words2)
    union = len(words1 | words2)
    return intersection / union if union > 0 else 0.0

def run_benchmark(prompt_type: str, prompt_text: str, validate_audio: bool = False) -> BenchmarkResult:
    """Run a single benchmark iteration."""
    result = BenchmarkResult(prompt_type, prompt_text)

    try:
        # Stream audio from TTS server
        start_time = time.time()
        first_byte_time = None
        audio_data = b""

        response = requests.post(
            f"{TTS_SERVER_HOST}/v1/audio/speech",
            json={
                "input": prompt_text,
                "response_format": "pcm",
                "stream": True
            },
            stream=True,
            timeout=120
        )
        response.raise_for_status()
        result.response_headers = dict(response.headers)

        for chunk in response.iter_content(chunk_size=8192):
            if first_byte_time is None:
                first_byte_time = time.time()
            audio_data += chunk

        end_time = time.time()
        result.ttfb_sec = first_byte_time - start_time if first_byte_time else 0
        result.total_latency_sec = end_time - start_time
        result.audio_bytes = len(audio_data)

        # Extract stage timings from headers
        result.stage_timings = extract_stage_timings(result.response_headers)

        # Estimate audio duration and compute RTF
        result.audio_duration_sec = estimate_audio_duration(result.audio_bytes)
        if result.total_latency_sec > 0 and result.audio_duration_sec > 0:
            result.rtf = result.total_latency_sec / result.audio_duration_sec

        # Optionally validate with Whisper
        if validate_audio:
            result.whisper_transcript = transcribe_audio_with_whisper(audio_data)
            if result.whisper_transcript:
                result.similarity_score = similarity_score(prompt_text, result.whisper_transcript)

    except Exception as e:
        result.error = str(e)
        print(f"Error during benchmark: {e}")

    return result

def run_full_benchmark(validate_audio: bool = False, runs_per_prompt: int = 3) -> Dict[str, Any]:
    """Run full benchmark suite."""
    all_results = []
    summary = {
        "timestamp": time.time(),
        "server_url": TTS_SERVER_HOST,
        "whisper_endpoint": WHISPER_ENDPOINT if validate_audio else "disabled",
        "results_by_type": {},
    }

    for prompt_type, prompt_text in TEST_PROMPTS.items():
        print(f"\n=== Benchmarking {prompt_type} prompt ({len(prompt_text)} chars) ===")
        prompt_results = []

        for i in range(runs_per_prompt):
            print(f" Run {i+1}/{runs_per_prompt}...", end=" ", flush=True)
            result = run_benchmark(prompt_type, prompt_text, validate_audio)
            prompt_results.append(result)
            all_results.append(result)

            if result.error:
                print(f"ERROR: {result.error}")
            else:
                print(f"TTFB={result.ttfb_sec:.3f}s RTF={(f'{result.rtf:.2f}x' if result.rtf is not None else 'n/a')} TotalTime={result.total_latency_sec:.3f}s")

            time.sleep(0.5)  # Brief pause between runs

        # Summarize this prompt type
        successful_results = [r for r in prompt_results if not r.error]
        if successful_results:
            ttfbs = [r.ttfb_sec for r in successful_results if r.ttfb_sec is not None]
            rtfs = [r.rtf for r in successful_results if r.rtf is not None]
            latencies = [r.total_latency_sec for r in successful_results]
            similarities = [r.similarity_score for r in successful_results if r.similarity_score is not None]

            prompt_summary = {
                "runs": len(prompt_results),
                "successful": len(successful_results),
                "ttfb": {
                    "mean_sec": statistics.mean(ttfbs) if ttfbs else None,
                    "p50_sec": sorted(ttfbs)[len(ttfbs)//2] if ttfbs else None,
                    "p95_sec": sorted(ttfbs)[int(len(ttfbs)*0.95)] if ttfbs else None,
                },
                "rtf": {
                    "mean": statistics.mean(rtfs) if rtfs else None,
                    "p50": sorted(rtfs)[len(rtfs)//2] if rtfs else None,
                    "p95": sorted(rtfs)[int(len(rtfs)*0.95)] if rtfs else None,
                },
                "latency": {
                    "mean_sec": statistics.mean(latencies) if latencies else None,
                    "min_sec": min(latencies) if latencies else None,
                    "max_sec": max(latencies) if latencies else None,
                },
                "similarity_score": {
                    "mean": statistics.mean(similarities) if similarities else None,
                } if similarities else None,
            }
            summary["results_by_type"][prompt_type] = prompt_summary

            # Print stage breakdown for first successful run
            if successful_results[0].stage_timings:
                print(f" Stage breakdown (first run):")
                for stage, time_sec in sorted(successful_results[0].stage_timings.items()):
                    print(f" {stage}: {time_sec*1000:.1f}ms")

    summary["all_results"] = [r.to_dict() for r in all_results]
    return summary

def main():
    parser = argparse.ArgumentParser(description="Benchmark MOSS-TTS realtime optimizations")
    parser.add_argument("--validate-audio", action="store_true", help="Validate generated audio via Whisper API")
    parser.add_argument("--runs", type=int, default=3, help="Runs per prompt type")
    parser.add_argument("--output", type=str, default=None, help="Save results to JSON file")
    args = parser.parse_args()

    print("Starting MOSS-TTS Realtime Benchmark Suite")
    print(f"TTS Server: {TTS_SERVER_HOST}")
    print(f"Whisper Endpoint: {WHISPER_ENDPOINT if args.validate_audio else 'disabled'}")

    results = run_full_benchmark(validate_audio=args.validate_audio, runs_per_prompt=args.runs)

    # Print summary
    print("\n" + "="*60)
    print("BENCHMARK SUMMARY")
    print("="*60)
    for prompt_type, summary in results["results_by_type"].items():
        print(f"\n{prompt_type.upper()}:")
        print(f" TTFB (mean): {summary['ttfb']['mean_sec']*1000:.1f}ms")
        print(" RTF (mean): " + (f"{summary['rtf']['mean']:.2f}x" if summary['rtf']['mean'] is not None else "n/a"))
        print(f" Latency: {summary['latency']['mean_sec']:.3f}s")
        if summary.get("similarity_score"):
            print(f" Whisper sim: {summary['similarity_score']['mean']:.2%}")

    if args.output:
        with open(args.output, "w") as f:
            json.dump(results, f, indent=2)
        print(f"\nResults saved to {args.output}")

    return 0

if __name__ == "__main__":
    sys.exit(main())
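The harness derives audio duration from the byte count of the streamed response and then divides wall-clock latency by that duration to get the real-time factor. The arithmetic can be sketched in isolation as below, assuming (as the harness defaults do) headerless 16-bit mono PCM at 24 kHz, which is 48,000 bytes per second of audio. The function names `pcm_duration_sec` and `real_time_factor` are illustrative only and do not appear in the repository.

```python
def pcm_duration_sec(num_bytes, sample_rate=24000, channels=1, bytes_per_sample=2):
    """Duration of raw PCM audio: bytes -> samples per channel -> seconds."""
    samples = num_bytes / (channels * bytes_per_sample)
    return samples / sample_rate


def real_time_factor(latency_sec, audio_sec):
    """RTF = wall-clock time / audio duration; below 1.0 is faster than real time."""
    if audio_sec <= 0:
        return None  # No audio received, so RTF is undefined.
    return latency_sec / audio_sec


# 240,000 bytes of 16-bit mono audio at 24 kHz is 120,000 samples.
audio_sec = pcm_duration_sec(240_000)
# Generating that audio in 2.5 s of wall-clock time.
rtf = real_time_factor(2.5, audio_sec)
print(audio_sec, rtf)
```

If the server ever returns a container format (for example WAV with a header) or a different sample rate, the byte-count estimate is off, so the duration should then come from the decoded audio instead.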
|
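The `similarity_score` function in the harness lowercases both texts and splits on whitespace, so punctuation stays attached to words: `dog.` in the prompt does not match `dog` in a transcript. The sketch below contrasts that rule with a variant that strips punctuation first. This variant is a suggestion rather than what the harness does, and `normalized_similarity` is a hypothetical name.

```python
import re


def jaccard(words1, words2):
    """Jaccard index of two word sets, 0.0 when either set is empty."""
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)


def raw_similarity(a, b):
    """Same rule as the harness: lowercase and split on whitespace."""
    return jaccard(set(a.lower().split()), set(b.lower().split()))


def normalized_similarity(a, b):
    """Variant: keep only runs of word characters and apostrophes."""
    def tokenize(s):
        return set(re.findall(r"[\w']+", s.lower()))
    return jaccard(tokenize(a), tokenize(b))


prompt = "The quick brown fox jumps over the lazy dog."
transcript = "the quick brown fox jumps over the lazy dog"

raw = raw_similarity(prompt, transcript)
normalized = normalized_similarity(prompt, transcript)
print(raw, normalized)
```

On this pair the whitespace rule gives 7 shared words out of 9 distinct tokens, while the normalized variant treats the texts as identical. Whether punctuation differences should count against the score is a design choice for the benchmark.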
clis/moss_sound_effect_app.py
ADDED
|
@@ -0,0 +1,347 @@
| 1 |
+
import argparse
|
| 2 |
+
import functools
|
| 3 |
+
import importlib.util
|
| 4 |
+
import time
|
| 5 |
+
|
| 6 |
+
import gradio as gr
|
| 7 |
+
import numpy as np
|
| 8 |
+
import torch
|
| 9 |
+
from transformers import AutoModel, AutoProcessor
|
| 10 |
+
|
| 11 |
+
# Disable the broken cuDNN SDPA backend
|
| 12 |
+
torch.backends.cuda.enable_cudnn_sdp(False)
|
| 13 |
+
# Keep these enabled as fallbacks
|
| 14 |
+
torch.backends.cuda.enable_flash_sdp(True)
|
| 15 |
+
torch.backends.cuda.enable_mem_efficient_sdp(True)
|
| 16 |
+
torch.backends.cuda.enable_math_sdp(True)
|
| 17 |
+
|
| 18 |
+
MODEL_PATH = "OpenMOSS-Team/MOSS-SoundEffect"
|
| 19 |
+
DEFAULT_ATTN_IMPLEMENTATION = "auto"
|
| 20 |
+
DEFAULT_MAX_NEW_TOKENS = 4096
|
| 21 |
+
TOKENS_PER_SECOND = 12.5
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
@functools.lru_cache(maxsize=1)
|
| 25 |
+
def load_backend(model_path: str, device_str: str, attn_implementation: str):
|
| 26 |
+
device = torch.device(device_str if torch.cuda.is_available() else "cpu")
|
| 27 |
+
dtype = torch.bfloat16 if device.type == "cuda" else torch.float32
|
| 28 |
+
resolved_attn_implementation = resolve_attn_implementation(
|
| 29 |
+
requested=attn_implementation,
|
| 30 |
+
device=device,
|
| 31 |
+
dtype=dtype,
|
| 32 |
+
)
|
| 33 |
+
|
| 34 |
+
processor = AutoProcessor.from_pretrained(
|
| 35 |
+
model_path,
|
| 36 |
+
trust_remote_code=True,
|
| 37 |
+
)
|
| 38 |
+
if hasattr(processor, "audio_tokenizer"):
|
| 39 |
+
processor.audio_tokenizer = processor.audio_tokenizer.to(device)
|
| 40 |
+
|
| 41 |
+
model_kwargs = {
|
| 42 |
+
"trust_remote_code": True,
|
| 43 |
+
"torch_dtype": dtype,
|
| 44 |
+
}
|
| 45 |
+
if resolved_attn_implementation:
|
| 46 |
+
model_kwargs["attn_implementation"] = resolved_attn_implementation
|
| 47 |
+
|
| 48 |
+
model = AutoModel.from_pretrained(model_path, **model_kwargs).to(device)
|
| 49 |
+
model.eval()
|
| 50 |
+
|
| 51 |
+
sample_rate = int(getattr(processor.model_config, "sampling_rate", 24000))
|
| 52 |
+
return model, processor, device, sample_rate
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
def resolve_attn_implementation(requested: str, device: torch.device, dtype: torch.dtype) -> str | None:
|
| 56 |
+
requested_norm = (requested or "").strip().lower()
|
| 57 |
+
|
| 58 |
+
if requested_norm in {"none"}:
|
| 59 |
+
return None
|
| 60 |
+
|
| 61 |
+
if requested_norm not in {"", "auto"}:
|
| 62 |
+
return requested
|
| 63 |
+
|
| 64 |
+
# Prefer FlashAttention 2 when package + device conditions are met.
|
| 65 |
+
if (
|
| 66 |
+
device.type == "cuda"
|
| 67 |
+
and importlib.util.find_spec("flash_attn") is not None
|
| 68 |
+
and dtype in {torch.float16, torch.bfloat16}
|
| 69 |
+
):
|
| 70 |
+
major, _ = torch.cuda.get_device_capability(device)
|
| 71 |
+
if major >= 8:
|
| 72 |
+
return "flash_attention_2"
|
| 73 |
+
|
| 74 |
+
# CUDA fallback: use PyTorch SDPA kernels.
|
| 75 |
+
if device.type == "cuda":
|
| 76 |
+
return "sdpa"
|
| 77 |
+
|
| 78 |
+
# CPU fallback.
|
| 79 |
+
return "eager"
|
| 80 |
+
|
| 81 |
+
|
| 82 |
+
def build_conversation(ambient_sound: str, duration_seconds: float, processor):
|
| 83 |
+
ambient_sound = (ambient_sound or "").strip()
|
| 84 |
+
if not ambient_sound:
|
| 85 |
+
raise ValueError("Please enter an ambient sound description.")
|
| 86 |
+
|
| 87 |
+
expected_tokens = max(1, int(float(duration_seconds) * TOKENS_PER_SECOND))
|
| 88 |
+
user_kwargs = {
|
| 89 |
+
"ambient_sound": ambient_sound,
|
| 90 |
+
"tokens": expected_tokens,
|
| 91 |
+
}
|
| 92 |
+
|
| 93 |
+
return [[processor.build_user_message(**user_kwargs)]], expected_tokens
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
def run_inference(
|
| 97 |
+
ambient_sound: str,
|
| 98 |
+
duration_seconds: float,
|
| 99 |
+
temperature: float,
|
| 100 |
+
top_p: float,
|
| 101 |
+
top_k: int,
|
| 102 |
+
repetition_penalty: float,
|
| 103 |
+
max_new_tokens: int,
|
| 104 |
+
model_path: str,
|
| 105 |
+
device: str,
|
| 106 |
+
attn_implementation: str,
|
| 107 |
+
):
|
| 108 |
+
started_at = time.monotonic()
|
| 109 |
+
model, processor, torch_device, sample_rate = load_backend(
|
| 110 |
+
model_path=model_path,
|
| 111 |
+
device_str=device,
|
| 112 |
+
attn_implementation=attn_implementation,
|
| 113 |
+
)
|
| 114 |
+
|
| 115 |
+
conversations, expected_tokens = build_conversation(
|
| 116 |
+
ambient_sound=ambient_sound,
|
| 117 |
+
duration_seconds=duration_seconds,
|
| 118 |
+
processor=processor,
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
batch = processor(conversations, mode="generation")
|
| 122 |
+
input_ids = batch["input_ids"].to(torch_device)
|
| 123 |
+
attention_mask = batch["attention_mask"].to(torch_device)
|
| 124 |
+
|
| 125 |
+
with torch.no_grad():
|
| 126 |
+
outputs = model.generate(
|
| 127 |
+
input_ids=input_ids,
|
| 128 |
+
attention_mask=attention_mask,
|
| 129 |
+
max_new_tokens=int(max_new_tokens),
|
| 130 |
+
audio_temperature=float(temperature),
|
| 131 |
+
audio_top_p=float(top_p),
|
| 132 |
+
audio_top_k=int(top_k),
|
| 133 |
+
audio_repetition_penalty=float(repetition_penalty),
|
| 134 |
+
)
|
| 135 |
+
|
| 136 |
+
messages = processor.decode(outputs)
|
| 137 |
+
if not messages or messages[0] is None:
|
| 138 |
+
raise RuntimeError("The model did not return a decodable audio result.")
|
| 139 |
+
|
| 140 |
+
audio = messages[0].audio_codes_list[0]
|
| 141 |
+
if isinstance(audio, torch.Tensor):
|
| 142 |
+
audio_np = audio.detach().float().cpu().numpy()
|
| 143 |
+
else:
|
| 144 |
+
audio_np = np.asarray(audio, dtype=np.float32)
|
| 145 |
+
|
| 146 |
+
if audio_np.ndim > 1:
|
| 147 |
+
audio_np = audio_np.reshape(-1)
|
| 148 |
+
audio_np = audio_np.astype(np.float32, copy=False)
|
| 149 |
+
|
| 150 |
+
elapsed = time.monotonic() - started_at
|
| 151 |
+
status = (
|
| 152 |
+
f"Done | elapsed: {elapsed:.2f}s | "
|
| 153 |
+
f"duration_seconds={float(duration_seconds):.0f}, expected_tokens={int(expected_tokens)}, "
|
| 154 |
+
f"max_new_tokens={int(max_new_tokens)}, "
|
| 155 |
+
f"audio_temperature={float(temperature):.2f}, audio_top_p={float(top_p):.2f}, "
|
| 156 |
+
f"audio_top_k={int(top_k)}, audio_repetition_penalty={float(repetition_penalty):.2f}"
|
| 157 |
+
)
|
| 158 |
+
return (sample_rate, audio_np), status
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
def build_demo(args: argparse.Namespace):
|
| 162 |
+
custom_css = """
|
| 163 |
+
:root {
|
| 164 |
+
--bg: #f6f7f8;
|
| 165 |
+
--panel: #ffffff;
|
| 166 |
+
--ink: #111418;
|
| 167 |
+
--muted: #4d5562;
|
| 168 |
+
--line: #e5e7eb;
|
| 169 |
+
--accent: #0f766e;
|
| 170 |
+
}
|
| 171 |
+
.gradio-container {
|
| 172 |
+
background: linear-gradient(180deg, #f7f8fa 0%, #f3f5f7 100%);
|
| 173 |
+
color: var(--ink);
|
| 174 |
+
}
|
| 175 |
+
.app-card {
|
| 176 |
+
border: 1px solid var(--line);
|
| 177 |
+
border-radius: 16px;
|
| 178 |
+
background: var(--panel);
|
| 179 |
+
padding: 14px;
|
| 180 |
+
}
|
| 181 |
+
.app-title {
|
| 182 |
+
font-size: 22px;
|
| 183 |
+
font-weight: 700;
|
| 184 |
+
margin-bottom: 6px;
|
| 185 |
+
letter-spacing: 0.2px;
|
| 186 |
+
}
|
| 187 |
+
.app-subtitle {
|
| 188 |
+
color: var(--muted);
|
| 189 |
+
font-size: 14px;
|
| 190 |
+
margin-bottom: 8px;
|
| 191 |
+
}
|
| 192 |
+
#output_audio {
|
| 193 |
+
padding-bottom: 12px;
|
| 194 |
+
margin-bottom: 8px;
|
| 195 |
+
overflow: hidden !important;
|
| 196 |
+
}
|
| 197 |
+
#output_audio > .wrap {
|
| 198 |
+
overflow: hidden !important;
|
| 199 |
+
}
|
| 200 |
+
#output_audio audio {
|
| 201 |
+
margin-bottom: 6px;
|
| 202 |
+
}
|
| 203 |
+
#run-btn {
|
| 204 |
+
background: var(--accent);
|
| 205 |
+
border: none;
|
| 206 |
+
}
|
| 207 |
+
"""
|
| 208 |
+
|
    with gr.Blocks(title="MOSS-SoundEffect Demo", css=custom_css) as demo:
        gr.Markdown(
            """
            <div class="app-card">
              <div class="app-title">MOSS-SoundEffect</div>
              <div class="app-subtitle">Generate ambient sounds and sound effects from text descriptions.</div>
            </div>
            """
        )

        with gr.Row(equal_height=False):
            with gr.Column(scale=3):
                ambient_sound = gr.Textbox(
                    label="Ambient Sound Description",
                    lines=8,
                    placeholder="Example: Thunder rolls in the distance while heavy rain falls on a metal roof.",
                )
                duration_seconds = gr.Slider(
                    minimum=1,
                    maximum=60,
                    step=1,
                    value=10,
                    label="Duration (seconds)",
                )

                with gr.Accordion("Sampling Parameters (Audio)", open=True):
                    temperature = gr.Slider(
                        minimum=0.1,
                        maximum=3.0,
                        step=0.05,
                        value=1.5,
                        label="temperature",
                    )
                    top_p = gr.Slider(
                        minimum=0.1,
                        maximum=1.0,
                        step=0.01,
                        value=0.6,
                        label="top_p",
                    )
                    top_k = gr.Slider(
                        minimum=1,
                        maximum=200,
                        step=1,
                        value=50,
                        label="top_k",
                    )
                    repetition_penalty = gr.Slider(
                        minimum=0.8,
                        maximum=2.0,
                        step=0.05,
                        value=1.2,
                        label="repetition_penalty",
                    )
                    max_new_tokens = gr.Slider(
                        minimum=256,
                        maximum=8192,
                        step=128,
                        value=DEFAULT_MAX_NEW_TOKENS,
                        label="max_new_tokens",
                    )

                run_btn = gr.Button("Generate Sound Effect", variant="primary", elem_id="run-btn")

            with gr.Column(scale=2):
                output_audio = gr.Audio(label="Output Audio", type="numpy", elem_id="output_audio")
                status = gr.Textbox(label="Status", lines=4, interactive=False)

        run_btn.click(
            fn=lambda ambient_sound, duration_seconds, temperature, top_p, top_k, repetition_penalty, max_new_tokens: run_inference(
                ambient_sound=ambient_sound,
                duration_seconds=duration_seconds,
                temperature=temperature,
                top_p=top_p,
                top_k=top_k,
                repetition_penalty=repetition_penalty,
                max_new_tokens=max_new_tokens,
                model_path=args.model_path,
                device=args.device,
                attn_implementation=args.attn_implementation,
            ),
            inputs=[
                ambient_sound,
                duration_seconds,
                temperature,
                top_p,
                top_k,
                repetition_penalty,
                max_new_tokens,
            ],
            outputs=[output_audio, status],
        )
    return demo


def main():
    parser = argparse.ArgumentParser(description="MOSS-SoundEffect Gradio Demo")
    parser.add_argument("--model_path", type=str, default=MODEL_PATH)
    parser.add_argument("--device", type=str, default="cuda:0")
    parser.add_argument("--attn_implementation", type=str, default=DEFAULT_ATTN_IMPLEMENTATION)
    parser.add_argument("--host", type=str, default="0.0.0.0")
    parser.add_argument("--port", type=int, default=7861)
    parser.add_argument("--share", action="store_true")
    args = parser.parse_args()

    runtime_device = torch.device(args.device if torch.cuda.is_available() else "cpu")
    runtime_dtype = torch.bfloat16 if runtime_device.type == "cuda" else torch.float32
    args.attn_implementation = resolve_attn_implementation(
        requested=args.attn_implementation,
        device=runtime_device,
        dtype=runtime_dtype,
    ) or "none"
    print(f"[INFO] Using attn_implementation={args.attn_implementation}", flush=True)

    preload_started_at = time.monotonic()
    print(
        f"[Startup] Preloading backend: model={args.model_path}, device={args.device}, attn={args.attn_implementation}",
        flush=True,
    )
    load_backend(
        model_path=args.model_path,
        device_str=args.device,
        attn_implementation=args.attn_implementation,
    )
    print(
        f"[Startup] Backend preload finished in {time.monotonic() - preload_started_at:.2f}s",
        flush=True,
    )

    demo = build_demo(args)
    demo.queue(max_size=16, default_concurrency_limit=1).launch(
        server_name=args.host,
        server_port=args.port,
        share=args.share,
    )


if __name__ == "__main__":
    main()
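The `main()` above resolves `--attn_implementation` before preloading the backend. As a minimal, torch-free sketch of that selection order (assuming it mirrors `resolve_attn_implementation` as defined in `clis/moss_tts_app.py`), the helper below takes the device and package facts as plain arguments. The name `pick_attn` and its parameters `device_type`, `has_flash_attn`, `dtype_name`, and `cc_major` are hypothetical stand-ins for illustration, not part of the repository.

```python
def pick_attn(requested, device_type, has_flash_attn, dtype_name, cc_major):
    """Hypothetical, torch-free sketch of the attention backend selection order."""
    requested_norm = (requested or "").strip().lower()

    # An explicit "none" means: do not pass attn_implementation at all.
    if requested_norm == "none":
        return None

    # Any explicit value other than "" or "auto" is passed through unchanged.
    if requested_norm not in {"", "auto"}:
        return requested

    # Auto mode: prefer FlashAttention 2 on CUDA with half precision,
    # the flash_attn package installed, and compute capability >= 8.
    if (
        device_type == "cuda"
        and has_flash_attn
        and dtype_name in {"float16", "bfloat16"}
        and cc_major >= 8
    ):
        return "flash_attention_2"

    # CUDA fallback: PyTorch SDPA kernels.
    if device_type == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


print(pick_attn("auto", "cuda", True, "bfloat16", 8))   # flash_attention_2
print(pick_attn("auto", "cuda", False, "bfloat16", 8))  # sdpa
print(pick_attn("auto", "cpu", False, "float32", 0))    # eager
```

Keeping the decision separate from the hardware probes makes the fallback order easy to check on a machine without a GPU.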
|
clis/moss_tts_app.py
ADDED
|
@@ -0,0 +1,621 @@
import argparse
import functools
import importlib.util
from pathlib import Path
import re
import time
import orjson

import gradio as gr
import numpy as np
import torch
from transformers import AutoModel, AutoProcessor

# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

MODEL_PATH = "OpenMOSS-Team/MOSS-TTS"
DEFAULT_ATTN_IMPLEMENTATION = "auto"
DEFAULT_MAX_NEW_TOKENS = 4096
CONTINUATION_NOTICE = (
    "Continuation mode is active. Make sure the reference audio transcript is prepended to the input text."
)

MODE_CLONE = "Clone"
MODE_CONTINUE = "Continuation"
MODE_CONTINUE_CLONE = "Continuation + Clone"
ZH_TOKENS_PER_CHAR = 3.098411951313033
EN_TOKENS_PER_CHAR = 0.8673376262755219
REFERENCE_AUDIO_DIR = Path(__file__).resolve().parent.parent / "assets" / "audio"
EXAMPLE_TEXTS_JSONL_PATH = Path(__file__).resolve().parent.parent / "assets" / "text" / "moss_tts_example_texts.jsonl"


def _parse_example_id(example_id: str) -> tuple[str, int] | None:
    matched = re.fullmatch(r"(zh|en)/(\d+)", (example_id or "").strip())
    if matched is None:
        return None
    return matched.group(1), int(matched.group(2))


def _resolve_reference_audio_path(language: str, index: int) -> Path | None:
    stem_candidates = [f"reference_{language}_{index}"]
    for stem in stem_candidates:
        for ext in (".wav", ".mp3"):
            audio_path = REFERENCE_AUDIO_DIR / f"{stem}{ext}"
            if audio_path.exists():
                return audio_path
    return None


def build_example_rows() -> list[tuple[str, str, str]]:
    rows: list[tuple[str, str, str]] = []

    with open(EXAMPLE_TEXTS_JSONL_PATH, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            sample = orjson.loads(line)
            parsed = _parse_example_id(sample.get("id", ""))
            if parsed is None:
                continue

            language, index = parsed
            text = str(sample.get("text", "")).strip()
            audio_path = _resolve_reference_audio_path(language, index)
            if audio_path is None:
                continue

            rows.append((sample['role'], str(audio_path), text))

    return rows


EXAMPLE_ROWS = build_example_rows()


@functools.lru_cache(maxsize=1)
def load_backend(model_path: str, device_str: str, attn_implementation: str):
    device = torch.device(device_str if torch.cuda.is_available() else "cpu")
    dtype = torch.bfloat16 if device.type == "cuda" else torch.float32
    resolved_attn_implementation = resolve_attn_implementation(
        requested=attn_implementation,
        device=device,
        dtype=dtype,
    )

    processor = AutoProcessor.from_pretrained(
        model_path,
        trust_remote_code=True,
    )
    if hasattr(processor, "audio_tokenizer"):
        processor.audio_tokenizer = processor.audio_tokenizer.to(device)

    model_kwargs = {
        "trust_remote_code": True,
        "torch_dtype": dtype,
    }
    if resolved_attn_implementation:
        model_kwargs["attn_implementation"] = resolved_attn_implementation

    model = AutoModel.from_pretrained(model_path, **model_kwargs).to(device)
    model.eval()

    sample_rate = int(getattr(processor.model_config, "sampling_rate", 24000))
    return model, processor, device, sample_rate


def resolve_attn_implementation(requested: str, device: torch.device, dtype: torch.dtype) -> str | None:
    requested_norm = (requested or "").strip().lower()

    if requested_norm in {"none"}:
        return None

    if requested_norm not in {"", "auto"}:
        return requested

    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device.type == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability(device)
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device.type == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


def detect_text_language(text: str) -> str:
    zh_chars = len(re.findall(r"[\u4e00-\u9fff]", text))
    en_chars = len(re.findall(r"[A-Za-z]", text))
    if zh_chars == 0 and en_chars == 0:
        return "en"
    return "zh" if zh_chars >= en_chars else "en"


def supports_duration_control(mode_with_reference: str) -> bool:
    return mode_with_reference not in {MODE_CONTINUE, MODE_CONTINUE_CLONE}


def estimate_duration_tokens(text: str) -> tuple[str, int, int, int]:
    normalized = text or ""
    effective_len = max(len(normalized), 1)
    language = detect_text_language(normalized)
    factor = ZH_TOKENS_PER_CHAR if language == "zh" else EN_TOKENS_PER_CHAR
    default_tokens = max(1, int(effective_len * factor))
    min_tokens = max(1, int(default_tokens * 0.5))
    max_tokens = max(min_tokens, int(default_tokens * 1.5))
    return language, default_tokens, min_tokens, max_tokens


def update_duration_controls(
    enabled: bool,
    text: str,
    current_tokens: float | int | None,
    mode_with_reference: str,
):
    if not supports_duration_control(mode_with_reference):
        return (
            gr.update(visible=False),
            "Duration control is disabled for Continuation modes.",
            gr.update(value=False, interactive=False),
        )

    checkbox_update = gr.update(interactive=True)
    if not enabled:
        return gr.update(visible=False), "Duration control is disabled.", checkbox_update

    language, default_tokens, min_tokens, max_tokens = estimate_duration_tokens(text)
    # Slider is initialized with value=1 as a placeholder; treat it as "unset"
    # so first-time estimation uses the computed default instead of clamping to min.
    if current_tokens is None or int(current_tokens) == 1:
        slider_value = default_tokens
    else:
        slider_value = int(current_tokens)
    slider_value = max(min_tokens, min(max_tokens, slider_value))

    language_label = "Chinese" if language == "zh" else "English"
    hint = (
        f"Duration control enabled | detected language: {language_label} | "
        f"default={default_tokens}, range=[{min_tokens}, {max_tokens}]"
    )
    return (
        gr.update(
            visible=True,
            minimum=min_tokens,
            maximum=max_tokens,
            value=slider_value,
            step=1,
        ),
        hint,
        checkbox_update,
    )


def build_conversation(
    text: str,
    reference_audio: str | None,
    mode_with_reference: str,
    expected_tokens: int | None,
    processor,
):
    text = (text or "").strip()
    if not text:
        raise ValueError("Please enter text to synthesize.")

    user_kwargs = {"text": text}
    if expected_tokens is not None:
        user_kwargs["tokens"] = int(expected_tokens)

    if not reference_audio:
        conversations = [[processor.build_user_message(**user_kwargs)]]
        return conversations, "generation", "Direct Generation"

    if mode_with_reference == MODE_CLONE:
        clone_kwargs = dict(user_kwargs)
        clone_kwargs["reference"] = [reference_audio]
        conversations = [[processor.build_user_message(**clone_kwargs)]]
        return conversations, "generation", MODE_CLONE

    if mode_with_reference == MODE_CONTINUE:
        conversations = [
            [
                processor.build_user_message(**user_kwargs),
                processor.build_assistant_message(audio_codes_list=[reference_audio]),
            ]
        ]
        return conversations, "continuation", MODE_CONTINUE

    continue_clone_kwargs = dict(user_kwargs)
    continue_clone_kwargs["reference"] = [reference_audio]
    conversations = [
        [
            processor.build_user_message(**continue_clone_kwargs),
            processor.build_assistant_message(audio_codes_list=[reference_audio]),
        ]
    ]
    return conversations, "continuation", MODE_CONTINUE_CLONE


def render_mode_hint(reference_audio: str | None, mode_with_reference: str):
    if not reference_audio:
        return "Current mode: **Direct Generation** (no reference audio uploaded)"
    if mode_with_reference == MODE_CLONE:
        return "Current mode: **Clone** (speaker timbre will be cloned from the reference audio)"
    return f"Current mode: **{mode_with_reference}** \n> {CONTINUATION_NOTICE}"


def apply_example_selection(
    mode_with_reference: str,
    duration_control_enabled: bool,
    duration_tokens: int,
    evt: gr.SelectData,
):
    if evt is None or evt.index is None:
        return gr.update(), gr.update(), gr.update(), gr.update(), gr.update(), gr.update()

    if isinstance(evt.index, (tuple, list)):
        row_idx = int(evt.index[0])
    else:
        row_idx = int(evt.index)

    if row_idx < 0 or row_idx >= len(EXAMPLE_ROWS):
        return gr.update(), gr.update(), gr.update(), gr.update(), gr.update(), gr.update()

    _, audio_path, example_text = EXAMPLE_ROWS[row_idx]
    duration_slider_update, duration_hint, duration_checkbox_update = update_duration_controls(
        duration_control_enabled,
        example_text,
        duration_tokens,
        mode_with_reference,
    )
    return (
        audio_path,
        example_text,
        render_mode_hint(audio_path, mode_with_reference),
        duration_slider_update,
        duration_hint,
        duration_checkbox_update,
    )


def run_inference(
    text: str,
    reference_audio: str | None,
    mode_with_reference: str,
    duration_control_enabled: bool,
    duration_tokens: int,
    temperature: float,
    top_p: float,
    top_k: int,
    repetition_penalty: float,
    model_path: str,
    device: str,
    attn_implementation: str,
    max_new_tokens: int,
):
    started_at = time.monotonic()
    model, processor, torch_device, sample_rate = load_backend(
        model_path=model_path,
        device_str=device,
        attn_implementation=attn_implementation,
    )
    duration_enabled = bool(duration_control_enabled and supports_duration_control(mode_with_reference))
    expected_tokens = int(duration_tokens) if duration_enabled else None
    conversations, mode, mode_name = build_conversation(
        text=text,
        reference_audio=reference_audio,
        mode_with_reference=mode_with_reference,
        expected_tokens=expected_tokens,
        processor=processor,
    )

    batch = processor(conversations, mode=mode)
    input_ids = batch["input_ids"].to(torch_device)
    attention_mask = batch["attention_mask"].to(torch_device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=int(max_new_tokens),
            audio_temperature=float(temperature),
            audio_top_p=float(top_p),
            audio_top_k=int(top_k),
            audio_repetition_penalty=float(repetition_penalty),
        )

    messages = processor.decode(outputs)
    if not messages or messages[0] is None:
        raise RuntimeError("The model did not return a decodable audio result.")

    audio = messages[0].audio_codes_list[0]
    if isinstance(audio, torch.Tensor):
        audio_np = audio.detach().float().cpu().numpy()
    else:
        audio_np = np.asarray(audio, dtype=np.float32)

    if audio_np.ndim > 1:
        audio_np = audio_np.reshape(-1)
    audio_np = audio_np.astype(np.float32, copy=False)

    elapsed = time.monotonic() - started_at
    status = (
        f"Done | mode: {mode_name} | elapsed: {elapsed:.2f}s | "
        f"max_new_tokens={int(max_new_tokens)}, "
        f"expected_tokens={expected_tokens if expected_tokens is not None else 'off'}, "
        f"audio_temperature={float(temperature):.2f}, audio_top_p={float(top_p):.2f}, "
        f"audio_top_k={int(top_k)}, audio_repetition_penalty={float(repetition_penalty):.2f}"
    )
    return (sample_rate, audio_np), status


def build_demo(args: argparse.Namespace):
    custom_css = """
    :root {
      --bg: #f6f7f8;
      --panel: #ffffff;
      --ink: #111418;
      --muted: #4d5562;
      --line: #e5e7eb;
      --accent: #0f766e;
    }
    .gradio-container {
      background: linear-gradient(180deg, #f7f8fa 0%, #f3f5f7 100%);
      color: var(--ink);
    }
    .app-card {
      border: 1px solid var(--line);
      border-radius: 16px;
      background: var(--panel);
      padding: 14px;
    }
    .app-title {
      font-size: 22px;
      font-weight: 700;
      margin-bottom: 6px;
      letter-spacing: 0.2px;
    }
    .app-subtitle {
      color: var(--muted);
      font-size: 14px;
      margin-bottom: 8px;
    }
    #output_audio {
      padding-bottom: 12px;
      margin-bottom: 8px;
      overflow: hidden !important;
    }
    #output_audio > .wrap {
      overflow: hidden !important;
    }
    #output_audio audio {
      margin-bottom: 6px;
    }
    #run-btn {
      background: var(--accent);
      border: none;
    }
    """

    with gr.Blocks(title="MOSS-TTS Demo", css=custom_css) as demo:
        gr.Markdown(
            """
            <div class="app-card">
              <div class="app-title">MOSS-TTS</div>
              <div class="app-subtitle">Minimal UI: Direct Generation, Clone, Continuation, Continuation + Clone</div>
            </div>
            """
        )

        with gr.Row(equal_height=False):
            with gr.Column(scale=3):
                text = gr.Textbox(
                    label="Text",
                    lines=9,
                    placeholder="Enter text to synthesize. In continuation modes, prepend the reference audio transcript.",
                )
                reference_audio = gr.Audio(
                    label="Reference Audio (Optional)",
                    type="filepath",
                )
                mode_with_reference = gr.Radio(
                    choices=[MODE_CLONE, MODE_CONTINUE, MODE_CONTINUE_CLONE],
                    value=MODE_CLONE,
                    label="Mode with Reference Audio",
                    info="If no reference audio is uploaded, Direct Generation will be used automatically.",
                )
                mode_hint = gr.Markdown(render_mode_hint(None, MODE_CLONE))
                duration_control_enabled = gr.Checkbox(
                    value=False,
                    label="Enable Duration Control (Expected Audio Tokens)",
                )
                duration_tokens = gr.Slider(
                    minimum=1,
                    maximum=1,
                    step=1,
                    value=1,
                    label="expected_tokens",
                    visible=False,
                )
                duration_hint = gr.Markdown("Duration control is disabled.")

                with gr.Accordion("Sampling Parameters (Audio)", open=True):
                    temperature = gr.Slider(
                        minimum=0.1,
                        maximum=3.0,
                        step=0.05,
                        value=1.7,
                        label="temperature",
                    )
                    top_p = gr.Slider(
                        minimum=0.1,
                        maximum=1.0,
                        step=0.01,
                        value=0.8,
                        label="top_p",
                    )
                    top_k = gr.Slider(
                        minimum=1,
                        maximum=200,
                        step=1,
                        value=25,
                        label="top_k",
                    )
                    repetition_penalty = gr.Slider(
                        minimum=0.8,
                        maximum=2.0,
                        step=0.05,
                        value=1.0,
                        label="repetition_penalty",
                    )
                    max_new_tokens = gr.Slider(
                        minimum=256,
                        maximum=8192,
                        step=128,
                        value=DEFAULT_MAX_NEW_TOKENS,
                        label="max_new_tokens",
                    )

                run_btn = gr.Button("Generate Speech", variant="primary", elem_id="run-btn")

            with gr.Column(scale=2):
                output_audio = gr.Audio(label="Output Audio", type="numpy", elem_id="output_audio")
                status = gr.Textbox(label="Status", lines=4, interactive=False)
                examples_table = gr.Dataframe(
                    headers=["Reference Speech", "Example Text"],
                    value=[[name, text] for name, _, text in EXAMPLE_ROWS],
                    datatype=["str", "str"],
                    row_count=(len(EXAMPLE_ROWS), "fixed"),
                    col_count=(2, "fixed"),
                    interactive=False,
                    wrap=True,
                    label="Examples (click a row to fill inputs)",
                )

        reference_audio.change(
            fn=render_mode_hint,
            inputs=[reference_audio, mode_with_reference],
            outputs=[mode_hint],
        )
        mode_with_reference.change(
            fn=render_mode_hint,
            inputs=[reference_audio, mode_with_reference],
            outputs=[mode_hint],
        )
        duration_control_enabled.change(
            fn=update_duration_controls,
            inputs=[duration_control_enabled, text, duration_tokens, mode_with_reference],
            outputs=[duration_tokens, duration_hint, duration_control_enabled],
        )
        text.change(
            fn=update_duration_controls,
            inputs=[duration_control_enabled, text, duration_tokens, mode_with_reference],
            outputs=[duration_tokens, duration_hint, duration_control_enabled],
        )
        mode_with_reference.change(
            fn=update_duration_controls,
            inputs=[duration_control_enabled, text, duration_tokens, mode_with_reference],
            outputs=[duration_tokens, duration_hint, duration_control_enabled],
        )
        examples_table.select(
            fn=apply_example_selection,
            inputs=[mode_with_reference, duration_control_enabled, duration_tokens],
            outputs=[
                reference_audio,
                text,
                mode_hint,
                duration_tokens,
                duration_hint,
                duration_control_enabled,
            ],
        )

        run_btn.click(
            fn=lambda text, reference_audio, mode_with_reference, duration_control_enabled, duration_tokens, temperature, top_p, top_k, repetition_penalty, max_new_tokens: run_inference(
                text=text,
                reference_audio=reference_audio,
                mode_with_reference=mode_with_reference,
                duration_control_enabled=duration_control_enabled,
                duration_tokens=duration_tokens,
                temperature=temperature,
                top_p=top_p,
                top_k=top_k,
                repetition_penalty=repetition_penalty,
                model_path=args.model_path,
                device=args.device,
                attn_implementation=args.attn_implementation,
                max_new_tokens=max_new_tokens,
            ),
            inputs=[
                text,
                reference_audio,
                mode_with_reference,
                duration_control_enabled,
                duration_tokens,
                temperature,
                top_p,
                top_k,
                repetition_penalty,
                max_new_tokens,
            ],
            outputs=[output_audio, status],
        )
    return demo


def main():
    parser = argparse.ArgumentParser(description="MossTTS Gradio Demo")
    parser.add_argument("--model_path", type=str, default=MODEL_PATH)
    parser.add_argument("--device", type=str, default="cuda:0")
    parser.add_argument("--attn_implementation", type=str, default=DEFAULT_ATTN_IMPLEMENTATION)
    parser.add_argument("--host", type=str, default="0.0.0.0")
    parser.add_argument("--port", type=int, default=7860)
    parser.add_argument("--share", action="store_true")
    args = parser.parse_args()

    runtime_device = torch.device(args.device if torch.cuda.is_available() else "cpu")
    runtime_dtype = torch.bfloat16 if runtime_device.type == "cuda" else torch.float32
    args.attn_implementation = resolve_attn_implementation(
        requested=args.attn_implementation,
        device=runtime_device,
        dtype=runtime_dtype,
    ) or "none"
    print(f"[INFO] Using attn_implementation={args.attn_implementation}", flush=True)

    # Preload model/processor at startup to avoid first-request cold start latency.
    preload_started_at = time.monotonic()
    print(
        f"[Startup] Preloading backend: model={args.model_path}, device={args.device}, attn={args.attn_implementation}",
        flush=True,
    )
    load_backend(
        model_path=args.model_path,
        device_str=args.device,
        attn_implementation=args.attn_implementation,
    )
    print(
        f"[Startup] Backend preload finished in {time.monotonic() - preload_started_at:.2f}s",
        flush=True,
    )

    demo = build_demo(args)
    demo.queue(max_size=16, default_concurrency_limit=1).launch(
        server_name=args.host,
        server_port=args.port,
        share=args.share,
    )


if __name__ == "__main__":
    main()
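The duration-control slider in this file is driven by a per-character token estimate. As a rough, standalone sketch of that heuristic (it follows `estimate_duration_tokens` and `detect_text_language` above, with the two constants copied from this file), the snippet below can be run without Gradio or the model. The function name `estimate_range` is a hypothetical stand-in used only for this illustration.

```python
import re

# Constants copied from clis/moss_tts_app.py.
ZH_TOKENS_PER_CHAR = 3.098411951313033
EN_TOKENS_PER_CHAR = 0.8673376262755219


def estimate_range(text):
    """Hypothetical standalone version of the expected-token estimate."""
    text = text or ""
    zh_chars = len(re.findall(r"[\u4e00-\u9fff]", text))
    en_chars = len(re.findall(r"[A-Za-z]", text))
    # Text with no CJK or Latin letters defaults to English; ties go to Chinese.
    if zh_chars == 0 and en_chars == 0:
        language = "en"
    else:
        language = "zh" if zh_chars >= en_chars else "en"

    factor = ZH_TOKENS_PER_CHAR if language == "zh" else EN_TOKENS_PER_CHAR
    default_tokens = max(1, int(max(len(text), 1) * factor))
    # The slider range is 0.5x to 1.5x of the default, never below 1.
    min_tokens = max(1, int(default_tokens * 0.5))
    max_tokens = max(min_tokens, int(default_tokens * 1.5))
    return language, default_tokens, min_tokens, max_tokens


print(estimate_range("Hello world"))  # ('en', 9, 4, 13)
print(estimate_range("你好世界"))      # ('zh', 12, 6, 18)
print(estimate_range(""))             # ('en', 1, 1, 1)
```

Note that the estimate counts every character in the string, including spaces and punctuation, while language detection counts only CJK and Latin letters.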
|
clis/moss_ttsd_app.py
ADDED
|
@@ -0,0 +1,811 @@
|
| 1 |
+
import argparse
|
| 2 |
+
import functools
|
| 3 |
+
import importlib.util
|
| 4 |
+
import re
|
| 5 |
+
import time
|
| 6 |
+
from pathlib import Path
|
| 7 |
+
from typing import Optional
|
| 8 |
+
|
| 9 |
+
import gradio as gr
|
| 10 |
+
import numpy as np
|
| 11 |
+
import torch
|
| 12 |
+
import torchaudio
|
| 13 |
+
from transformers import AutoModel, AutoProcessor
|
| 14 |
+
|
| 15 |
+
# Disable the broken cuDNN SDPA backend
|
| 16 |
+
torch.backends.cuda.enable_cudnn_sdp(False)
|
| 17 |
+
# Keep these enabled as fallbacks
|
| 18 |
+
torch.backends.cuda.enable_flash_sdp(True)
|
| 19 |
+
torch.backends.cuda.enable_mem_efficient_sdp(True)
|
| 20 |
+
torch.backends.cuda.enable_math_sdp(True)
|
| 21 |
+
|
| 22 |
+
MODEL_PATH = "OpenMOSS-Team/MOSS-TTSD-v1.0"
|
| 23 |
+
CODEC_MODEL_PATH = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
|
| 24 |
+
DEFAULT_ATTN_IMPLEMENTATION = "auto"
|
| 25 |
+
DEFAULT_MAX_NEW_TOKENS = 2000
|
| 26 |
+
MIN_SPEAKERS = 1
|
| 27 |
+
MAX_SPEAKERS = 5
|
| 28 |
+
PRESET_REF_AUDIO_S1 = "assets/audio/reference_02_s1.wav"
|
| 29 |
+
PRESET_REF_AUDIO_S2 = "assets/audio/reference_02_s2.wav"
|
| 30 |
+
PRESET_PROMPT_TEXT_S1 = (
|
| 31 |
+
"[S1] In short, we embarked on a mission to make America great again for all Americans."
|
| 32 |
+
)
|
| 33 |
+
PRESET_PROMPT_TEXT_S2 = (
|
| 34 |
+
"[S2] NVIDIA reinvented computing for the first time after 60 years. In fact, Erwin at IBM knows quite "
|
| 35 |
+
"well that the computer has largely been the same since the 60s."
|
| 36 |
+
)
|
| 37 |
+
PRESET_DIALOGUE_TEXT = (
|
| 38 |
+
"[S1] Listen, let's talk business. China. I'm hearing things.\n"
|
| 39 |
+
"People are saying they're catching up. Fast. What's the real scoop?\n"
|
| 40 |
+
"Their AI, is it a threat?\n"
|
| 41 |
+
"[S2] Well, the pace of innovation there is extraordinary, honestly.\n"
|
| 42 |
+
"They have the researchers, and they have the drive.\n"
|
| 43 |
+
"[S1] Extraordinary? I don't like that. I want us to be extraordinary.\n"
|
| 44 |
+
"Are they winning?\n"
|
| 45 |
+
"[S2] I wouldn't say winning, but their progress is very promising.\n"
|
| 46 |
+
"They are building massive clusters. They're very determined.\n"
|
| 47 |
+
"[S1] Promising. There it is. I hate that word.\n"
|
| 48 |
+
"When China is promising, it means we're losing.\n"
|
| 49 |
+
"It's a disaster, Jensen. A total disaster."
|
| 50 |
+
)
|
| 51 |
+
PRESET_EXAMPLES = [
|
| 52 |
+
{
|
| 53 |
+
"name": "Quick Start | reference_02_s1/s2",
|
| 54 |
+
"speaker_count": 2,
|
| 55 |
+
"s1_audio": PRESET_REF_AUDIO_S1,
|
| 56 |
+
"s1_prompt": PRESET_PROMPT_TEXT_S1,
|
| 57 |
+
"s2_audio": PRESET_REF_AUDIO_S2,
|
| 58 |
+
"s2_prompt": PRESET_PROMPT_TEXT_S2,
|
| 59 |
+
"dialogue_text": PRESET_DIALOGUE_TEXT,
|
| 60 |
+
}
|
| 61 |
+
]
|
| 62 |
+
PRESET_DISPLAY_FIELDS = [
|
| 63 |
+
("Speaker Count", "speaker_count"),
|
| 64 |
+
("S1 Reference Audio (Optional)", "s1_audio"),
|
| 65 |
+
("S1 Prompt Text (Required with reference audio)", "s1_prompt"),
|
| 66 |
+
("S2 Reference Audio (Optional)", "s2_audio"),
|
| 67 |
+
("S2 Prompt Text (Required with reference audio)", "s2_prompt"),
|
| 68 |
+
("Dialogue Text", "dialogue_text"),
|
| 69 |
+
]
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
def _build_preset_table_rows():
|
| 73 |
+
rows = []
|
| 74 |
+
row_to_preset = []
|
| 75 |
+
for preset_idx, preset in enumerate(PRESET_EXAMPLES):
|
| 76 |
+
for field_name, field_key in PRESET_DISPLAY_FIELDS:
|
| 77 |
+
value = str(preset.get(field_key, ""))
|
| 78 |
+
if field_key == "dialogue_text":
|
| 79 |
+
value = value.replace("\n", " ").strip()
|
| 80 |
+
if len(value) > 120:
|
| 81 |
+
value = value[:120] + " ..."
|
| 82 |
+
rows.append([field_name, value])
|
| 83 |
+
row_to_preset.append(preset_idx)
|
| 84 |
+
return rows, row_to_preset
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
PRESET_TABLE_ROWS, PRESET_TABLE_ROW_TO_PRESET = _build_preset_table_rows()
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
def resolve_attn_implementation(requested: str, device: torch.device, dtype: torch.dtype) -> str | None:
|
| 91 |
+
requested_norm = (requested or "").strip().lower()
|
| 92 |
+
|
| 93 |
+
if requested_norm in {"none"}:
|
| 94 |
+
return None
|
| 95 |
+
|
| 96 |
+
if requested_norm not in {"", "auto"}:
|
| 97 |
+
return requested
|
| 98 |
+
|
| 99 |
+
# Prefer FlashAttention 2 when package + device conditions are met.
|
| 100 |
+
if (
|
| 101 |
+
device.type == "cuda"
|
| 102 |
+
and importlib.util.find_spec("flash_attn") is not None
|
| 103 |
+
and dtype in {torch.float16, torch.bfloat16}
|
| 104 |
+
):
|
| 105 |
+
major, _ = torch.cuda.get_device_capability(device)
|
| 106 |
+
if major >= 8:
|
| 107 |
+
return "flash_attention_2"
|
| 108 |
+
|
| 109 |
+
# CUDA fallback: use PyTorch SDPA kernels.
|
| 110 |
+
if device.type == "cuda":
|
| 111 |
+
return "sdpa"
|
| 112 |
+
|
| 113 |
+
# CPU fallback.
|
| 114 |
+
return "eager"
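The fallback order in `resolve_attn_implementation` can be checked without a GPU. The following is a minimal sketch that mirrors the same decision order on plain values. `pick_attn` and its parameters (`has_flash_attn`, `half_precision`, `cc_major`) are hypothetical stand-ins for the `importlib`, dtype, and `torch.cuda.get_device_capability` checks above; they are not part of the app.

```python
from typing import Optional


def pick_attn(
    requested: str,
    device_type: str,
    has_flash_attn: bool,
    half_precision: bool,
    cc_major: int,
) -> Optional[str]:
    """Hypothetical mirror of the order: explicit choice, FlashAttention 2, SDPA, eager."""
    requested_norm = (requested or "").strip().lower()
    if requested_norm == "none":
        return None  # the caller then leaves attn_implementation unset
    if requested_norm not in {"", "auto"}:
        return requested  # an explicit choice is passed through unchanged
    # FlashAttention 2 needs CUDA, the package, fp16/bf16, and compute capability >= 8.
    if device_type == "cuda" and has_flash_attn and half_precision and cc_major >= 8:
        return "flash_attention_2"
    if device_type == "cuda":
        return "sdpa"
    return "eager"


print(pick_attn("auto", "cuda", True, True, 8))
print(pick_attn("auto", "cuda", True, True, 7))
print(pick_attn("auto", "cpu", False, False, 0))
```

Under these assumptions, a pre-Ampere GPU (major capability 7) falls through to `sdpa` even when `flash_attn` is installed, and a CPU run always resolves to `eager`.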
|
| 115 |
+
|
| 116 |
+
|
| 117 |
+
@functools.lru_cache(maxsize=1)
|
| 118 |
+
def load_backend(model_path: str, codec_path: str, device_str: str, attn_implementation: str):
|
| 119 |
+
device = torch.device(device_str if torch.cuda.is_available() else "cpu")
|
| 120 |
+
dtype = torch.bfloat16 if device.type == "cuda" else torch.float32
|
| 121 |
+
resolved_attn_implementation = resolve_attn_implementation(
|
| 122 |
+
requested=attn_implementation,
|
| 123 |
+
device=device,
|
| 124 |
+
dtype=dtype,
|
| 125 |
+
)
|
| 126 |
+
|
| 127 |
+
processor = AutoProcessor.from_pretrained(
|
| 128 |
+
model_path,
|
| 129 |
+
trust_remote_code=True,
|
| 130 |
+
codec_path=codec_path,
|
| 131 |
+
)
|
| 132 |
+
if hasattr(processor, "audio_tokenizer"):
|
| 133 |
+
processor.audio_tokenizer = processor.audio_tokenizer.to(device)
|
| 134 |
+
processor.audio_tokenizer.eval()
|
| 135 |
+
|
| 136 |
+
model_kwargs = {
|
| 137 |
+
"trust_remote_code": True,
|
| 138 |
+
"torch_dtype": dtype,
|
| 139 |
+
}
|
| 140 |
+
if resolved_attn_implementation:
|
| 141 |
+
model_kwargs["attn_implementation"] = resolved_attn_implementation
|
| 142 |
+
|
| 143 |
+
model = AutoModel.from_pretrained(model_path, **model_kwargs).to(device)
|
| 144 |
+
model.eval()
|
| 145 |
+
|
| 146 |
+
sample_rate = int(getattr(processor.model_config, "sampling_rate", 24000))
|
| 147 |
+
return model, processor, device, sample_rate
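Because `load_backend` is wrapped in `functools.lru_cache(maxsize=1)`, a later call with the same arguments reuses the loaded model, while any change in arguments replaces it. The sketch below shows that behavior with a hypothetical `load_backend_stub` that only records calls; it does not load a model.

```python
import functools

load_calls = []  # records every real (uncached) load


@functools.lru_cache(maxsize=1)
def load_backend_stub(model_path: str, device_str: str) -> str:
    """Hypothetical stand-in for an expensive model load."""
    load_calls.append((model_path, device_str))
    return f"backend({model_path}, {device_str})"


# Two calls with identical keyword arguments: the second is a cache hit.
load_backend_stub(model_path="demo-model", device_str="cpu")
load_backend_stub(model_path="demo-model", device_str="cpu")
print(len(load_calls))

# A different argument value misses the cache and takes over the single slot.
load_backend_stub(model_path="other-model", device_str="cpu")
print(len(load_calls))
print(load_backend_stub.cache_info().currsize)
```

The Python documentation notes that differing argument patterns (for example, keyword arguments in a different order) may be treated as distinct calls, so call sites that want to share the cached backend are safest when they pass arguments the same way.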
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
def _resample_wav(wav: torch.Tensor, orig_sr: int, target_sr: int) -> torch.Tensor:
|
| 151 |
+
if int(orig_sr) == int(target_sr):
|
| 152 |
+
return wav
|
| 153 |
+
new_num_samples = int(round(wav.shape[-1] * float(target_sr) / float(orig_sr)))
|
| 154 |
+
if new_num_samples <= 0:
|
| 155 |
+
raise ValueError(f"Invalid resample length from {orig_sr}Hz to {target_sr}Hz.")
|
| 156 |
+
return torch.nn.functional.interpolate(
|
| 157 |
+
wav.unsqueeze(0),
|
| 158 |
+
size=new_num_samples,
|
| 159 |
+
mode="linear",
|
| 160 |
+
align_corners=False,
|
| 161 |
+
).squeeze(0)
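The output length in `_resample_wav` is plain arithmetic, and the `<= 0` guard matters only for very short clips. A minimal sketch of that length computation follows; `resampled_length` is a hypothetical helper written for illustration.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Hypothetical helper: number of samples after resampling, as computed above."""
    new_len = int(round(num_samples * float(target_sr) / float(orig_sr)))
    if new_len <= 0:
        raise ValueError(f"Invalid resample length from {orig_sr}Hz to {target_sr}Hz.")
    return new_len


print(resampled_length(48000, 48000, 24000))  # one second stays one second
print(resampled_length(22050, 44100, 24000))  # half a second at 44.1 kHz

try:
    resampled_length(1, 48000, 8000)  # 1/6 of a sample rounds to zero
except ValueError as exc:
    print(exc)
```

Linear interpolation applies no low-pass filter when downsampling. torchaudio also provides `torchaudio.functional.resample`, which uses band-limited sinc interpolation; the source does not say whether it was considered here.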
|
| 162 |
+
|
| 163 |
+
|
| 164 |
+
def _load_audio(audio_path: str) -> tuple[torch.Tensor, int]:
|
| 165 |
+
path = Path(audio_path).expanduser()
|
| 166 |
+
if not path.exists():
|
| 167 |
+
raise FileNotFoundError(f"Reference audio not found: {path}")
|
| 168 |
+
|
| 169 |
+
wav, sr = torchaudio.load(str(path))
|
| 170 |
+
if wav.numel() == 0:
|
| 171 |
+
raise ValueError(f"Reference audio is empty: {path}")
|
| 172 |
+
|
| 173 |
+
if wav.shape[0] > 1:
|
| 174 |
+
wav = wav.mean(dim=0, keepdim=True)
|
| 175 |
+
|
| 176 |
+
return wav, int(sr)
|
| 177 |
+
|
| 178 |
+
|
| 179 |
+
def normalize_text(text: str) -> str:
|
| 180 |
+
text = re.sub(r"\[(\d+)\]", r"[S\1]", text)
|
| 181 |
+
remove_chars = "【】《》()『』「」" '"-_“”~~‘’'
|
| 182 |
+
|
| 183 |
+
segments = re.split(r"(?=\[S\d+\])", text.replace("\n", " "))
|
| 184 |
+
processed_parts = []
|
| 185 |
+
for seg in segments:
|
| 186 |
+
seg = seg.strip()
|
| 187 |
+
if not seg:
|
| 188 |
+
continue
|
| 189 |
+
|
| 190 |
+
matched = re.match(r"^(\[S\d+\])\s*(.*)", seg)
|
| 191 |
+
tag, content = matched.groups() if matched else ("", seg)
|
| 192 |
+
|
| 193 |
+
content = re.sub(f"[{re.escape(remove_chars)}]", "", content)
|
| 194 |
+
content = re.sub(r"哈{2,}", "[笑]", content)
|
| 195 |
+
content = re.sub(r"\b(ha(\s*ha)+)\b", "[laugh]", content, flags=re.IGNORECASE)
|
| 196 |
+
|
| 197 |
+
content = content.replace("——", ",")
|
| 198 |
+
content = content.replace("……", ",")
|
| 199 |
+
content = content.replace("...", ",")
|
| 200 |
+
content = content.replace("⸺", ",")
|
| 201 |
+
content = content.replace("―", ",")
|
| 202 |
+
content = content.replace("—", ",")
|
| 203 |
+
content = content.replace("…", ",")
|
| 204 |
+
|
| 205 |
+
internal_punct_map = str.maketrans(
|
| 206 |
+
{";": ",", ";": ",", ":": ",", ":": ",", "、": ","}
|
| 207 |
+
)
|
| 208 |
+
content = content.translate(internal_punct_map)
|
| 209 |
+
content = content.strip()
|
| 210 |
+
content = re.sub(r"([,。?!,.?!])[,。?!,.?!]+", r"\1", content)
|
| 211 |
+
|
| 212 |
+
if len(content) > 1:
|
| 213 |
+
last_ch = "。" if content[-1] == "," else ("." if content[-1] == "," else content[-1])
|
| 214 |
+
body = content[:-1].replace("。", ",")
|
| 215 |
+
content = body + last_ch
|
| 216 |
+
|
| 217 |
+
processed_parts.append({"tag": tag, "content": content})
|
| 218 |
+
|
| 219 |
+
if not processed_parts:
|
| 220 |
+
return ""
|
| 221 |
+
|
| 222 |
+
merged_lines = []
|
| 223 |
+
current_tag = processed_parts[0]["tag"]
|
| 224 |
+
current_content = [processed_parts[0]["content"]]
|
| 225 |
+
for part in processed_parts[1:]:
|
| 226 |
+
if part["tag"] == current_tag and current_tag:
|
| 227 |
+
current_content.append(part["content"])
|
| 228 |
+
else:
|
| 229 |
+
merged_lines.append(f"{current_tag}{''.join(current_content)}".strip())
|
| 230 |
+
current_tag = part["tag"]
|
| 231 |
+
current_content = [part["content"]]
|
| 232 |
+
merged_lines.append(f"{current_tag}{''.join(current_content)}".strip())
|
| 233 |
+
|
| 234 |
+
return "".join(merged_lines).replace("‘", "'").replace("’", "'")
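The first steps of `normalize_text` are easier to follow on a concrete string. The sketch below replays three of them (tag rewriting, splitting before each speaker tag, and collapsing laughter and repeated punctuation) using the same regular expressions, restricted to ASCII punctuation for brevity. The sample sentence is invented for illustration.

```python
import re

text = "[1] Ha ha ha, that's great!! [2] Well, okay."

# Step 1: bare numeric tags such as [1] become speaker tags such as [S1].
tagged = re.sub(r"\[(\d+)\]", r"[S\1]", text)
print(tagged)

# Step 2: split just before each speaker tag (zero-width lookahead), dropping empties.
segments = [s.strip() for s in re.split(r"(?=\[S\d+\])", tagged) if s.strip()]
print(segments)

# Step 3: collapse runs of "ha" into one marker, then collapse repeated punctuation.
first = re.sub(r"\b(ha(\s*ha)+)\b", "[laugh]", segments[0], flags=re.IGNORECASE)
first = re.sub(r"([,.?!])[,.?!]+", r"\1", first)
print(first)
```

The lookahead split keeps each tag attached to the text that follows it, which is what lets the later merge step compare neighboring tags.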
|
| 235 |
+
|
| 236 |
+
|
| 237 |
+
def _validate_dialogue_text(dialogue_text: str, speaker_count: int) -> str:
|
| 238 |
+
text = (dialogue_text or "").strip()
|
| 239 |
+
if not text:
|
| 240 |
+
raise ValueError("Please enter dialogue text.")
|
| 241 |
+
|
| 242 |
+
tags = re.findall(r"\[S(\d+)\]", text)
|
| 243 |
+
if not tags:
|
| 244 |
+
raise ValueError("Dialogue must include speaker tags like [S1], [S2], ...")
|
| 245 |
+
|
| 246 |
+
max_tag = max(int(t) for t in tags)
|
| 247 |
+
if max_tag > speaker_count:
|
| 248 |
+
raise ValueError(
|
| 249 |
+
f"Dialogue contains [S{max_tag}], but speaker count is set to {speaker_count}."
|
| 250 |
+
)
|
| 251 |
+
return text
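The validation above reduces to finding the highest `[Sn]` index and comparing it with the slider value. A small sketch of that check follows; `highest_speaker_tag` is a hypothetical helper name.

```python
import re


def highest_speaker_tag(text: str) -> int:
    """Hypothetical helper: largest n among [Sn] tags, or ValueError if none exist."""
    tags = re.findall(r"\[S(\d+)\]", text)
    if not tags:
        raise ValueError("Dialogue must include speaker tags like [S1], [S2], ...")
    return max(int(t) for t in tags)


dialogue = "[S1] Hello. [S3] Hi there. [S1] Shall we start?"
speaker_count = 2
max_tag = highest_speaker_tag(dialogue)
print(max_tag)
print(max_tag > speaker_count)  # True means the dialogue would be rejected
```

Note that only the maximum is checked, so gaps in numbering (for example `[S1]` and `[S3]` with three speakers) pass validation.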
|
| 252 |
+
|
| 253 |
+
|
| 254 |
+
def update_speaker_panels(speaker_count: int):
|
| 255 |
+
count = int(speaker_count)
|
| 256 |
+
count = max(MIN_SPEAKERS, min(MAX_SPEAKERS, count))
|
| 257 |
+
return [gr.update(visible=(idx < count)) for idx in range(MAX_SPEAKERS)]
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
def apply_preset_selection(evt: gr.SelectData):
|
| 261 |
+
if evt is None or evt.index is None:
|
| 262 |
+
return (
|
| 263 |
+
gr.update(),
|
| 264 |
+
gr.update(),
|
| 265 |
+
gr.update(),
|
| 266 |
+
gr.update(),
|
| 267 |
+
gr.update(),
|
| 268 |
+
gr.update(),
|
| 269 |
+
*[gr.update() for _ in range(MAX_SPEAKERS)],
|
| 270 |
+
)
|
| 271 |
+
|
| 272 |
+
if isinstance(evt.index, (tuple, list)):
|
| 273 |
+
row_idx = int(evt.index[0])
|
| 274 |
+
else:
|
| 275 |
+
row_idx = int(evt.index)
|
| 276 |
+
|
| 277 |
+
if row_idx < 0 or row_idx >= len(PRESET_TABLE_ROW_TO_PRESET):
|
| 278 |
+
return (
|
| 279 |
+
gr.update(),
|
| 280 |
+
gr.update(),
|
| 281 |
+
gr.update(),
|
| 282 |
+
gr.update(),
|
| 283 |
+
gr.update(),
|
| 284 |
+
gr.update(),
|
| 285 |
+
*[gr.update() for _ in range(MAX_SPEAKERS)],
|
| 286 |
+
)
|
| 287 |
+
|
| 288 |
+
preset_idx = PRESET_TABLE_ROW_TO_PRESET[row_idx]
|
| 289 |
+
if preset_idx < 0 or preset_idx >= len(PRESET_EXAMPLES):
|
| 290 |
+
return (
|
| 291 |
+
gr.update(),
|
| 292 |
+
gr.update(),
|
| 293 |
+
gr.update(),
|
| 294 |
+
gr.update(),
|
| 295 |
+
gr.update(),
|
| 296 |
+
gr.update(),
|
| 297 |
+
*[gr.update() for _ in range(MAX_SPEAKERS)],
|
| 298 |
+
)
|
| 299 |
+
|
| 300 |
+
preset = PRESET_EXAMPLES[preset_idx]
|
| 301 |
+
panel_updates = update_speaker_panels(int(preset["speaker_count"]))
|
| 302 |
+
return (
|
| 303 |
+
gr.update(value=int(preset["speaker_count"])),
|
| 304 |
+
gr.update(value=str(preset["s1_audio"])),
|
| 305 |
+
gr.update(value=str(preset["s1_prompt"])),
|
| 306 |
+
gr.update(value=str(preset["s2_audio"])),
|
| 307 |
+
gr.update(value=str(preset["s2_prompt"])),
|
| 308 |
+
gr.update(value=str(preset["dialogue_text"])),
|
| 309 |
+
*panel_updates,
|
| 310 |
+
)
|
| 311 |
+
|
| 312 |
+
|
| 313 |
+
def _merge_consecutive_speaker_tags(text: str) -> str:
|
| 314 |
+
segments = re.split(r"(?=\[S\d+\])", text)
|
| 315 |
+
if not segments:
|
| 316 |
+
return text
|
| 317 |
+
|
| 318 |
+
merged_parts = []
|
| 319 |
+
current_tag = None
|
| 320 |
+
for seg in segments:
|
| 321 |
+
seg = seg.strip()
|
| 322 |
+
if not seg:
|
| 323 |
+
continue
|
| 324 |
+
matched = re.match(r"^(\[S\d+\])\s*(.*)", seg, re.DOTALL)
|
| 325 |
+
if not matched:
|
| 326 |
+
merged_parts.append(seg)
|
| 327 |
+
continue
|
| 328 |
+
tag, content = matched.groups()
|
| 329 |
+
if tag == current_tag:
|
| 330 |
+
merged_parts.append(content)
|
| 331 |
+
else:
|
| 332 |
+
current_tag = tag
|
| 333 |
+
merged_parts.append(f"{tag}{content}")
|
| 334 |
+
return "".join(merged_parts)
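To see what the merge does when a prompt prefix and the dialogue begin with the same speaker, the following sketch restates the function's logic in a self-contained form and runs it on an invented example. `merge_tags` is a hypothetical name for this copy.

```python
import re


def merge_tags(text: str) -> str:
    """Hypothetical copy of the merge logic: drop a tag that repeats the previous one."""
    merged, current_tag = [], None
    for seg in re.split(r"(?=\[S\d+\])", text):
        seg = seg.strip()
        if not seg:
            continue
        matched = re.match(r"^(\[S\d+\])\s*(.*)", seg, re.DOTALL)
        if not matched:
            merged.append(seg)  # text before the first tag is kept as-is
            continue
        tag, content = matched.groups()
        if tag == current_tag:
            merged.append(content)  # same speaker continues: omit the repeated tag
        else:
            current_tag = tag
            merged.append(f"{tag}{content}")
    return "".join(merged)


prompt_prefix = "[S1] Reference line for speaker one."
dialogue = "[S1] Hello there.[S2] Hi!"
print(merge_tags(prompt_prefix + dialogue))
```

The whitespace after each tag is consumed by `\s*` and segments are joined without a separator, so the merged text carries no space between a speaker's consecutive sentences.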
|
| 335 |
+
|
| 336 |
+
|
| 337 |
+
def _normalize_prompt_text(prompt_text: str, speaker_id: int) -> str:
|
| 338 |
+
text = (prompt_text or "").strip()
|
| 339 |
+
if not text:
|
| 340 |
+
raise ValueError(f"S{speaker_id} prompt text is empty.")
|
| 341 |
+
|
| 342 |
+
expected_tag = f"[S{speaker_id}]"
|
| 343 |
+
if not text.lstrip().startswith(expected_tag):
|
| 344 |
+
text = f"{expected_tag} {text}"
|
| 345 |
+
return text
|
| 346 |
+
|
| 347 |
+
|
| 348 |
+
def _build_prefixed_text(
|
| 349 |
+
dialogue_text: str,
|
| 350 |
+
prompt_text_map: dict[int, str],
|
| 351 |
+
cloned_speakers: list[int],
|
| 352 |
+
) -> str:
|
| 353 |
+
prompt_prefix = "".join([prompt_text_map[speaker_id] for speaker_id in cloned_speakers])
|
| 354 |
+
return _merge_consecutive_speaker_tags(prompt_prefix + dialogue_text)
|
| 355 |
+
|
| 356 |
+
|
| 357 |
+
def _encode_reference_audio_codes(
|
| 358 |
+
processor,
|
| 359 |
+
clone_wavs: list[torch.Tensor],
|
| 360 |
+
cloned_speakers: list[int],
|
| 361 |
+
speaker_count: int,
|
| 362 |
+
sample_rate: int,
|
| 363 |
+
) -> list[Optional[torch.Tensor]]:
|
| 364 |
+
encoded_list = processor.encode_audios_from_wav(clone_wavs, sampling_rate=sample_rate)
|
| 365 |
+
reference_audio_codes: list[Optional[torch.Tensor]] = [None for _ in range(speaker_count)]
|
| 366 |
+
for speaker_id, audio_codes in zip(cloned_speakers, encoded_list):
|
| 367 |
+
reference_audio_codes[speaker_id - 1] = audio_codes
|
| 368 |
+
return reference_audio_codes
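The returned list is indexed by speaker position, with `None` for speakers that have no reference audio. A minimal sketch of that placement follows, with strings standing in for the encoded tensors; `place_by_speaker` is a hypothetical name.

```python
from typing import List, Optional


def place_by_speaker(
    cloned_speakers: List[int],
    encoded: List[str],
    speaker_count: int,
) -> List[Optional[str]]:
    """Hypothetical helper: put each encoded item at index (speaker_id - 1)."""
    slots: List[Optional[str]] = [None for _ in range(speaker_count)]
    for speaker_id, codes in zip(cloned_speakers, encoded):
        slots[speaker_id - 1] = codes  # speaker IDs are 1-based
    return slots


# Speakers 1 and 3 supplied reference audio; speaker 2 did not.
print(place_by_speaker([1, 3], ["codes_s1", "codes_s3"], 3))
```

Keeping a slot for every speaker preserves the mapping between list position and `[Sn]` tag even when only some speakers are cloned.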
|
| 369 |
+
|
| 370 |
+
|
| 371 |
+
def build_conversation(
|
| 372 |
+
dialogue_text: str,
|
| 373 |
+
reference_audio_codes: list[Optional[torch.Tensor]],
|
| 374 |
+
prompt_audio: torch.Tensor | None,
|
| 375 |
+
processor,
|
| 376 |
+
):
|
| 377 |
+
if prompt_audio is None:
|
| 378 |
+
return [[processor.build_user_message(text=dialogue_text)]], "generation", "Generation"
|
| 379 |
+
|
| 380 |
+
user_message = processor.build_user_message(
|
| 381 |
+
text=dialogue_text,
|
| 382 |
+
reference=reference_audio_codes,
|
| 383 |
+
)
|
| 384 |
+
return (
|
| 385 |
+
[
|
| 386 |
+
[
|
| 387 |
+
user_message,
|
| 388 |
+
processor.build_assistant_message(audio_codes_list=[prompt_audio]),
|
| 389 |
+
],
|
| 390 |
+
],
|
| 391 |
+
"continuation",
|
| 392 |
+
"voice_clone_and_continuation",
|
| 393 |
+
)
|
| 394 |
+
|
| 395 |
+
|
| 396 |
+
def run_inference(speaker_count: int, *all_inputs):
|
| 397 |
+
speaker_count = int(speaker_count)
|
| 398 |
+
speaker_count = max(MIN_SPEAKERS, min(MAX_SPEAKERS, speaker_count))
|
| 399 |
+
|
| 400 |
+
reference_audio_values = all_inputs[:MAX_SPEAKERS]
|
| 401 |
+
prompt_text_values = all_inputs[MAX_SPEAKERS : 2 * MAX_SPEAKERS]
|
| 402 |
+
dialogue_text = all_inputs[2 * MAX_SPEAKERS]
|
| 403 |
+
text_normalize, sample_rate_normalize, temperature, top_p, top_k, repetition_penalty, max_new_tokens, model_path, codec_path, device, attn_implementation = all_inputs[
|
| 404 |
+
2 * MAX_SPEAKERS + 1 :
|
| 405 |
+
]
|
| 406 |
+
|
| 407 |
+
started_at = time.monotonic()
|
| 408 |
+
model, processor, torch_device, sample_rate = load_backend(
|
| 409 |
+
model_path=str(model_path),
|
| 410 |
+
codec_path=str(codec_path),
|
| 411 |
+
device_str=str(device),
|
| 412 |
+
attn_implementation=str(attn_implementation),
|
| 413 |
+
)
|
| 414 |
+
|
| 415 |
+
text_normalize = bool(text_normalize)
|
| 416 |
+
sample_rate_normalize = bool(sample_rate_normalize)
|
| 417 |
+
|
| 418 |
+
normalized_dialogue = str(dialogue_text or "").strip()
|
| 419 |
+
if text_normalize:
|
| 420 |
+
normalized_dialogue = normalize_text(normalized_dialogue)
|
| 421 |
+
normalized_dialogue = _validate_dialogue_text(normalized_dialogue, speaker_count)
|
| 422 |
+
|
| 423 |
+
cloned_speakers: list[int] = []
|
| 424 |
+
loaded_clone_wavs: list[tuple[torch.Tensor, int]] = []
|
| 425 |
+
prompt_text_map: dict[int, str] = {}
|
| 426 |
+
for idx in range(speaker_count):
|
| 427 |
+
ref_audio = reference_audio_values[idx]
|
| 428 |
+
prompt_text = str(prompt_text_values[idx] or "").strip()
|
| 429 |
+
|
| 430 |
+
has_reference = bool(ref_audio)
|
| 431 |
+
has_prompt_text = bool(prompt_text)
|
| 432 |
+
if has_reference != has_prompt_text:
|
| 433 |
+
raise ValueError(
|
| 434 |
+
f"S{idx + 1} must provide both reference audio and prompt text together."
|
| 435 |
+
)
|
| 436 |
+
|
| 437 |
+
if has_reference:
|
| 438 |
+
speaker_id = idx + 1
|
| 439 |
+
ref_audio_path = str(ref_audio)
|
| 440 |
+
cloned_speakers.append(speaker_id)
|
| 441 |
+
loaded_clone_wavs.append(_load_audio(ref_audio_path))
|
| 442 |
+
prompt_text_map[speaker_id] = _normalize_prompt_text(prompt_text, speaker_id)
|
| 443 |
+
|
| 444 |
+
prompt_audio: Optional[torch.Tensor] = None
|
| 445 |
+
reference_audio_codes: list[Optional[torch.Tensor]] = []
|
| 446 |
+
conversation_text = normalized_dialogue
|
| 447 |
+
if cloned_speakers:
|
| 448 |
+
conversation_text = _build_prefixed_text(
|
| 449 |
+
dialogue_text=normalized_dialogue,
|
| 450 |
+
prompt_text_map=prompt_text_map,
|
| 451 |
+
cloned_speakers=cloned_speakers,
|
| 452 |
+
)
|
| 453 |
+
if text_normalize:
|
| 454 |
+
conversation_text = normalize_text(conversation_text)
|
| 455 |
+
conversation_text = _validate_dialogue_text(conversation_text, speaker_count)
|
| 456 |
+
|
| 457 |
+
if sample_rate_normalize:
|
| 458 |
+
min_sr = min(sr for _, sr in loaded_clone_wavs)
|
| 459 |
+
else:
|
| 460 |
+
min_sr = None
|
| 461 |
+
|
| 462 |
+
clone_wavs: list[torch.Tensor] = []
|
| 463 |
+
for wav, orig_sr in loaded_clone_wavs:
|
| 464 |
+
processed_wav = wav
|
| 465 |
+
current_sr = int(orig_sr)
|
| 466 |
+
if min_sr is not None:
|
| 467 |
+
processed_wav = _resample_wav(processed_wav, current_sr, int(min_sr))
|
| 468 |
+
current_sr = int(min_sr)
|
| 469 |
+
processed_wav = _resample_wav(processed_wav, current_sr, sample_rate)
|
| 470 |
+
clone_wavs.append(processed_wav)
|
| 471 |
+
|
| 472 |
+
reference_audio_codes = _encode_reference_audio_codes(
|
| 473 |
+
processor=processor,
|
| 474 |
+
clone_wavs=clone_wavs,
|
| 475 |
+
cloned_speakers=cloned_speakers,
|
| 476 |
+
speaker_count=speaker_count,
|
| 477 |
+
sample_rate=sample_rate,
|
| 478 |
+
)
|
| 479 |
+
concat_prompt_wav = torch.cat(clone_wavs, dim=-1)
|
| 480 |
+
prompt_audio = processor.encode_audios_from_wav([concat_prompt_wav], sampling_rate=sample_rate)[0]
|
| 481 |
+
|
| 482 |
+
conversations, mode, mode_name = build_conversation(
|
| 483 |
+
dialogue_text=conversation_text,
|
| 484 |
+
reference_audio_codes=reference_audio_codes,
|
| 485 |
+
prompt_audio=prompt_audio,
|
| 486 |
+
processor=processor,
|
| 487 |
+
)
|
| 488 |
+
|
| 489 |
+
batch = processor(conversations, mode=mode)
|
| 490 |
+
input_ids = batch["input_ids"].to(torch_device)
|
| 491 |
+
attention_mask = batch["attention_mask"].to(torch_device)
|
| 492 |
+
|
| 493 |
+
with torch.no_grad():
|
| 494 |
+
outputs = model.generate(
|
| 495 |
+
input_ids=input_ids,
|
| 496 |
+
attention_mask=attention_mask,
|
| 497 |
+
max_new_tokens=int(max_new_tokens),
|
| 498 |
+
audio_temperature=float(temperature),
|
| 499 |
+
audio_top_p=float(top_p),
|
| 500 |
+
audio_top_k=int(top_k),
|
| 501 |
+
audio_repetition_penalty=float(repetition_penalty),
|
| 502 |
+
)
|
| 503 |
+
|
| 504 |
+
messages = processor.decode(outputs)
|
| 505 |
+
if not messages or messages[0] is None:
|
| 506 |
+
raise RuntimeError("The model did not return a decodable audio result.")
|
| 507 |
+
|
| 508 |
+
audio = messages[0].audio_codes_list[0]
|
| 509 |
+
if isinstance(audio, torch.Tensor):
|
| 510 |
+
audio_np = audio.detach().float().cpu().numpy()
|
| 511 |
+
else:
|
| 512 |
+
audio_np = np.asarray(audio, dtype=np.float32)
|
| 513 |
+
|
| 514 |
+
if audio_np.ndim > 1:
|
| 515 |
+
audio_np = audio_np.reshape(-1)
|
| 516 |
+
audio_np = audio_np.astype(np.float32, copy=False)
|
| 517 |
+
|
| 518 |
+
clone_summary = "none" if not cloned_speakers else ",".join([f"S{i}" for i in cloned_speakers])
|
| 519 |
+
elapsed = time.monotonic() - started_at
|
| 520 |
+
status = (
|
| 521 |
+
f"Done | mode={mode_name} | speakers={speaker_count} | cloned={clone_summary} | elapsed={elapsed:.2f}s | "
|
| 522 |
+
f"text_normalize={text_normalize}, sample_rate_normalize={sample_rate_normalize} | "
|
| 523 |
+
f"max_new_tokens={int(max_new_tokens)}, "
|
| 524 |
+
f"audio_temperature={float(temperature):.2f}, audio_top_p={float(top_p):.2f}, "
|
| 525 |
+
f"audio_top_k={int(top_k)}, audio_repetition_penalty={float(repetition_penalty):.2f}"
|
| 526 |
+
)
|
| 527 |
+
return (sample_rate, audio_np), status
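`run_inference` receives every UI value as one flat positional tuple and recovers the groups by slicing on `MAX_SPEAKERS`. The sketch below shows only that slicing, with invented placeholder values; `split_inputs` is a hypothetical helper.

```python
MAX_SPEAKERS = 5


def split_inputs(all_inputs: tuple):
    """Hypothetical helper: split the flat tuple into refs, prompts, dialogue, rest."""
    refs = all_inputs[:MAX_SPEAKERS]
    prompts = all_inputs[MAX_SPEAKERS : 2 * MAX_SPEAKERS]
    dialogue = all_inputs[2 * MAX_SPEAKERS]
    rest = all_inputs[2 * MAX_SPEAKERS + 1 :]
    return refs, prompts, dialogue, rest


all_inputs = (
    "s1.wav", None, None, None, None,          # 5 reference audio slots
    "[S1] Prompt line.", "", "", "", "",       # 5 prompt text slots
    "[S1] Hello.\n[S2] Hi.",                   # dialogue text
    True, False, 1.1, 0.9, 50, 1.1, 2000,      # 7 sampling and normalization values
    "model-path", "codec-path", "cpu", "auto", # 4 values appended from CLI args
)
refs, prompts, dialogue, rest = split_inputs(all_inputs)
print(len(refs), len(prompts), len(rest))
print(rest[-1])
```

The trailing group must contain exactly eleven values for the tuple unpacking in `run_inference` to succeed, so the order of `inputs=[...]` in `run_btn.click` and the four arguments appended by the lambda both have to match it.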
|
| 528 |
+
|
| 529 |
+
|
| 530 |
+
def build_demo(args: argparse.Namespace):
|
| 531 |
+
custom_css = """
|
| 532 |
+
:root {
|
| 533 |
+
--bg: #f6f7f8;
|
| 534 |
+
--panel: #ffffff;
|
| 535 |
+
--ink: #111418;
|
| 536 |
+
--muted: #4d5562;
|
| 537 |
+
--line: #e5e7eb;
|
| 538 |
+
--accent: #0f766e;
|
| 539 |
+
}
|
| 540 |
+
.gradio-container {
|
| 541 |
+
background: linear-gradient(180deg, #f7f8fa 0%, #f3f5f7 100%);
|
| 542 |
+
color: var(--ink);
|
| 543 |
+
}
|
| 544 |
+
.app-card {
|
| 545 |
+
border: 1px solid var(--line);
|
| 546 |
+
border-radius: 16px;
|
| 547 |
+
background: var(--panel);
|
| 548 |
+
padding: 14px;
|
| 549 |
+
}
|
| 550 |
+
.app-title {
|
| 551 |
+
font-size: 22px;
|
| 552 |
+
font-weight: 700;
|
| 553 |
+
margin-bottom: 6px;
|
| 554 |
+
letter-spacing: 0.2px;
|
| 555 |
+
}
|
| 556 |
+
.app-subtitle {
|
| 557 |
+
color: var(--muted);
|
| 558 |
+
font-size: 14px;
|
| 559 |
+
margin-bottom: 8px;
|
| 560 |
+
}
|
| 561 |
+
#output_panel {
|
| 562 |
+
overflow: hidden !important;
|
| 563 |
+
}
|
| 564 |
+
#output_audio {
|
| 565 |
+
padding-bottom: 24px;
|
| 566 |
+
margin-bottom: 0;
|
| 567 |
+
overflow: hidden !important;
|
| 568 |
+
}
|
| 569 |
+
#output_audio > .wrap,
|
| 570 |
+
#output_audio .wrap,
|
| 571 |
+
#output_audio .audio-container,
|
| 572 |
+
#output_audio .block {
|
| 573 |
+
overflow: hidden !important;
|
| 574 |
+
}
|
| 575 |
+
#output_audio .audio-container {
|
| 576 |
+
padding-bottom: 10px;
|
| 577 |
+
min-height: 96px;
|
| 578 |
+
}
|
| 579 |
+
#output_audio_spacer {
|
| 580 |
+
height: 12px;
|
| 581 |
+
}
|
| 582 |
+
#output_status {
|
| 583 |
+
margin-top: 0;
|
| 584 |
+
}
|
| 585 |
+
#run-btn {
|
| 586 |
+
background: var(--accent);
|
| 587 |
+
border: none;
|
| 588 |
+
}
|
| 589 |
+
"""
|
| 590 |
+
|
| 591 |
+
with gr.Blocks(title="MOSS-TTSD Demo", css=custom_css) as demo:
|
| 592 |
+
gr.Markdown(
|
| 593 |
+
"""
|
| 594 |
+
<div class="app-card">
|
| 595 |
+
<div class="app-title">MOSS-TTSD</div>
|
| 596 |
+
<div class="app-subtitle">Multi-speaker dialogue synthesis with optional per-speaker voice cloning.</div>
|
| 597 |
+
</div>
|
| 598 |
+
"""
|
| 599 |
+
)
|
| 600 |
+
|
| 601 |
+
speaker_panels: list[gr.Group] = []
|
| 602 |
+
speaker_refs = []
|
| 603 |
+
speaker_prompts = []
|
| 604 |
+
|
| 605 |
+
with gr.Row(equal_height=False):
|
| 606 |
+
with gr.Column(scale=3):
|
| 607 |
+
speaker_count = gr.Slider(
|
| 608 |
+
minimum=MIN_SPEAKERS,
|
| 609 |
+
maximum=MAX_SPEAKERS,
|
| 610 |
+
step=1,
|
| 611 |
+
value=2,
|
| 612 |
+
label="Speaker Count",
|
| 613 |
+
info="Default 2 speakers. Minimum 1, maximum 5.",
|
| 614 |
+
)
|
| 615 |
+
|
| 616 |
+
gr.Markdown("### Voice Cloning (Optional, placed first)")
|
| 617 |
+
gr.Markdown(
|
| 618 |
+
"If you provide reference audio for a speaker, you must also provide that speaker's prompt text. "
|
| 619 |
+
"Prompt text may omit [Sx]; the app will auto-prepend it."
|
| 620 |
+
)
|
| 621 |
+
|
| 622 |
+
for idx in range(1, MAX_SPEAKERS + 1):
|
| 623 |
+
with gr.Group(visible=idx <= 2) as panel:
|
| 624 |
+
speaker_ref = gr.Audio(
|
| 625 |
+
label=f"S{idx} Reference Audio (Optional)",
|
| 626 |
+
type="filepath",
|
| 627 |
+
)
|
| 628 |
+
speaker_prompt = gr.Textbox(
|
| 629 |
+
label=f"S{idx} Prompt Text (Required with reference audio)",
|
| 630 |
+
lines=2,
|
| 631 |
+
placeholder=f"Example: [S{idx}] This is a prompt line for S{idx}.",
|
| 632 |
+
)
|
| 633 |
+
speaker_panels.append(panel)
|
| 634 |
+
speaker_refs.append(speaker_ref)
|
| 635 |
+
speaker_prompts.append(speaker_prompt)
|
| 636 |
+
|
| 637 |
+
gr.Markdown("### Multi-turn Dialogue")
|
| 638 |
+
dialogue_text = gr.Textbox(
|
| 639 |
+
label="Dialogue Text",
|
| 640 |
+
lines=12,
|
| 641 |
+
placeholder=(
|
| 642 |
+
"Use explicit tags in a single box, e.g.\n"
|
| 643 |
+
"[S1] Hello.\n"
+                        "[S2] Hi, how are you?\n"
+                        "[S1] Great, let's continue."
+                    ),
+                )
+                gr.Markdown(
+                    "Without any reference audio, the model runs in generation mode. "
+                    "Once any reference audio is provided, the model switches to voice-clone continuation mode."
+                )
+
+                with gr.Accordion("Sampling Parameters (Audio)", open=True):
+                    gr.Markdown(
+                        "- `text_normalize`: Normalize input text (**recommended to always enable**).\n"
+                        "- `sample_rate_normalize`: Resample prompt audios to the lowest sample rate before encoding "
+                        "(**recommended when using 2 or more speakers**)."
+                    )
+                    text_normalize = gr.Checkbox(
+                        value=True,
+                        label="text_normalize",
+                    )
+                    sample_rate_normalize = gr.Checkbox(
+                        value=False,
+                        label="sample_rate_normalize",
+                    )
+                    temperature = gr.Slider(
+                        minimum=0.1,
+                        maximum=3.0,
+                        step=0.05,
+                        value=1.1,
+                        label="temperature",
+                    )
+                    top_p = gr.Slider(
+                        minimum=0.1,
+                        maximum=1.0,
+                        step=0.01,
+                        value=0.9,
+                        label="top_p",
+                    )
+                    top_k = gr.Slider(
+                        minimum=1,
+                        maximum=200,
+                        step=1,
+                        value=50,
+                        label="top_k",
+                    )
+                    repetition_penalty = gr.Slider(
+                        minimum=0.8,
+                        maximum=2.0,
+                        step=0.05,
+                        value=1.1,
+                        label="repetition_penalty",
+                    )
+                    max_new_tokens = gr.Slider(
+                        minimum=256,
+                        maximum=8192,
+                        step=128,
+                        value=DEFAULT_MAX_NEW_TOKENS,
+                        label="max_new_tokens",
+                    )
+
+                run_btn = gr.Button("Generate Dialogue Audio", variant="primary", elem_id="run-btn")
+
+            with gr.Column(scale=2, elem_id="output_panel"):
+                output_audio = gr.Audio(label="Output Audio", type="numpy", elem_id="output_audio")
+                gr.HTML("", elem_id="output_audio_spacer")
+                status = gr.Textbox(label="Status", lines=4, interactive=False, elem_id="output_status")
+                preset_examples = gr.Dataframe(
+                    headers=["Field", "Value (click any row to fill inputs)"],
+                    value=PRESET_TABLE_ROWS,
+                    datatype=["str", "str"],
+                    row_count=(len(PRESET_TABLE_ROWS), "fixed"),
+                    col_count=(2, "fixed"),
+                    interactive=False,
+                    wrap=True,
+                    label="Preset Examples",
+                )
+
+        speaker_count.change(
+            fn=update_speaker_panels,
+            inputs=[speaker_count],
+            outputs=speaker_panels,
+        )
+        preset_examples.select(
+            fn=apply_preset_selection,
+            outputs=[
+                speaker_count,
+                speaker_refs[0],
+                speaker_prompts[0],
+                speaker_refs[1],
+                speaker_prompts[1],
+                dialogue_text,
+                *speaker_panels,
+            ],
+        )
+
+        run_btn.click(
+            fn=lambda speaker_count, *inputs: run_inference(
+                speaker_count,
+                *inputs,
+                args.model_path,
+                args.codec_path,
+                args.device,
+                args.attn_implementation,
+            ),
+            inputs=[
+                speaker_count,
+                *speaker_refs,
+                *speaker_prompts,
+                dialogue_text,
+                text_normalize,
+                sample_rate_normalize,
+                temperature,
+                top_p,
+                top_k,
+                repetition_penalty,
+                max_new_tokens,
+            ],
+            outputs=[output_audio, status],
+        )
+    return demo
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="MOSS-TTSD Gradio Demo")
+    parser.add_argument("--model_path", type=str, default=MODEL_PATH)
+    parser.add_argument("--codec_path", type=str, default=CODEC_MODEL_PATH)
+    parser.add_argument("--device", type=str, default="cuda:0")
+    parser.add_argument("--attn_implementation", type=str, default=DEFAULT_ATTN_IMPLEMENTATION)
+    parser.add_argument("--host", type=str, default="0.0.0.0")
+    parser.add_argument("--port", type=int, default=7863)
+    parser.add_argument("--share", action="store_true")
+    args = parser.parse_args()
+
+    runtime_device = torch.device(args.device if torch.cuda.is_available() else "cpu")
+    runtime_dtype = torch.bfloat16 if runtime_device.type == "cuda" else torch.float32
+    args.attn_implementation = resolve_attn_implementation(
+        requested=args.attn_implementation,
+        device=runtime_device,
+        dtype=runtime_dtype,
+    ) or "none"
+    print(f"[INFO] Using attn_implementation={args.attn_implementation}", flush=True)
+
+    preload_started_at = time.monotonic()
+    print(
+        f"[Startup] Preloading backend: model={args.model_path}, codec={args.codec_path}, "
+        f"device={args.device}, attn={args.attn_implementation}",
+        flush=True,
+    )
+    load_backend(
+        model_path=args.model_path,
+        codec_path=args.codec_path,
+        device_str=args.device,
+        attn_implementation=args.attn_implementation,
+    )
+    print(
+        f"[Startup] Backend preload finished in {time.monotonic() - preload_started_at:.2f}s",
+        flush=True,
+    )
+
+    demo = build_demo(args)
+    demo.queue(default_concurrency_limit=2).launch(
+        server_name=args.host,
+        server_port=args.port,
+        share=args.share,
+    )
+
+
+if __name__ == "__main__":
+    main()
clis/moss_voice_generator_app.py
ADDED
@@ -0,0 +1,410 @@
+import argparse
+import functools
+import importlib.util
+import json
+from pathlib import Path
+import re
+import time
+
+import gradio as gr
+import numpy as np
+import torch
+from transformers import AutoModel, AutoProcessor
+
+# Disable the broken cuDNN SDPA backend
+torch.backends.cuda.enable_cudnn_sdp(False)
+# Keep these enabled as fallbacks
+torch.backends.cuda.enable_flash_sdp(True)
+torch.backends.cuda.enable_mem_efficient_sdp(True)
+torch.backends.cuda.enable_math_sdp(True)
+
+MODEL_PATH = "OpenMOSS-Team/MOSS-VoiceGenerator"
+DEFAULT_ATTN_IMPLEMENTATION = "auto"
+DEFAULT_MAX_NEW_TOKENS = 4096
+EXAMPLE_TEXTS_JSONL_PATH = (
+    Path(__file__).resolve().parent.parent / "assets" / "text" / "moss_voice_generator_example_texts.jsonl"
+)
+
+
+def _parse_example_id(example_id: str) -> tuple[str, int] | None:
+    matched = re.fullmatch(r"(zh|en)/(\d+)", (example_id or "").strip())
+    if matched is None:
+        return None
+    return matched.group(1), int(matched.group(2))
+
+
+def build_example_rows() -> list[tuple[str, str, str]]:
+    rows: list[tuple[str, int, str, str]] = []
+    with open(EXAMPLE_TEXTS_JSONL_PATH, "r", encoding="utf-8") as f:
+        for line in f:
+            if not line.strip():
+                continue
+            sample = json.loads(line)
+            parsed = _parse_example_id(sample.get("id", ""))
+            if parsed is None:
+                continue
+
+            language, index = parsed
+            instruction = str(sample.get("instruction", "")).strip()
+            text = str(sample.get("text", "")).strip()
+            rows.append((language, index, instruction, text))
+
+    language_order = {"zh": 0, "en": 1}
+    rows.sort(key=lambda item: (language_order.get(item[0], 99), item[1]))
+    return [(f"{language}/{index}", instruction, text) for language, index, instruction, text in rows]
+
+
+EXAMPLE_ROWS = build_example_rows()
+
+
+def apply_example_selection(evt: gr.SelectData):
+    if evt is None or evt.index is None:
+        return gr.update(), gr.update()
+
+    if isinstance(evt.index, (tuple, list)):
+        row_idx = int(evt.index[0])
+    else:
+        row_idx = int(evt.index)
+
+    if row_idx < 0 or row_idx >= len(EXAMPLE_ROWS):
+        return gr.update(), gr.update()
+
+    _, instruction_value, text_value = EXAMPLE_ROWS[row_idx]
+    return instruction_value, text_value
+
+
+def resolve_attn_implementation(requested: str, device: torch.device, dtype: torch.dtype) -> str | None:
+    requested_norm = (requested or "").strip().lower()
+
+    if requested_norm in {"none"}:
+        return None
+
+    if requested_norm not in {"", "auto"}:
+        return requested
+
+    # Prefer FlashAttention 2 when package + device conditions are met.
+    if (
+        device.type == "cuda"
+        and importlib.util.find_spec("flash_attn") is not None
+        and dtype in {torch.float16, torch.bfloat16}
+    ):
+        major, _ = torch.cuda.get_device_capability(device)
+        if major >= 8:
+            return "flash_attention_2"
+
+    # CUDA fallback: use PyTorch SDPA kernels.
+    if device.type == "cuda":
+        return "sdpa"
+
+    # CPU fallback.
+    return "eager"
+
+
+@functools.lru_cache(maxsize=1)
+def load_backend(model_path: str, device_str: str, attn_implementation: str):
+    device = torch.device(device_str if torch.cuda.is_available() else "cpu")
+    dtype = torch.bfloat16 if device.type == "cuda" else torch.float32
+    resolved_attn_implementation = resolve_attn_implementation(
+        requested=attn_implementation,
+        device=device,
+        dtype=dtype,
+    )
+
+    processor = AutoProcessor.from_pretrained(
+        model_path,
+        trust_remote_code=True,
+        normalize_inputs=True,
+    )
+    if hasattr(processor, "audio_tokenizer"):
+        processor.audio_tokenizer = processor.audio_tokenizer.to(device)
+
+    model_kwargs = {
+        "trust_remote_code": True,
+        "torch_dtype": dtype,
+    }
+    if resolved_attn_implementation:
+        model_kwargs["attn_implementation"] = resolved_attn_implementation
+
+    model = AutoModel.from_pretrained(model_path, **model_kwargs).to(device)
+    model.eval()
+
+    sample_rate = int(getattr(processor.model_config, "sampling_rate", 24000))
+    return model, processor, device, sample_rate
+
+
+def build_conversation(text: str, instruction: str, processor):
+    text = (text or "").strip()
+    instruction = (instruction or "").strip()
+    if not text:
+        raise ValueError("Please enter text to synthesize.")
+    if not instruction:
+        raise ValueError("Please enter a voice instruction.")
+
+    return [[processor.build_user_message(text=text, instruction=instruction)]]
+
+
+def run_inference(
+    text: str,
+    instruction: str,
+    temperature: float,
+    top_p: float,
+    top_k: int,
+    repetition_penalty: float,
+    max_new_tokens: int,
+    model_path: str,
+    device: str,
+    attn_implementation: str,
+):
+    started_at = time.monotonic()
+    model, processor, torch_device, sample_rate = load_backend(
+        model_path=model_path,
+        device_str=device,
+        attn_implementation=attn_implementation,
+    )
+
+    conversations = build_conversation(
+        text=text,
+        instruction=instruction,
+        processor=processor,
+    )
+
+    batch = processor(conversations, mode="generation")
+    input_ids = batch["input_ids"].to(torch_device)
+    attention_mask = batch["attention_mask"].to(torch_device)
+
+    with torch.no_grad():
+        outputs = model.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=int(max_new_tokens),
+            audio_temperature=float(temperature),
+            audio_top_p=float(top_p),
+            audio_top_k=int(top_k),
+            audio_repetition_penalty=float(repetition_penalty),
+        )
+
+    messages = processor.decode(outputs)
+    if not messages or messages[0] is None:
+        raise RuntimeError("The model did not return a decodable audio result.")
+
+    audio = messages[0].audio_codes_list[0]
+    if isinstance(audio, torch.Tensor):
+        audio_np = audio.detach().float().cpu().numpy()
+    else:
+        audio_np = np.asarray(audio, dtype=np.float32)
+
+    if audio_np.ndim > 1:
+        audio_np = audio_np.reshape(-1)
+    audio_np = audio_np.astype(np.float32, copy=False)
+
+    elapsed = time.monotonic() - started_at
+    status = (
+        f"Done | elapsed: {elapsed:.2f}s | "
+        f"max_new_tokens={int(max_new_tokens)}, "
+        f"audio_temperature={float(temperature):.2f}, audio_top_p={float(top_p):.2f}, "
+        f"audio_top_k={int(top_k)}, audio_repetition_penalty={float(repetition_penalty):.2f}"
+    )
+    return (sample_rate, audio_np), status
+
+
+def build_demo(args: argparse.Namespace):
+    custom_css = """
+    :root {
+      --bg: #f6f7f8;
+      --panel: #ffffff;
+      --ink: #111418;
+      --muted: #4d5562;
+      --line: #e5e7eb;
+      --accent: #0f766e;
+    }
+    .gradio-container {
+      background: linear-gradient(180deg, #f7f8fa 0%, #f3f5f7 100%);
+      color: var(--ink);
+    }
+    .app-card {
+      border: 1px solid var(--line);
+      border-radius: 16px;
+      background: var(--panel);
+      padding: 14px;
+    }
+    .app-title {
+      font-size: 22px;
+      font-weight: 700;
+      margin-bottom: 6px;
+      letter-spacing: 0.2px;
+    }
+    .app-subtitle {
+      color: var(--muted);
+      font-size: 14px;
+      margin-bottom: 8px;
+    }
+    #output_audio {
+      padding-bottom: 12px;
+      margin-bottom: 8px;
+      overflow: hidden !important;
+    }
+    #output_audio > .wrap {
+      overflow: hidden !important;
+    }
+    #output_audio audio {
+      margin-bottom: 6px;
+    }
+    #run-btn {
+      background: var(--accent);
+      border: none;
+    }
+    """
+
+    with gr.Blocks(title="MOSS-VoiceGenerator Demo", css=custom_css) as demo:
+        gr.Markdown(
+            """
+            <div class="app-card">
+              <div class="app-title">MOSS-VoiceGenerator</div>
+              <div class="app-subtitle">Design expressive voices from instruction + text without reference audio.</div>
+            </div>
+            """
+        )
+
+        with gr.Row(equal_height=False):
+            with gr.Column(scale=3):
+                instruction = gr.Textbox(
+                    label="Voice Instruction",
+                    lines=5,
+                    placeholder="Example: Warm, gentle female narrator voice with calm pacing and clear articulation.",
+                )
+                text = gr.Textbox(
+                    label="Text",
+                    lines=8,
+                    placeholder="Enter the text content to synthesize with the instruction-defined voice.",
+                )
+
+                with gr.Accordion("Sampling Parameters (Audio)", open=True):
+                    temperature = gr.Slider(
+                        minimum=0.1,
+                        maximum=3.0,
+                        step=0.05,
+                        value=1.5,
+                        label="temperature",
+                    )
+                    top_p = gr.Slider(
+                        minimum=0.1,
+                        maximum=1.0,
+                        step=0.01,
+                        value=0.6,
+                        label="top_p",
+                    )
+                    top_k = gr.Slider(
+                        minimum=1,
+                        maximum=200,
+                        step=1,
+                        value=50,
+                        label="top_k",
+                    )
+                    repetition_penalty = gr.Slider(
+                        minimum=0.8,
+                        maximum=2.0,
+                        step=0.05,
+                        value=1.1,
+                        label="repetition_penalty",
+                    )
+                    max_new_tokens = gr.Slider(
+                        minimum=256,
+                        maximum=8192,
+                        step=128,
+                        value=DEFAULT_MAX_NEW_TOKENS,
+                        label="max_new_tokens",
+                    )
+
+                run_btn = gr.Button("Generate Voice", variant="primary", elem_id="run-btn")
+
+            with gr.Column(scale=2):
+                output_audio = gr.Audio(label="Output Audio", type="numpy", elem_id="output_audio")
+                status = gr.Textbox(label="Status", lines=4, interactive=False)
+                examples_table = gr.Dataframe(
+                    headers=["Voice Instruction", "Example Text"],
+                    value=[[example_instruction, example_text] for _, example_instruction, example_text in EXAMPLE_ROWS],
+                    datatype=["str", "str"],
+                    row_count=(len(EXAMPLE_ROWS), "fixed"),
+                    col_count=(2, "fixed"),
+                    interactive=False,
+                    wrap=True,
+                    label="Examples (click a row to fill inputs)",
+                )
+
+        examples_table.select(
+            fn=apply_example_selection,
+            inputs=[],
+            outputs=[instruction, text],
+        )
+
+        run_btn.click(
+            fn=lambda text, instruction, temperature, top_p, top_k, repetition_penalty, max_new_tokens: run_inference(
+                text=text,
+                instruction=instruction,
+                temperature=temperature,
+                top_p=top_p,
+                top_k=top_k,
+                repetition_penalty=repetition_penalty,
+                max_new_tokens=max_new_tokens,
+                model_path=args.model_path,
+                device=args.device,
+                attn_implementation=args.attn_implementation,
+            ),
+            inputs=[
+                text,
+                instruction,
+                temperature,
+                top_p,
+                top_k,
+                repetition_penalty,
+                max_new_tokens,
+            ],
+            outputs=[output_audio, status],
+        )
+    return demo
+
+
+def main():
+    parser = argparse.ArgumentParser(description="MOSS-VoiceGenerator Gradio Demo")
+    parser.add_argument("--model_path", type=str, default=MODEL_PATH)
+    parser.add_argument("--device", type=str, default="cuda:0")
+    parser.add_argument("--attn_implementation", type=str, default=DEFAULT_ATTN_IMPLEMENTATION)
+    parser.add_argument("--host", type=str, default="0.0.0.0")
+    parser.add_argument("--port", type=int, default=7862)
+    parser.add_argument("--share", action="store_true")
+    args = parser.parse_args()
+
+    runtime_device = torch.device(args.device if torch.cuda.is_available() else "cpu")
+    runtime_dtype = torch.bfloat16 if runtime_device.type == "cuda" else torch.float32
+    args.attn_implementation = resolve_attn_implementation(
+        requested=args.attn_implementation,
+        device=runtime_device,
+        dtype=runtime_dtype,
+    ) or "none"
+    print(f"[INFO] Using attn_implementation={args.attn_implementation}", flush=True)
+
+    preload_started_at = time.monotonic()
+    print(
+        f"[Startup] Preloading backend: model={args.model_path}, device={args.device}, attn={args.attn_implementation}",
+        flush=True,
+    )
+    load_backend(
+        model_path=args.model_path,
+        device_str=args.device,
+        attn_implementation=args.attn_implementation,
+    )
+    print(
+        f"[Startup] Backend preload finished in {time.monotonic() - preload_started_at:.2f}s",
+        flush=True,
+    )
+
+    demo = build_demo(args)
+    demo.queue(max_size=16, default_concurrency_limit=1).launch(
+        server_name=args.host,
+        server_port=args.port,
+        share=args.share,
+    )
+
+
+if __name__ == "__main__":
+    main()
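The ordering that `build_example_rows` in `clis/moss_voice_generator_app.py` applies to example ids can be illustrated in isolation. The sketch below is a standalone approximation rather than the app's code: `parse_example_id` is a local copy of the parsing rule, and the ids in `sample_ids` are made up for illustration (the real ones come from the JSONL asset file). It shows that ids that do not match `zh/<n>` or `en/<n>` are skipped, that `zh` rows sort before `en` rows, and that indexes are compared numerically rather than as strings.

```python
import re


def parse_example_id(example_id):
    """Return (language, index) for ids like "zh/3", or None if the id does not match."""
    matched = re.fullmatch(r"(zh|en)/(\d+)", (example_id or "").strip())
    if matched is None:
        return None
    return matched.group(1), int(matched.group(2))


# Hypothetical ids, chosen to exercise skipping, whitespace stripping, and numeric order.
sample_ids = ["en/10", "zh/2", "en/2", "fr/1", "zh/1", " en/1 "]

language_order = {"zh": 0, "en": 1}
parsed = [p for p in map(parse_example_id, sample_ids) if p is not None]
# Sort by language rank first, then by the integer index.
parsed.sort(key=lambda item: (language_order.get(item[0], 99), item[1]))

ordered_ids = [f"{language}/{index}" for language, index in parsed]
print(ordered_ids)
```

Because the index is converted to `int` before sorting, `en/10` lands after `en/2`; a plain string sort would place it before.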
configs/llama_cpp/cpu-only.yaml
ADDED
@@ -0,0 +1,45 @@
+# MOSS-TTS-Delay — llama.cpp Backend (CPU-only)
+#
+# For machines without GPU. All computation runs on CPU.
+# Performance will be slower but no CUDA/GPU drivers required.
+#
+# Installation:
+# pip install -e ".[llama-cpp]"
+# pip install onnxruntime # CPU-only ONNX Runtime
+
+# ── Model paths ──────────────────────────────────────────────────────────────
+
+backbone_gguf: weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf
+embedding_dir: weights/MOSS-TTS-GGUF/embeddings
+lm_head_dir: weights/MOSS-TTS-GGUF/lm_heads
+tokenizer_dir: weights/MOSS-TTS-GGUF/tokenizer
+
+# ── Audio tokenizer ──────────────────────────────────────────────────────────
+
+audio_backend: onnx
+audio_encoder_onnx: weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx
+audio_decoder_onnx: weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx
+
+# ── LM heads backend ────────────────────────────────────────────────────────
+
+heads_backend: numpy
+
+# ── Runtime settings ─────────────────────────────────────────────────────────
+
+n_ctx: 4096
+n_batch: 256
+n_threads: 8
+n_gpu_layers: 0
+max_new_tokens: 3072
+use_gpu_audio: false
+
+# ── Sampling parameters ──────────────────────────────────────────────────────
+
+text_temperature: 1.5
+text_top_p: 1.0
+text_top_k: 50
+
+audio_temperature: 1.7
+audio_top_p: 0.8
+audio_top_k: 25
+audio_repetition_penalty: 1.0
configs/llama_cpp/default.yaml
ADDED
@@ -0,0 +1,70 @@
+# MOSS-TTS-Delay — llama.cpp Backend (ONNX audio, default)
+#
+# Torch-free minimal installation:
+# pip install -e ".[llama-cpp-onnx]"
+#
+# Download pre-quantized weights:
+# huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF
+#
+# Download ONNX audio tokenizer:
+# huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX
+#
+# All paths are relative to the project root.
+
+# ── Model paths ──────────────────────────────────────────────────────────────
+
+# Pre-quantized GGUF backbone (Q4_K_M)
+backbone_gguf: weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf
+
+# Pre-extracted .npy embedding tables and LM head weights
+embedding_dir: weights/MOSS-TTS-GGUF/embeddings
+lm_head_dir: weights/MOSS-TTS-GGUF/lm_heads
+
+# Tokenizer directory (ships with GGUF repo)
+tokenizer_dir: weights/MOSS-TTS-GGUF/tokenizer
+
+# ── Audio tokenizer ──────────────────────────────────────────────────────────
+# We provide ONNX models only. For TensorRT, build engines yourself
+# (see moss_audio_tokenizer/trt/build_engine.sh).
+
+audio_backend: onnx
+audio_encoder_onnx: weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx
+audio_decoder_onnx: weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx
+
+# ── LM heads backend ────────────────────────────────────────────────────────
+# "auto" = use torch if available, else numpy
+# "numpy" = force numpy (torch-free)
+# "torch" = force torch (error if unavailable)
+
+heads_backend: auto
+
+# ── Runtime settings ─────────────────────────────────────────────────────────
+
+n_ctx: 4096
+n_batch: 512
+n_threads: 4
+n_gpu_layers: -1
+max_new_tokens: 3072
+use_gpu_audio: true
+
+# ── KV cache / attention ────────────────────────────────────────────────────
+# Quantized KV cache saves VRAM. q8_0 is nearly lossless; q4_0 needs eval.
+# Options: f32, f16, bf16, q8_0, q5_0, q4_0
+# kv_cache_type_k: f16
+# kv_cache_type_v: f16
+
+# Flash attention reduces peak VRAM during prefill. "auto" lets llama.cpp
+# decide; "enabled" forces it on (recommended when CUDA is available).
+# Options: auto, enabled, disabled
+# flash_attn: auto
+
+# ── Sampling parameters ──────────────────────────────────────────────────────
+
+text_temperature: 1.5
+text_top_p: 1.0
+text_top_k: 50
+
+audio_temperature: 1.7
+audio_top_p: 0.8
+audio_top_k: 25
+audio_repetition_penalty: 1.0
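The three `heads_backend` values documented in `configs/llama_cpp/default.yaml` ("auto", "numpy", "torch") describe a small selection rule. The sketch below restates that rule as code, assuming nothing beyond the comments in the config: `resolve_heads_backend` and its `torch_available` parameter are hypothetical names introduced here for illustration, and the project's actual loader may be structured differently.

```python
import importlib.util
from typing import Optional


def resolve_heads_backend(requested: str, torch_available: Optional[bool] = None) -> str:
    """Map a `heads_backend` setting to "torch" or "numpy" (hypothetical helper).

    `torch_available` can be passed explicitly for testing; when left as None,
    availability is detected by looking for an importable `torch` package.
    """
    if torch_available is None:
        torch_available = importlib.util.find_spec("torch") is not None

    choice = (requested or "auto").strip().lower()
    if choice == "numpy":
        # Force numpy (torch-free), regardless of what is installed.
        return "numpy"
    if choice == "torch":
        # Force torch; the config comment says this is an error if unavailable.
        if not torch_available:
            raise RuntimeError("heads_backend='torch' was requested but torch is not installed")
        return "torch"
    if choice == "auto":
        # Use torch if available, else fall back to numpy.
        return "torch" if torch_available else "numpy"
    raise ValueError(f"Unknown heads_backend: {requested!r}")


print(resolve_heads_backend("auto", torch_available=False))
```

Passing availability in as a parameter keeps the rule testable on machines with or without torch installed.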
configs/llama_cpp/trt-8gb.yaml
ADDED
@@ -0,0 +1,68 @@
+# MOSS-TTS-Delay — 8 GB GPU config (low-memory mode)
+#
+# Uses staged loading (low_memory: true) to load/unload components per stage.
+# Peak VRAM ≈ 5.6 GB (backbone stage), well within 8 GB.
+#
+# Run with:
+# PYTHONPATH=. python -m moss_tts_delay.llama_cpp \
+# --config configs/llama_cpp/trt-8gb.yaml --text "Hello!" --profile
+#
+#
+# Measured on H100 (transferable to any 8 GB+ GPU):
+# Backbone (Q4_K_M, 36 layers, KV@4096, flash_attn) ≈ 5.6 GB
+# TRT encoder alone ≈ 5.3 GB
+# TRT decoder alone ≈ 5.3 GB
+# Peak (staged) ≈ 5.6 GB
+
+# ── Model paths ──────────────────────────────────────────────────────────────
+
+backbone_gguf: weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf
+embedding_dir: weights/MOSS-TTS-GGUF/embeddings
+lm_head_dir: weights/MOSS-TTS-GGUF/lm_heads
+tokenizer_dir: weights/MOSS-TTS-GGUF/tokenizer
+
+# ── Audio tokenizer ──────────────────────────────────────────────────────────
+# Switch to onnx + audio_encoder_onnx/audio_decoder_onnx if TRT not available.
+
+audio_backend: trt
+audio_encoder_trt: weights/MOSS-Audio-Tokenizer-TRT/encoder.engine
+audio_decoder_trt: weights/MOSS-Audio-Tokenizer-TRT/decoder.engine
+
+# ── LM heads backend ────────────────────────────────────────────────────────
+# numpy = 0 GPU, ~3 GB RAM. Mandatory for 8 GB to keep backbone headroom.
+
+heads_backend: numpy
+
+# ── Runtime settings ─────────────────────────────────────────────────────────
+# n_ctx=4096 keeps KV cache at ~576 MB (fp16). Enough for most single
+# utterances. If you hit context-length errors, increase to 6144/8192
+# but check that backbone still fits (KV grows ~144 MB per 1024 tokens).
+
+n_ctx: 4096
+n_batch: 512
+n_threads: 4
+n_gpu_layers: -1
+max_new_tokens: 3072
+use_gpu_audio: true
+low_memory: true
+
+# ── KV cache / attention ────────────────────────────────────────────────────
+# Quantized KV cache saves ~0.45 GB (q8_0) or ~0.72 GB (q4_0) at n_ctx=4096.
+# Options: f32, f16, bf16, q8_0, q5_0, q4_0
+kv_cache_type_k: f16
+kv_cache_type_v: f16
+
+# Flash attention reduces peak VRAM during prefill (0.5–2 GB for long prompts).
+# Options: auto, enabled, disabled
+flash_attn: enabled
+
+# ── Sampling parameters ──────────────────────────────────────────────────────
+
+text_temperature: 1.5
+text_top_p: 1.0
+text_top_k: 50
+
+audio_temperature: 1.7
+audio_top_p: 0.8
+audio_top_k: 25
+audio_repetition_penalty: 1.0
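The fp16 KV-cache figures quoted in the `trt-8gb.yaml` comments (about 576 MB at `n_ctx=4096`, growing by about 144 MB per 1024 tokens) can be reproduced with a back-of-the-envelope estimate. The sketch below is illustrative only. The 36 layers come from the config comment, but the 8 KV heads and the head dimension of 128 are assumptions picked because they reproduce the quoted numbers; neither value is stated in this commit. `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 36,       # from the config comment ("36 layers")
    n_kv_heads: int = 8,      # ASSUMPTION: not stated in the config
    head_dim: int = 128,      # ASSUMPTION: not stated in the config
    bytes_per_elem: int = 2,  # fp16 / bf16 storage
) -> int:
    """Estimate KV-cache size: one K and one V vector per KV head, per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx


MIB = 1024 ** 2
for n_ctx in (4096, 6144, 8192):
    print(f"n_ctx={n_ctx}: ~{kv_cache_bytes(n_ctx) / MIB:.0f} MiB")
```

Under these assumptions the estimate is 576 MiB at 4096 tokens, 864 MiB at 6144, and 1152 MiB at 8192, which is the kind of check the comment suggests before raising `n_ctx` on an 8 GB card.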
configs/llama_cpp/trt.yaml
ADDED
@@ -0,0 +1,54 @@
+# MOSS-TTS-Delay — llama.cpp Backend (TensorRT audio, maximum performance)
+#
+# ⚠️ TRT engines are NOT provided pre-built — you must build them yourself
+# from the ONNX models. TRT engines are tied to your specific GPU architecture
+# and TensorRT version.
+#
+# Requirements:
+# pip install -e ".[llama-cpp-trt,llama-cpp-torch]"
+#
+# Build engines:
+# bash moss_audio_tokenizer/trt/build_engine.sh \
+# weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx \
+# weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx \
+# weights/MOSS-Audio-Tokenizer-TRT
+#
+# ⚠️ IMPORTANT: maxShapes in build_engine.sh controls the maximum audio length
+# your engine can handle. See the script's comments for details.
+
+# ── Model paths ──────────────────────────────────────────────────────────────
+
+backbone_gguf: weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf
+embedding_dir: weights/MOSS-TTS-GGUF/embeddings
+lm_head_dir: weights/MOSS-TTS-GGUF/lm_heads
+tokenizer_dir: weights/MOSS-TTS-GGUF/tokenizer
+
+# ── Audio tokenizer ──────────────────────────────────────────────────────────
+
+audio_backend: trt
+audio_encoder_trt: weights/MOSS-Audio-Tokenizer-TRT/encoder.engine
+audio_decoder_trt: weights/MOSS-Audio-Tokenizer-TRT/decoder.engine
+
+# ── LM heads backend ────────────────────────────────────────────────────────
+
+heads_backend: auto
+
+# ── Runtime settings ─────────────────────────────────────────────────────────
+
+n_ctx: 4096
+n_batch: 512
+n_threads: 32
+n_gpu_layers: -1
+max_new_tokens: 3072
+use_gpu_audio: true
+
+# ── Sampling parameters ──────────────────────────────────────────────────────
+
+text_temperature: 1.5
+text_top_p: 1.0
+text_top_k: 50
+
+audio_temperature: 1.7
+audio_top_p: 0.8
+audio_top_k: 25
+audio_repetition_penalty: 1.0
docs/moss_sound_effect_model_card.md
ADDED
|
@@ -0,0 +1,142 @@
|
| 1 |
+
# MOSS-SoundEffect Model Card
|
| 2 |
+
|
| 3 |
+
**MOSS-SoundEffect** is the **environment sound & sound effect generation model** in the **MOSS‑TTS Family**. It generates ambient soundscapes and concrete sound effects directly from text descriptions, and is designed to complement speech content with immersive context in production workflows.
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
## 1. Overview
|
| 7 |
+
|
| 8 |
+
### 1.1 TTS Family Positioning
|
| 9 |
+
|
| 10 |
+
MOSS-SoundEffect is designed as an audio generation backbone for creating high-fidelity environmental and action sounds from text, serving both scalable content pipelines and a strong research baseline for controllable audio generation.
|
| 11 |
+
|
| 12 |
+
**Design goals**
|
| 13 |
+
* **Coverage & richness**: broad sound taxonomy with layered ambience and realistic texture
|
| 14 |
+
* **Composability**: easy integration into creative pipelines (games/film/tools) and synthetic data generation setups
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
### 1.2 Key Capabilities
|
| 18 |
+
MOSS‑SoundEffect focuses on **contextual audio completion** beyond speech, enabling creators and systems to enrich scenes with believable acoustic environments and action‑level cues.
|
| 19 |
+
|
| 20 |
+
**What it can generate**
|
| 21 |
+
- **Natural environments**: e.g., “fresh snow crunching under footsteps.”
|
| 22 |
+
- **Urban environments**: e.g., “a sports car roaring past on the highway.”
|
| 23 |
+
- **Animals & creatures**: e.g., “early morning park with birds chirping in a quiet atmosphere.”
|
| 24 |
+
- **Human actions**: e.g., “clear footsteps echoing on concrete at a steady rhythm.”
|
| 25 |
+
|
| 26 |
+
**Why it matters**
|
| 27 |
+
- Completes **scene immersion** for narrative content, film/TV, documentaries, games, and podcasts.
|
| 28 |
+
- Supports **voice agents** and interactive systems that need ambient context, not just speech.
|
| 29 |
+
- Acts as the **sound‑design layer** of the MOSS‑TTS Family’s end‑to‑end workflow.
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
### 1.3 Model Architecture
|
| 34 |
+
**MOSS-SoundEffect** employs the **MossTTSDelay** architecture (see [moss_tts_delay/README.md](../moss_tts_delay/README.md)), reusing the same discrete token generation backbone for audio synthesis. A text prompt (optionally with simple control tags such as **duration**) is tokenized and fed into the Delay-pattern autoregressive model to predict **RVQ audio tokens** over time. The generated tokens are then decoded by the audio tokenizer/vocoder to produce high-fidelity sound effects, enabling consistent quality and controllable length across diverse SFX categories.
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
### 1.4 Released Models
|
| 39 |
+
**Recommended decoding hyperparameters**
|
| 40 |
+
| Model | audio_temperature | audio_top_p | audio_top_k | audio_repetition_penalty |
|
| 41 |
+
|---|---:|---:|---:|---:|
|
| 42 |
+
| **MOSS-SoundEffect** | 1.5 | 0.6 | 50 | 1.2 |
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
## 2. Quick Start
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
```python
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-SoundEffect"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

# "Rumbling thunder and pattering rain."
text_1 = "雷声隆隆,雨声淅沥。"
# "Clear footsteps echoing on concrete at a steady rhythm."
text_2 = "清晰脚步声在水泥地面回响,节奏稳定。"

conversations = [
    [processor.build_user_message(ambient_sound=text_1)],
    [processor.build_user_message(ambient_sound=text_2)]
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
```
|
| 135 |
+
|
| 136 |
+
### Input Types
|
| 137 |
+
|
| 138 |
+
**UserMessage**
|
| 139 |
+
| Field | Type | Required | Description |
|
| 140 |
+
|---|---|---:|---|
|
| 141 |
+
| `ambient_sound` | `str` | Yes | Text description of the environmental sound or sound effect to generate |
|
| 142 |
+
| `tokens` | `int` | No | Expected number of audio tokens. **1s ≈ 12.5 tokens**. |
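
The `tokens` field takes an audio-token count rather than a duration. Below is a minimal sketch of the conversion, assuming the approximate rate of 12.5 tokens per second stated in the table. The helper name `seconds_to_tokens` is illustrative and not part of the MOSS API.

```python
import math

TOKENS_PER_SECOND = 12.5  # approximate rate quoted in the table above


def seconds_to_tokens(seconds: float) -> int:
    """Convert a target duration in seconds to an approximate audio-token count."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Round up so the requested duration is not cut short.
    return math.ceil(seconds * TOKENS_PER_SECOND)


print(seconds_to_tokens(10))  # 125
```

The result could then be passed as the `tokens` argument of `processor.build_user_message`. Because the rate is approximate, the generated length should be treated as a target rather than a guarantee.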
|
docs/moss_tts_model_card.md
ADDED
|
@@ -0,0 +1,427 @@
|
| 1 |
+
# MOSS-TTS Model Card
|
| 2 |
+
|
| 3 |
+
**MOSS-TTS** is a next-generation, production-grade TTS foundation model focused on **voice cloning**, **ultra-long stable speech generation**, **token-level duration control**, **multilingual & code-switched synthesis**, and **fine-grained Pinyin/phoneme-level pronunciation control**. It is built on a clean autoregressive discrete-token recipe that emphasizes high-quality audio tokenization, large-scale diverse pre-training data, and efficient discrete token modeling.
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
## 1. Overview
|
| 8 |
+
|
| 9 |
+
### 1.1 TTS Family Positioning
|
| 10 |
+
MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is designed as a production-ready synthesis backbone that can serve as the primary high-quality engine for scalable voice applications, and as a strong research baseline for controllable TTS and discrete audio token modeling.
|
| 11 |
+
|
| 12 |
+
**Design goals**
|
| 13 |
+
- **Production readiness**: robust voice cloning with stable, on-brand speaker identity at scale
|
| 14 |
+
- **Controllability**: duration and pronunciation controls that integrate into real workflows
|
| 15 |
+
- **Long-form stability**: consistent identity and delivery for extended narration
|
| 16 |
+
- **Multilingual coverage**: multilingual and code-switched synthesis as first-class capabilities
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
### 1.2 Key Capabilities
|
| 21 |
+
|
| 22 |
+
MOSS-TTS delivers state-of-the-art quality while providing the fine-grained controllability and long-form stability required for production-grade voice applications, from zero-shot cloning and hour-long narration to token- and phoneme-level control across multilingual and code-switched speech.
|
| 23 |
+
|
| 24 |
+
* **State-of-the-art evaluation performance** — top-tier objective and subjective results across standard TTS benchmarks and in-house human preference testing, validating both fidelity and naturalness.
|
| 25 |
+
* **Zero-shot Voice Cloning (Voice Clone)** — clone a target speaker’s timbre (and part of speaking style) from short reference audio, without speaker-specific fine-tuning.
|
| 26 |
+
* **Ultra-long Speech Generation (up to 1 hour)** — support continuous long-form speech generation for up to one hour in a single run, designed for extended narration and long-session content creation.
|
| 27 |
+
* **Token-level Duration Control** — control pacing, rhythm, pauses, and speaking rate at token resolution for precise alignment and expressive delivery.
|
| 28 |
+
* **Phoneme-level Pronunciation Control** — supports:
|
| 29 |
+
|
| 30 |
+
* pure **Pinyin** input
|
| 31 |
+
* pure **IPA** phoneme input
|
| 32 |
+
* mixed **Chinese / English / Pinyin / IPA** input in any combination
|
| 33 |
+
* **Multilingual support** — high-quality multilingual synthesis with robust generalization across languages and accents.
|
| 34 |
+
* **Code-switching** — natural mixed-language generation within a single utterance (e.g., Chinese–English), with smooth transitions, consistent speaker identity, and pronunciation-aware rendering on both sides of the switch.
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
### 1.3 Model Architecture
|
| 39 |
+
|
| 40 |
+
MOSS-TTS includes **two complementary architectures**, both trained and released to explore different performance/latency tradeoffs and to support downstream research.
|
| 41 |
+
|
| 42 |
+
**Architecture A: Delay Pattern (MossTTSDelay)**
|
| 43 |
+
- Single Transformer backbone with **(n_vq + 1) heads**.
|
| 44 |
+
- Uses **delay scheduling** for multi-codebook audio tokens.
|
| 45 |
+
- Strong long-context stability, efficient inference, and production-friendly behavior.
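
Delay scheduling can be sketched as follows. This is a generic illustration of the delay pattern for multi-codebook tokens, in which codebook `k` is shifted right by `k` steps so that each backbone step emits one token per codebook. The function names, the `PAD` value, and the exact offsets are assumptions and may differ from the actual MossTTSDelay implementation.

```python
from typing import List

PAD = -1  # hypothetical placeholder id for positions that hold no token


def apply_delay(codes: List[List[int]]) -> List[List[int]]:
    """Shift codebook k right by k steps; codes[k][t] is codebook k at frame t."""
    n_vq = len(codes)
    n_frames = len(codes[0])
    delayed = []
    for k, row in enumerate(codes):
        # k pads on the left, (n_vq - 1 - k) pads on the right.
        delayed.append([PAD] * k + list(row) + [PAD] * (n_vq - 1 - k))
    return delayed


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay and recover the frame-aligned codes."""
    n_vq = len(delayed)
    n_frames = len(delayed[0]) - (n_vq - 1)
    return [row[k : k + n_frames] for k, row in enumerate(delayed)]


codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = apply_delay(codes)
for row in delayed:
    print(row)
assert undo_delay(delayed) == codes
```

With this layout, the token for codebook `k` at a given frame is predicted `k` steps after codebook 0, so later codebooks can condition on earlier ones without a separate local model.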
|
| 46 |
+
|
| 47 |
+
**Architecture B: Global Latent + Local Transformer (MossTTSLocal)**
|
| 48 |
+
- Backbone produces a **global latent** per time step.
|
| 49 |
+
- A lightweight **Local Transformer** emits a token block per step.
|
| 50 |
+
- **Streaming-friendly** with simpler alignment (no delay scheduling).
|
| 51 |
+
|
| 52 |
+
**Why train both?**
|
| 53 |
+
- **Exploration of architectural potential** and validation across multiple generation paradigms.
|
| 54 |
+
- **Different tradeoffs**: Delay pattern tends to be faster and more stable for long-form synthesis; Local is smaller and excels on objective benchmarks.
|
| 55 |
+
- **Open-source value**: two strong baselines for research, ablation, and downstream innovation.
|
| 56 |
+
|
| 57 |
+
For full details, see:
|
| 58 |
+
- **`moss_tts_delay/README.md`**
|
| 59 |
+
- **`moss_tts_local/README.md`**
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
### 1.4 Released Models
|
| 64 |
+
|
| 65 |
+
| Model | Description |
|
| 66 |
+
|---|---|
|
| 67 |
+
| **MossTTSDelay-8B** | **Recommended for production**. Faster inference, stronger long-context stability, and robust voice cloning quality. Best for large-scale deployment and long-form narration. |
|
| 68 |
+
| **MossTTSLocal-1.7B** | **Recommended for evaluation and research**. Smaller model size with SOTA objective metrics. Great for quick experiments, ablations, and academic studies. |
|
| 69 |
+
|
| 70 |
+
**Recommended decoding hyperparameters (per model)**
|
| 71 |
+
|
| 72 |
+
| Model | audio_temperature | audio_top_p | audio_top_k | audio_repetition_penalty |
|
| 73 |
+
|---|---:|---:|---:|---:|
|
| 74 |
+
| **MossTTSDelay-8B** | 1.7 | 0.8 | 25 | 1.0 |
|
| 75 |
+
| **MossTTSLocal-1.7B** | 1.0 | 0.95 | 50 | 1.1 |
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
## 2. Quick Start
|
| 81 |
+
|
| 82 |
+
> Tip: For production usage, prioritize **MossTTSDelay-8B**. The examples below use this model; **MossTTSLocal-1.7B** supports the same API, and a practical walkthrough is available in [moss_tts_local/README.md](../moss_tts_local/README.md).
|
| 83 |
+
|
| 84 |
+
MOSS-TTS provides a convenient `generate` interface for rapid usage. The examples below cover:
|
| 85 |
+
1. Direct generation (Chinese / English / Pinyin / IPA)
|
| 86 |
+
2. Voice cloning
|
| 87 |
+
3. Duration control
|
| 88 |
+
|
| 89 |
+
```python
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text_1 = "亲爱的你,\n你好呀。\n\n今天,我想用最认真、最温柔的声音,对你说一些重要的话。\n这些话,像一颗小小的星星,希望能在你的心里慢慢发光。\n\n首先,我想祝你——\n每天都能平平安安、快快乐乐。\n\n希望你早上醒来的时候,\n窗外有光,屋子里很安静,\n你的心是轻轻的,没有着急,也没有害怕。\n\n希望你吃饭的时候胃口很好,\n走路的时候脚步稳稳,\n晚上睡觉的时候,能做一个又一个甜甜的梦。\n\n我希望你能一直保持好奇心。\n对世界充满问题,\n对天空、星星、花草、书本和故事感兴趣。\n当你问“为什么”的时候,\n希望总有人愿意认真地听你说话。\n\n我也希望你学会温柔。\n温柔地对待朋友,\n温柔地对待小动物,\n也温柔地对待自己。\n\n如果有一天你犯了错,\n请不要太快责怪自己,\n因为每一个认真成长的人,\n都会在路上慢慢学会更好的方法。\n\n愿你拥有勇气。\n当你站在陌生的地方时,\n当你第一次举手发言时,\n当你遇到困难、感到害怕的时候,\n希望你能轻轻地告诉自己:\n“我可以试一试。”\n\n就算没有一次成功,也没有关系。\n失败不是坏事,\n它只是告诉你,你正在努力。\n\n我希望你学会分享快乐。\n把开心的事情告诉别人,\n把笑声送给身边的人,\n因为快乐被分享的时候,\n会变得更大、更亮。\n\n如果有一天你感到难过,\n我希望你知道——\n难过并不丢脸,\n哭泣也不是软弱。\n\n愿你能找到一个安全的地方,\n慢慢把心里的话说出来,\n然后再一次抬起头,看见希望。\n\n我还希望你能拥有梦想。\n这个梦想也许很大,\n也许很小,\n也许现在还说不清楚。\n\n没关系。\n梦想会和你一起长大,\n在时间里慢慢变得清楚。\n\n最后,我想送你一个最最重要的祝福:\n\n愿你被世界温柔对待,\n也愿你成为一个温柔的人。\n\n愿你的每一天,\n都值得被记住,\n都值得被珍惜。\n\n亲爱的你,\n请记住,\n你是独一无二的,\n你已经很棒了,\n而你的未来,\n一定会慢慢变得闪闪发光。\n\n祝你健康、勇敢、幸福,\n祝你永远带着笑容向前走。"
text_2 = "We stand on the threshold of the AI era.\nArtificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision. It has learned to see, hear, speak, and think, and is beginning to become an extension of human capabilities. AI is not about replacing humans, but about amplifying human creativity, making knowledge more equitable, more efficient, and allowing imagination to reach further. A new era, jointly shaped by humans and intelligent systems, has arrived."
text_3 = "nin2 hao3,qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4?"
text_4 = "nin2 hao3,qing4 wen3 nin2 lai2 zi4 na4 zuo3 cheng4 shi3?"
text_5 = "您好,请问您来自哪 zuo4 cheng2 shi4?"
text_6 = "/həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/"

# Use audio from ./assets/audio to avoid downloading from the cloud.
ref_audio_1 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_2 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference)
    [processor.build_user_message(text=text_1)],
    [processor.build_user_message(text=text_2)],
    # Pinyin or IPA input
    [processor.build_user_message(text=text_3)],
    [processor.build_user_message(text=text_4)],
    [processor.build_user_message(text=text_5)],
    [processor.build_user_message(text=text_6)],
    # Voice cloning (with reference)
    [processor.build_user_message(text=text_1, reference=[ref_audio_1])],
    [processor.build_user_message(text=text_2, reference=[ref_audio_2])],
    # Duration control
    [processor.build_user_message(text=text_2, tokens=325)],
    [processor.build_user_message(text=text_2, tokens=600)],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)

```
|
| 196 |
+
|
| 197 |
+
### Continuation + Voice Cloning (Prefix Audio + Text)
|
| 198 |
+
|
| 199 |
+
MOSS-TTS supports continuation-based cloning: provide a prefix audio clip in the assistant message, and make sure the **prefix transcript** is included in the text. The model continues in the same speaker identity and style.
|
| 200 |
+
|
| 201 |
+
```python
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text_1 = "亲爱的你,\n你好呀。\n\n今天,我想用最认真、最温柔的声音,对你说一些重要的话。\n这些话,像一颗小小的星星,希望能在你的心里慢慢发光。\n\n首先,我想祝你——\n每天都能平平安安、快快乐乐。\n\n希望你早上醒来的时候,\n窗外有光,屋子里很安静,\n你的心是轻轻的,没有着急,也没有害怕。\n\n希望你吃饭的时候胃口很好,\n走路的时候脚步稳稳,\n晚上睡觉的时候,能做一个又一个甜甜的梦。\n\n我希望你能一直保持好奇心。\n对世界充满问题,\n对天空、星星、花草、书本和故事感兴趣。\n当你问“为什么”的时候,\n希望总有人愿意认真地听你说话。\n\n我也希望你学会温柔。\n温柔地对待朋友,\n温柔地对待小动物,\n也温柔地对待自己。\n\n如果有一天你犯了错,\n请不要太快责怪自己,\n因为每一个认真成长的人,\n都会在路上慢慢学会更好的方法。\n\n愿你拥有勇气。\n当你站在陌生的地方时,\n当你第一次举手发言时,\n当你遇到困难、感到害怕的时候,\n希望你能轻轻地告诉自己:\n“我可以试一试。”\n\n就算没有一次成功,也没有关系。\n失败不是坏事,\n它只是告诉你,你正在努力。\n\n我希望你学会分享快乐。\n把开心的事情告诉别人,\n把笑声送给身边的人,\n因为快乐被分享的时候,\n会变得更大、更亮。\n\n如果有一天你感到难过,\n我希望你知道——\n难过并不丢脸,\n哭泣也不是软弱。\n\n愿你能找到一个安全的地方,\n慢慢把心里的话说出来,\n然后再一次抬起头,看见希望。\n\n我还希望你能拥有梦想。\n这个梦想也许很大,\n也许很小,\n也许现在还说不清楚。\n\n没关系。\n梦想会和你一起长大,\n在时间里慢慢变得清楚。\n\n最后,我想送你一个最最重要的祝福:\n\n愿你被世界温柔对待,\n也愿你成为一个温柔的人。\n\n愿你的每一天,\n都值得被记住,\n都值得被珍惜。\n\n亲爱的你,\n请记住,\n你是独一无二的,\n你已经很棒了,\n而你的未来,\n一定会慢慢变得闪闪发光。\n\n祝你健康、勇敢、幸福,\n祝你永远带着笑容向前走。"
text_2 = "We stand on the threshold of the AI era.\nArtificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision. It has learned to see, hear, speak, and think, and is beginning to become an extension of human capabilities. AI is not about replacing humans, but about amplifying human creativity, making knowledge more equitable, more efficient, and allowing imagination to reach further. A new era, jointly shaped by humans and intelligent systems, has arrived."
ref_text_1 = "太阳系八大行星之一。"
ref_text_2 = "But I really can't complain about not having a normal college experience to you."
# Use audio from ./assets/audio to avoid downloading from the cloud.
ref_audio_1 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_2 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Continuation only
    [
        processor.build_user_message(text=ref_text_1 + text_1),
        processor.build_assistant_message(audio_codes_list=[ref_audio_1])
    ],
    # Continuation with voice cloning
    [
        processor.build_user_message(text=ref_text_2 + text_2, reference=[ref_audio_2]),
        processor.build_assistant_message(audio_codes_list=[ref_audio_2])
    ],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="continuation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)

```
|
| 301 |
+
|
| 302 |
+
|
| 303 |
+
|
| 304 |
+
### Input Types
|
| 305 |
+
|
| 306 |
+
**UserMessage**
|
| 307 |
+
|
| 308 |
+
| Field | Type | Required | Description |
|
| 309 |
+
|---|---|---:|---|
|
| 310 |
+
| `text` | `str` | Yes | Text to synthesize. Supports Chinese, English, German, French, Spanish, Japanese, Korean, etc. Can mix raw text with Pinyin or IPA for pronunciation control. |
|
| 311 |
+
| `reference` | `List[str]` | No | Reference audio for voice cloning. For current MOSS-TTS, **one audio** is expected in the list. |
|
| 312 |
+
| `tokens` | `int` | No | Expected number of audio tokens. **1s ≈ 12.5 tokens**. |
|
| 313 |
+
|
| 314 |
+
**AssistantMessage**
|
| 315 |
+
|
| 316 |
+
| Field | Type | Required | Description |
|
| 317 |
+
|---|---|---:|---|
|
| 318 |
+
| `audio_codes_list` | `List[str]` | Only for continuation | Prefix audio for continuation-based cloning. Use audio file paths or URLs. |
|
| 319 |
+
|
| 320 |
+
|
| 321 |
+
|
| 322 |
+
### Generation Hyperparameters
|
| 323 |
+
|
| 324 |
+
| Parameter | Type | Default | Description |
|
| 325 |
+
|---|---|---:|---|
|
| 326 |
+
| `max_new_tokens` | `int` | — | Controls total generated audio tokens. Use duration rule: **1s ≈ 12.5 tokens**. |
|
| 327 |
+
| `audio_temperature` | `float` | 1.7 | Higher values increase variation; lower values stabilize prosody. |
|
| 328 |
+
| `audio_top_p` | `float` | 0.8 | Nucleus sampling cutoff. Lower values are more conservative. |
|
| 329 |
+
| `audio_top_k` | `int` | 25 | Top-K sampling. Lower values tighten sampling space. |
|
| 330 |
+
| `audio_repetition_penalty` | `float` | 1.0 | >1.0 discourages repeating patterns. |
|
| 331 |
+
|
| 332 |
+
> Note: MOSS-TTS is a pretrained base model and is **sensitive to decoding hyperparameters**. See **Released Models** for recommended defaults.
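
One way to keep the per-model defaults from **Released Models** in one place is a small lookup table. The sketch below is illustrative only: the helper `recommended_generate_kwargs` is hypothetical, and forwarding its result to `model.generate` assumes that `generate` accepts the parameters listed in the table above as keyword arguments.

```python
# Recommended decoding hyperparameters, copied from the Released Models table.
RECOMMENDED_DECODING = {
    "MossTTSDelay-8B": {
        "audio_temperature": 1.7,
        "audio_top_p": 0.8,
        "audio_top_k": 25,
        "audio_repetition_penalty": 1.0,
    },
    "MossTTSLocal-1.7B": {
        "audio_temperature": 1.0,
        "audio_top_p": 0.95,
        "audio_top_k": 50,
        "audio_repetition_penalty": 1.1,
    },
}


def recommended_generate_kwargs(model_name: str, **overrides) -> dict:
    """Return a copy of the recommended settings, with optional overrides applied."""
    if model_name not in RECOMMENDED_DECODING:
        raise KeyError(f"no recommended settings for {model_name!r}")
    kwargs = dict(RECOMMENDED_DECODING[model_name])
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise ValueError(f"unknown decoding parameters: {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


print(recommended_generate_kwargs("MossTTSDelay-8B", audio_temperature=1.5))
```

Under that assumption, the returned dictionary could be unpacked into the call, for example `model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=4096, **kwargs)`. Rejecting unknown keys guards against silently misspelled parameter names.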
|
| 333 |
+
|
| 334 |
+
|
| 335 |
+
|
| 336 |
+
### Pinyin Input
|
| 337 |
+
|
| 338 |
+
Use tone-numbered Pinyin such as `ni3 hao3 wo3 men1`. You can convert Chinese text with [pypinyin](https://github.com/mozillazg/python-pinyin), then adjust tones for pronunciation control.
|
| 339 |
+
|
| 340 |
+
```python
import re
from pypinyin import pinyin, Style

CN_PUNCT = r",。!?;:、()“”‘’"


def fix_punctuation_spacing(s: str) -> str:
    # Remove whitespace on either side of Chinese punctuation.
    s = re.sub(rf"\s+([{CN_PUNCT}])", r"\1", s)
    s = re.sub(rf"([{CN_PUNCT}])\s+", r"\1", s)
    return s


def zh_to_pinyin_tone3(text: str, strict: bool = True) -> str:
    result = pinyin(
        text,
        style=Style.TONE3,
        heteronym=False,
        strict=strict,
        errors="default",
    )

    s = " ".join(item[0] for item in result)
    return fix_punctuation_spacing(s)

text = zh_to_pinyin_tone3("您好,请问您来自哪座城市?")
print(text)

# Expected: nin2 hao3,qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4?
# Try: nin2 hao3,qing4 wen3 nin2 lai2 zi4 na4 zuo3 cheng4 shi3?
```
|
| 371 |
+
|
| 372 |
+
|
| 373 |
+
|
| 374 |
+
### IPA Input
|
| 375 |
+
|
| 376 |
+
Use `/.../` to wrap IPA sequences so they are distinct from normal text. You can use [DeepPhonemizer](https://github.com/spring-media/DeepPhonemizer) to convert English paragraphs or words into IPA sequences.
|
| 377 |
+
|
| 378 |
+
```python
from dp.phonemizer import Phonemizer

# Download a phonemizer checkpoint from https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_ipa_forward.pt
model_path = "<path-to-phonemizer-checkpoint>"
phonemizer = Phonemizer.from_checkpoint(model_path)

english_texts = "Hello, may I ask which city you are from?"
phoneme_outputs = phonemizer(
    english_texts,
    lang="en_us",
    batch_size=8
)
# Wrap the IPA sequence in slashes so the model treats it as phonemes.
model_input_text = f"/{phoneme_outputs}/"
print(model_input_text)

# Expected: /həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/
```
|
| 396 |
+
|
| 397 |
+
|
| 398 |
+
|
| 399 |
+
## 3. Evaluation
|
| 400 |
+
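MOSS-TTS achieves state-of-the-art results among open-source models on the zero-shot TTS benchmark Seed-TTS-eval: MossTTSLocal obtains the highest speaker similarity (SIM) among open-source systems in both English and Chinese, and both variants reach error rates competitive with leading closed-source models.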
|
| 401 |
+
|
| 402 |
+
| Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---:|:---:|---:|---:|---:|---:|
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 |
| FishAudio-S1 | 4B | ❌ | 1.72 | 62.57 | 1.22 | 72.1 |
| Seed-TTS | | ❌ | 2.25 | 76.2 | 1.12 | 79.6 |
| MiniMax-Speech | | ❌ | 1.65 | 69.2 | 0.83 | 78.3 |
| | | | | | | |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 |
| CosyVoice3 | 0.5B | ✅ | 2.02 | 71.8 | 1.16 | 78 |
| CosyVoice3 | 1.5B | ✅ | 2.22 | 72 | 1.12 | 78.1 |
| F5-TTS | 0.3B | ✅ | 2 | 67 | 1.53 | 76 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66 |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46 | 1.51 | 63.5 |
| FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 |
| Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.7 | 75.2 |
| FishAudio-S1-mini | 0.5B | ✅ | 1.94 | 55 | 1.18 | 68.5 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 |
| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.5 | 74 |
| VoxCPM | 0.5B | ✅ | 1.85 | 72.9 | **0.93** | 77.2 |
| Qwen3-TTS | 0.6B | ✅ | 1.68 | 70.39 | 1.23 | 76.4 |
| Qwen3-TTS | 1.7B | ✅ | **1.5** | 71.45 | 1.33 | 76.72 |
| | | | | | | |
| MossTTSDelay | 8B | ✅ | 1.79 | 71.46 | 1.32 | 77.05 |
| MossTTSLocal | 1.7B | ✅ | 1.85 | **73.42** | 1.2 | **78.82** |
|
docs/moss_tts_realtime_model_card.md
ADDED
|
@@ -0,0 +1,213 @@
|
| 1 |
+
# MOSS-TTS-Realtime
|
| 2 |
+
**MOSS-TTS-Realtime** is a context-aware, multi-turn streaming TTS foundation model designed for real-time voice agents.
|
| 3 |
+
It natively supports spoken interactions by conditioning speech generation on both textual and acoustic history from previous dialogue turns.
|
| 4 |
+
By tightly integrating multi-turn context modeling with low-latency streaming synthesis, MOSS-TTS-Realtime generates incremental audio responses that preserve voice consistency and discourse coherence, enabling natural, human-like conversational speech.
|
| 5 |
+
|
| 6 |
+
## 1. Overview
|
| 7 |
+
|
| 8 |
+
### 1.1 TTS Family Positioning
|
| 9 |
+
|
| 10 |
+
**MOSS-TTS-Realtime** is a high-performance, real-time speech synthesis model within the broader MOSS TTS Family. It is designed for interactive voice agents that require low-latency, continuous speech generation across multi-turn conversations. Unlike conventional streaming TTS systems that synthesize each response in isolation, MOSS-TTS-Realtime natively models dialogue context by conditioning speech generation on both textual and acoustic information from previous turns. By tightly integrating multi-turn context awareness with incremental streaming synthesis, it produces natural, coherent, and voice-consistent audio responses, enabling fluid and human-like spoken interactions for real-time applications.
|
| 11 |
+
|
| 12 |
+
**Key Capabilities**
|
| 13 |
+
* **Context-Aware & Expressive Speech Generation**: Generates expressive and coherent speech by modeling both textual and acoustic context across multiple dialogue turns.
|
| 14 |
+
|
| 15 |
+
* **High-Fidelity Voice Cloning with Multi-Turn Consistency**: Achieves exceptionally high voice similarity while maintaining strong speaker identity consistency across multiple dialogue turns.
|
| 16 |
+
|
| 17 |
+
* **Long-Context**: Supports long-range context with a maximum context length of 32K (about 40 minutes), enabling stable and consistent speech generation in extended conversations.
|
| 18 |
+
|
| 19 |
+
* **Highly Human-Like Speech with Natural Prosody**: Trained on over 2.5 million hours of single-speaker speech and more than 1 million hours of two-speaker and multi-speaker conversational data, resulting in highly natural prosody and strong human-like expressiveness.
|
| 20 |
+
|
| 21 |
+
* **Multilingual Speech Support**: Supports over 10 languages beyond Chinese and English, including Korean, Japanese, German, and French, enabling consistent and expressive speech across languages.
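
The "32K, about 40 minutes" figure in the Long-Context item can be sanity-checked with simple arithmetic. The sketch below rests on two assumptions that this card does not state: that "32K" means 32 × 1024 tokens, and that audio uses the same rate of roughly 12.5 tokens per second quoted in the MOSS-TTS and MOSS-TTSD model cards.

```python
# Assumption: same audio token rate as the "1s ≈ 12.5 tokens" rule
# quoted in the MOSS-TTS and MOSS-TTSD model cards.
TOKENS_PER_SECOND = 12.5
# Assumption: "32K" is read as 32 * 1024 tokens.
CONTEXT_TOKENS = 32 * 1024

# Upper bound on audio duration if the whole context held audio tokens.
minutes = CONTEXT_TOKENS / TOKENS_PER_SECOND / 60
print(f"{minutes:.1f} minutes")  # 43.7 minutes
```

Under these assumptions the upper bound is a little above 40 minutes; text tokens from the dialogue history also occupy the context, which would be consistent with the rounded figure above.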
|
| 22 |
+
|
| 23 |
+
### 1.2 Model Architecture
|
| 24 |
+
|
| 25 |
+

|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
### 1.3 Released Model
|
| 30 |
+
**Recommended decoding hyperparameters**
|
| 31 |
+
| Model | temperature | top_p | top_k | repetition_penalty | repetition_window |
|---|---:|---:|---:|---:|---:|
| **MOSS-TTS-Realtime** | 0.8 | 0.6 | 30 | 1.1 | 50 |
|
| 34 |
+
|
| 35 |
+
## 2. Quickstart
|
| 36 |
+
|
| 37 |
+
### Environment Setup
|
| 38 |
+
Environment setup is the same as on the MOSS-TTS main page.
|
| 39 |
+
|
| 40 |
+
#### Using Conda
|
| 41 |
+
```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```
|
| 45 |
+
|
| 46 |
+
Install all required dependencies:
|
| 47 |
+
|
| 48 |
+
```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
cd moss_tts_realtime
```
|
| 54 |
+
|
| 55 |
+
#### Using `uv`
|
| 56 |
+
```bash
# Install uv first: https://docs.astral.sh/uv/getting-started/installation/
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install --torch-backend cu128 -e .
cd moss_tts_realtime
```
|
| 65 |
+
|
| 66 |
+
### Basic Usage (Non-streaming)
|
| 67 |
+
|
| 68 |
+
```python
import importlib.util
import torch
import torchaudio
from transformers import AutoTokenizer, AutoModel
from mossttsrealtime.modeling_mossttsrealtime import MossTTSRealtime
from inferencer import MossTTSRealtimeInference

CODEC_SAMPLE_RATE = 24000

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

# Load the TTS model in the dtype selected above (bfloat16 on CUDA, float32 on CPU).
model = MossTTSRealtime.from_pretrained(
    "OpenMOSS-Team/MOSS-TTS-Realtime",
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
tokenizer = AutoTokenizer.from_pretrained("OpenMOSS-Team/MOSS-TTS-Realtime")
codec = AutoModel.from_pretrained("OpenMOSS-Team/MOSS-Audio-Tokenizer", trust_remote_code=True).eval()
codec = codec.to(device)

inferencer = MossTTSRealtimeInference(
    model,
    tokenizer,
    max_length=5000,
    codec=codec,
    codec_sample_rate=CODEC_SAMPLE_RATE,
    codec_encode_kwargs={"chunk_duration": 8},
)

text = [
    "Welcome to the world of MOSS TTS Realtime. Experience how text transforms into smooth, human-like speech in real time.",
    "MOSS TTS Realtime is a context-aware multi-turn streaming TTS, a speech generation foundation model designed for voice agents.",
]

# If you don't use reference audio, set reference_audio_path = ["", ""].
reference_audio_path = ["./audio/prompt_audio.mp3", "./audio/prompt_audio1.mp3"]

result = inferencer.generate(
    text=text,
    reference_audio_path=reference_audio_path,
    temperature=0.8,
    top_p=0.6,
    top_k=30,
    repetition_penalty=1.1,
    repetition_window=50,
    device=device,
)

# Decode each generated token sequence to a waveform and save it.
for i, generated_tokens in enumerate(result):
    output = torch.tensor(generated_tokens).to(device)
    decode_result = codec.decode(output.permute(1, 0), chunk_duration=8)
    wav = decode_result["audio"][0].cpu().detach()
    torchaudio.save(f"{i}.wav", wav, CODEC_SAMPLE_RATE)
```
|
| 133 |
+
|
| 134 |
+
### Launch the Gradio streaming demo (recommended)
|
| 135 |
+
Run the following command to try streaming output in a Gradio demo.
|
| 136 |
+
```bash
python3 app.py
```
|
| 139 |
+
|
| 140 |
+
### Single-turn Streaming Usage
|
| 141 |
+
`example_llm_stream_to_tts.py` demonstrates single-turn streaming synthesis without dialogue context:
|
| 142 |
+
```bash
python3 example_llm_stream_to_tts.py \
    --model_path OpenMOSS-Team/MOSS-TTS-Realtime \
    --codec_path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --prompt_wav ./audio/prompt_audio1.mp3
```
|
| 148 |
+
|
| 149 |
+
The key requirement is a streaming `text_deltas` source that yields incremental text chunks (e.g., vLLM streaming output, or delta text from OpenAI ChatCompletions).
|
| 150 |
+
|
| 151 |
+
```python
# Excerpt: this runs inside a generator function (note the `yield from`).
with codec.streaming(batch_size=1):
    # Feed text chunks as they arrive and emit audio incrementally.
    for delta in text_deltas:
        print(delta, end="", flush=True)
        audio_frames = session.push_text(delta)
        yield from decode_audio_frames(
            audio_frames, decoder, codebook_size, audio_eos_token
        )

    # Signal the end of the text input.
    audio_frames = session.end_text()
    yield from decode_audio_frames(
        audio_frames, decoder, codebook_size, audio_eos_token
    )

    # Drain the remaining audio frames until generation finishes.
    while True:
        audio_frames = session.drain(max_steps=1)
        if not audio_frames:
            break
        yield from decode_audio_frames(
            audio_frames, decoder, codebook_size, audio_eos_token
        )
        if session.inferencer.is_finished:
            break

    yield from flush_decoder(decoder)
```
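
For local testing without an LLM, the `text_deltas` source used above could be imitated by a plain generator. This is only a sketch: `iter_text_deltas` and its `chunk_chars` parameter are hypothetical names introduced here, and real LLM streams usually yield token-sized pieces of varying length rather than fixed-size chunks.

```python
from typing import Iterator


def iter_text_deltas(text: str, chunk_chars: int = 8) -> Iterator[str]:
    """Yield fixed-size pieces of `text`, imitating incremental LLM output."""
    for start in range(0, len(text), chunk_chars):
        yield text[start : start + chunk_chars]


# Stand-in for a vLLM or ChatCompletions delta stream.
text_deltas = iter_text_deltas("Hello! Streaming text arrives piece by piece.")
for delta in text_deltas:
    print(repr(delta))
```

Concatenating the yielded chunks reproduces the original string, so the synthesized content is the same as in the non-streaming case; only the arrival pattern differs.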
|
| 177 |
+
|
| 178 |
+
### Multi-turn streaming (KV cache reuse)
|
| 179 |
+
|
| 180 |
+
`example_multiturn_stream_to_tts.py` demonstrates multi-turn dialogue streaming with context:
|
| 181 |
+
- turn 0 resets KV cache
|
| 182 |
+
- turn 1+ reuses KV cache to carry all previous context
|
| 183 |
+
|
| 184 |
+
```bash
python3 example_multiturn_stream_to_tts.py \
    --model_path OpenMOSS-Team/MOSS-TTS-Realtime \
    --codec_path OpenMOSS-Team/MOSS-Audio-Tokenizer \
    --prompt_wav ./audio/prompt_audio1.mp3
```
|
| 190 |
+
|
| 191 |
+
---
|
| 192 |
+
|
| 193 |
+
|
| 194 |
+
## 3. Evaluation
|
| 195 |
+
MOSS-TTS-Realtime achieves state-of-the-art or near state-of-the-art performance among open-source systems on the zero-shot TTS benchmark Seed-TTS-eval, while remaining competitive with leading closed-source models.
|
| 196 |
+
|
| 197 |
+
| Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---:|:---:|---:|---:|---:|---:|
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 |
| CosyVoice3 | 0.5B | ❌ | 2.02 | 71.8 | 1.16 | 78 |
| CosyVoice3 | 1.5B | ❌ | 2.22 | 72 | 1.12 | 78.1 |
| FishAudio-S1 | 4B | ❌ | 1.72 | 62.57 | 1.22 | 72.1 |
| Seed-TTS | | ❌ | 2.25 | 76.2 | 1.12 | 79.6 |
| MiniMax-Speech | | ❌ | 1.65 | 69.2 | 0.83 | 78.3 |
| | | | | | | |
| FishAudio-S1-mini | 0.5B | ✅ | 1.94 | 55 | 1.18 | 68.5 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 |
| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.5 | 74 |
| VoxCPM | 0.5B | ✅ | 1.85 | 72.9 | 0.93 | 77.2 |
| Qwen3-TTS | 0.6B | ✅ | 1.68 | 70.39 | 1.23 | 76.4 |
| Qwen3-TTS | 1.7B | ✅ | 1.5 | 71.45 | 1.33 | 76.72 |
| **MOSS-TTS-Realtime** | 1.7B | ✅ | **1.971** | **68.9** | **1.07** | **76.7** |
|
docs/moss_ttsd_model_card.md
ADDED
|
@@ -0,0 +1,250 @@
|
| 1 |
+
# MOSS-TTSD
|
| 2 |
+
|
| 3 |
+
**MOSS-TTSD** is a long-form spoken dialogue generation model that enables highly expressive multi-party conversational speech synthesis across multiple languages. It supports continuous long-duration generation, flexible multi-speaker dialogue control, and state-of-the-art zero-shot voice cloning with only short reference audio. MOSS-TTSD is designed for real-world long-form content creation, including podcasts, audiobooks, sports and esports commentary, dubbing, crosstalk, and entertainment scenarios.
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
## 1. Overview
|
| 7 |
+
|
| 8 |
+
### 1.1 TTS Family Positioning
|
| 9 |
+
MOSS-TTSD is the Long-Form Dialogue Specialist in our open-source TTS Family. While our foundational models focus on high-fidelity single-speaker synthesis, MOSS-TTSD extends this capability into the realm of complex, multi-party interactions. It is designed to bridge the gap between distinct audio samples and cohesive, continuous conversation.
|
| 10 |
+
|
| 11 |
+
**Design Goals**
|
| 12 |
+
- **Authentic Interaction**: Capturing the natural rhythm, overlaps, and dynamics of human conversation.
|
| 13 |
+
- **Sustained Coherence**: Maintaining speaker identity and contextual consistency over extended durations (up to 1 hour).
|
| 14 |
+
- **Production Adaptability**: Serving diverse high-end scenarios from rigorous audiobook narration to dynamic sports commentary.
|
| 15 |
+
|
| 16 |
+
### 1.2 Key Capabilities
|
| 17 |
+
MOSS-TTSD transforms static text into living conversations, offering features specifically optimized for multi-speaker environments:
|
| 18 |
+
|
| 19 |
+
- **Multi-Party Conversational Generation** — Unlike traditional TTS which optimizes for reading, MOSS-TTSD masters the rhythm of conversation. It supports 1 to 5 speakers with flexible control, handling natural turn-taking, overlapping speech patterns, and distinct persona maintenance.
|
| 20 |
+
|
| 21 |
+
- **Extreme Long-Context Modeling** — Moving beyond short-sentence generation, the model is architected for stability over long durations, supporting up to 60 minutes of coherent audio in a single session without losing speaker identity or prosodic quality.
|
| 22 |
+
|
| 23 |
+
- **Diverse Scenario Adaptation** — The model is fine-tuned on high-variability scenarios to handle different speaking styles:
|
| 24 |
+
  - Conversational Media: AI Podcasts, Interviews.
  - Dynamic Commentary: High-energy Sports/Esports shouting and analysis.
  - Entertainment: Audiobooks (narrator + characters), Dubbing, and Crosstalk (Xiangsheng).
|
| 27 |
+
|
| 28 |
+
- **Multilingual & Zero-Shot Cloning** — Features state-of-the-art zero-shot voice cloning requiring only short reference audio (3-10s), with robust cross-lingual performance across major languages including Chinese, English, Japanese, and European languages.
|
| 29 |
+
|
| 30 |
+
### 1.3 Model Architecture
|
| 31 |
+
|
| 32 |
+
MOSS-TTSD is built on top of **Delay Pattern (MossTTSDelay)** from our MOSS-TTS foundation model — a single Transformer backbone with multi-head parallel prediction using delay scheduling for multi-codebook audio tokens.
|
| 33 |
+
|
| 34 |
+
For full architecture details, see **`moss_tts_delay/README.md`**.
|
| 35 |
+
|
| 36 |
+
### 1.4 Released Models
|
| 37 |
+
|
| 38 |
+
| Model | Architecture | NVQ | Parameters |
|-------|-------------|-----|------------|
| MOSS-TTSD | Delay Pattern (MossTTSDelay) | 16 | 8B |
|
| 41 |
+
|
| 42 |
+
**Recommended decoding hyperparameters**
|
| 43 |
+
|
| 44 |
+
| Model | audio_temperature | audio_top_p | audio_top_k | audio_repetition_penalty |
|---|---:|---:|---:|---:|
| **MOSS-TTSD** | 1.1 | 0.9 | 50 | 1.1 |
|
| 47 |
+
|
| 48 |
+
## 2. Quick Start
|
| 49 |
+
|
| 50 |
+
MOSS-TTSD uses a **continuation** workflow: provide reference audio for each speaker, their transcripts as a prefix, and the dialogue text to generate. The model continues in each speaker's identity.
|
| 51 |
+
|
| 52 |
+
```python
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTSD-v1.0"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

# --- Inputs ---

# Use audio from ./assets/audio to avoid downloading from the cloud.
prompt_audio_speaker1 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_02_s1.wav"
prompt_audio_speaker2 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_02_s2.wav"
prompt_text_speaker1 = "[S1] In short, we embarked on a mission to make America great again for all Americans."
prompt_text_speaker2 = "[S2] NVIDIA reinvented computing for the first time after 60 years. In fact, Erwin at IBM knows quite well that the computer has largely been the same since the 60s."

text_to_generate = "[S1] Listen, let's talk business. China. I'm hearing things. People are saying they're catching up. Fast. What's the real scoop? Their AI—is it a threat? [S2] Well, the pace of innovation there is extraordinary, honestly. They have the researchers, and they have the drive. [S1] Extraordinary? I don't like that. I want us to be extraordinary. Are they winning? [S2] I wouldn't say winning, but their progress is very promising. They are building massive clusters. They're very determined. [S1] Promising. There it is. I hate that word. When China is promising, it means we're losing. It's a disaster, Jensen. A total disaster. "

# --- Load & resample audio ---

target_sr = int(processor.model_config.sampling_rate)
wav1, sr1 = torchaudio.load(prompt_audio_speaker1)
wav2, sr2 = torchaudio.load(prompt_audio_speaker2)

# Downmix to mono, then resample to the codec's sampling rate.
if wav1.shape[0] > 1:
    wav1 = wav1.mean(dim=0, keepdim=True)
if wav2.shape[0] > 1:
    wav2 = wav2.mean(dim=0, keepdim=True)
if sr1 != target_sr:
    wav1 = torchaudio.functional.resample(wav1, sr1, target_sr)
if sr2 != target_sr:
    wav2 = torchaudio.functional.resample(wav2, sr2, target_sr)

# --- Build conversation ---

reference_audio_codes = processor.encode_audios_from_wav([wav1, wav2], sampling_rate=target_sr)
concat_prompt_wav = torch.cat([wav1, wav2], dim=-1)
prompt_audio = processor.encode_audios_from_wav([concat_prompt_wav], sampling_rate=target_sr)[0]

full_text = f"{prompt_text_speaker1} {prompt_text_speaker2} {text_to_generate}"

conversations = [
    [
        processor.build_user_message(
            text=full_text,
            reference=reference_audio_codes,
        ),
        processor.build_assistant_message(
            audio_codes_list=[prompt_audio]
        ),
    ],
]

# --- Inference ---

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="continuation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=2000,
        )

        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)

```
|
| 177 |
+
|
| 178 |
+
### Input Types
|
| 179 |
+
|
| 180 |
+
**UserMessage**
|
| 181 |
+
|
| 182 |
+
| Field | Type | Required | Description |
|---|---|---:|---|
| `text` | `str` | Yes | Full dialogue text including speaker tags (`[S1]`, `[S2]`, ...) and prompt transcripts. |
| `reference` | `List` | Yes | Per-speaker reference audio codes from `processor.encode_audios_from_wav()`. |
|
| 186 |
+
|
| 187 |
+
**AssistantMessage**
|
| 188 |
+
|
| 189 |
+
| Field | Type | Required | Description |
|---|---|---:|---|
| `audio_codes_list` | `List` | Yes | Concatenated prompt audio codes for all speakers. |
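
The `text` field expects one string with a speaker tag before each turn. A small helper like the one below could assemble it from structured turns. This is a sketch only: `format_dialogue` is a hypothetical name, not part of the MOSS-TTSD processor, and joining turns with a single space is an assumption based on the quick start example.

```python
from typing import Iterable, Tuple


def format_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_index, utterance) pairs into one tagged string."""
    parts = []
    for speaker, utterance in turns:
        if speaker < 1:
            raise ValueError("speaker indices start at 1")
        parts.append(f"[S{speaker}] {utterance.strip()}")
    return " ".join(parts)


turns = [(1, "Have you tried the new model?"), (2, "Not yet, is it good?")]
print(format_dialogue(turns))
# [S1] Have you tried the new model? [S2] Not yet, is it good?
```

In the continuation workflow, the prompt transcripts for each speaker would come first in the list, followed by the turns to generate.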
|
| 192 |
+
|
| 193 |
+
### Generation Hyperparameters
|
| 194 |
+
|
| 195 |
+
| Parameter | Type | Default | Description |
|---|---|---:|---|
| `max_new_tokens` | `int` | - | Controls total generated audio tokens. **1s ≈ 12.5 tokens**. |
| `audio_temperature` | `float` | 1.1 | Higher values increase variation; lower values stabilize prosody. |
| `audio_top_p` | `float` | 0.9 | Nucleus sampling cutoff. |
| `audio_top_k` | `int` | 50 | Top-K sampling. |
| `audio_repetition_penalty` | `float` | 1.1 | >1.0 discourages repeating patterns. |
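
To make the effect of temperature, top-k, and top-p concrete, the sketch below applies them to a toy list of logits. It illustrates the generic sampling technique, not MOSS-TTSD's internal implementation: the function name is hypothetical, the order of the filters (top-k before top-p) is an assumption that differs between libraries, and the repetition penalty is left out.

```python
import math
from typing import Dict, List


def filtered_distribution(
    logits: List[float], temperature: float, top_k: int, top_p: float
) -> Dict[int, float]:
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature: values above 1 flatten the distribution, below 1 sharpen it.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = max(scaled[i] for i in order)
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}


dist = filtered_distribution([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.8)
print(sorted(dist))  # [0, 1]
```

In this toy case, top-k drops the lowest-scoring token and top-p then drops the third, leaving two candidates; raising `top_p` or `temperature` would let more tokens through.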
|
| 202 |
+
|
| 203 |
+
|
| 204 |
+
## 3. Evaluation
|
| 205 |
+
### Objective Evaluation (TTSD-eval)
|
| 206 |
+
|
| 207 |
+
|
| 208 |
+
|
| 209 |
+
We introduce a robust evaluation framework leveraging **MMS-FA** for alignment and **wespeaker** for embedding extraction to ensure precise speaker attribution.
|
| 210 |
+
|
| 211 |
+
|
| 212 |
+
|
| 213 |
+
- **Method**: Forced-alignment based segmentation + Similarity-based speaker verification.
- **Metrics**:
  - **Speaker Attribution Accuracy (ACC)**
  - **Speaker Similarity (SIM)**
  - **Word Error Rate (WER)** computed using **Whisper-large-v3**.
- **Dataset**: 100 multi-turn dialogues (CN/EN) spanning 30s–720s. Covers diverse scenarios including Podcasts, TV dubbing, and Crosstalk.
|
| 221 |
+
|
| 222 |
+
Please refer to [TTSD-eval](https://github.com/OpenMOSS/TTSD-eval) for the code and data.
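
As a rough illustration of similarity-based speaker attribution, the sketch below assigns each segment to the reference speaker with the highest cosine similarity, then reports accuracy and mean similarity to the expected speaker. It is not the TTSD-eval implementation: the function names, the toy 2-D embeddings, and the exact definitions of ACC and SIM used here are assumptions. In practice the embeddings would come from a speaker model such as wespeaker.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def attribute_speakers(
    segments: List[Tuple[str, Sequence[float]]],
    references: Dict[str, Sequence[float]],
) -> Tuple[float, float]:
    """Return (attribution accuracy, mean similarity to the expected speaker)."""
    correct = 0
    sims = []
    for expected, emb in segments:
        scores = {spk: cosine(emb, ref) for spk, ref in references.items()}
        predicted = max(scores, key=scores.get)  # most similar reference speaker
        correct += predicted == expected
        sims.append(scores[expected])
    return correct / len(segments), sum(sims) / len(sims)


# Toy embeddings; the third segment is closer to S2 than to its expected S1.
references = {"S1": [1.0, 0.0], "S2": [0.0, 1.0]}
segments = [("S1", [0.9, 0.1]), ("S2", [0.2, 0.8]), ("S1", [0.4, 0.6])]
acc, sim = attribute_speakers(segments, references)
print(f"ACC={acc:.3f} SIM={sim:.3f}")
```

Here two of the three segments are attributed to the expected speaker, so the toy accuracy is 2/3.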
|
| 223 |
+
<br>
|
| 224 |
+
|
| 225 |
+
| Model | ZH - SIM | ZH - ACC | ZH - WER | EN - SIM | EN - ACC | EN - WER |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Comparison with Open-Source Models** | | | | | | |
| MOSS-TTSD | **0.7949** | **0.9587** | **0.0485** | **0.7326** | **0.9626** | 0.0988 |
| MOSS-TTSD v0.7 | 0.7423 | 0.9391 | 0.0517 | 0.6743 | 0.9266 | 0.1612 |
| VibeVoice 7B | 0.7590 | 0.9222 | 0.0570 | 0.7140 | 0.9554 | **0.0946** |
| VibeVoice 1.5B | 0.7415 | 0.8798 | 0.0818 | 0.6961 | 0.9353 | 0.1133 |
| FireRedTTS2 | 0.7383 | 0.9022 | 0.0768 | - | - | - |
| Higgs Audio V2 | - | - | - | 0.6860 | 0.9025 | 0.2131 |
| **Comparison with Proprietary Models** | | | | | | |
| Eleven V3 | 0.6970 | 0.9653 | **0.0363** | 0.6730 | 0.9498 | **0.0824** |
| MOSS-TTSD (elevenlabs_voice) | **0.8165** | **0.9736** | 0.0391 | **0.7304** | **0.9565** | 0.1005 |
| | | | | | | |
| gemini-2.5-pro-preview-tts | - | - | - | 0.6786 | 0.9537 | **0.0859** |
| gemini-2.5-flash-preview-tts | - | - | - | 0.7194 | 0.9511 | 0.0871 |
| MOSS-TTSD (gemini_voice) | - | - | - | **0.7893** | **0.9655** | 0.0984 |
| | | | | | | |
| Doubao_Podcast | 0.8034 | 0.9606 | **0.0472** | - | - | - |
| MOSS-TTSD (doubao_voice) | **0.8226** | **0.9630** | 0.0571 | - | - | - |
|
| 244 |
+
|
| 245 |
+
### Subjective Evaluation
|
| 246 |
+
For open-source models, annotators are asked to score each sample pair in terms of speaker attribution accuracy, voice similarity, prosody, and overall quality. Following the methodology of the LMSYS Chatbot Arena, we compute Elo ratings and confidence intervals for each dimension.
|
| 247 |
+

|
| 248 |
+
|
| 249 |
+
For closed-source models, annotators are only asked to choose the overall preferred one in each pair, and we compute the win rate accordingly.
|
| 250 |
+

|
docs/moss_voice_generator_model_card.md
ADDED
|
@@ -0,0 +1,161 @@
| 1 |
+
# MOSS-VoiceGenerator Model Card
|
| 2 |
+
|
| 3 |
+
**MOSS-VoiceGenerator** is an open-source voice generation system that creates custom speaker timbres from free-form textual descriptions. It lets users generate voices that reflect specific characters, personalities, and emotions, and it is particularly notable for natural-sounding emotional expressiveness that gives a realistic, nuanced listening experience. As an open-source tool, MOSS-VoiceGenerator is suitable for a variety of applications, such as audiobooks, game dubbing, role-playing agents, and conversational assistants.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## 1. Overview
|
| 8 |
+
|
| 9 |
+
### 1.1 TTS Family Positioning
|
| 10 |
+
|
| 11 |
+
**MOSS-VoiceGenerator** is a high-fidelity voice design tool within the broader TTS Family. It specializes in crafting expressive and natural-sounding voices from textual descriptions. Unlike traditional TTS systems relying on predefined voices or reference audio, MOSS-VoiceGenerator enables zero-shot voice design, allowing for the creation of customized voices for a variety of applications, such as characters, audiobooks, games, or virtual assistants. Additionally, it can serve as a voice design layer for other TTS systems, addressing the challenge of finding suitable reference audio and improving integration and performance.
|
| 12 |
+
|
| 13 |
+
**Key Capabilities**
|
| 14 |
+
* **Highly expressive emotional delivery**: Aimed at generating voices with dynamic and nuanced emotional performances, allowing for natural shifts in tone, pace, and emotion.
|
| 15 |
+
* **Human-like naturalness**: Indistinguishable from real human speech, with authentic breathing, pauses, and vocal nuances.
|
| 16 |
+
* **Multilingual support**: High-quality synthesis in Chinese and English.
|
| 17 |
+
|
| 18 |
+
---
|
| 19 |
+
|
| 20 |
+
### 1.2 Model Architecture
|
| 21 |
+
**MOSS-VoiceGenerator** employs the MossTTSDelay architecture (see [moss_tts_delay/README.md](../moss_tts_delay/README.md) for more details). The voice description instruction and the text to be synthesized are concatenated and jointly tokenized as the input that drives speech generation, which unifies timbre design, style control, and content synthesis in one model. Through instruction-timbre alignment, the model learns the correspondence between textual descriptions and acoustic features, so it can generate high-fidelity speech with the target timbre, emotion, and style directly from free-form text prompts, without requiring any reference audio.
|
| 22 |
+
|
| 23 |
+
### 1.3 Released Model
|
| 24 |
+
**Recommended decoding hyperparameters**
|
| 25 |
+
| Model | audio_temperature | audio_top_p | audio_top_k | audio_repetition_penalty |
|
| 26 |
+
|---|---:|---:|---:|---:|
|
| 27 |
+
| **MOSS-VoiceGenerator** | 1.5 | 0.6 | 50 | 1.1 |
|
| 28 |
+
|
| 29 |
+
---
|
| 30 |
+
|
| 31 |
+
## 2. Quick Start
|
| 32 |
+
|
| 33 |
+
```python
from pathlib import Path
import importlib.util

import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-VoiceGenerator"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32


def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    normalize_inputs=True,  # normalize text and instruction input
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)


# ====== Batch demo ======
# "Oh, my aching back. Once you get old, the body just can't keep up."
text1 = "哎呀,我的老腰啊,这年纪大了就是不行了。"
# "A tired, hoarse elderly voice complaining slowly, with slight groans."
instruction1 = "疲惫沙哑的老年声音缓慢抱怨,带有轻微呻吟。"

# "Dear viewers, today I'm going to make the legendary dragon-beard noodles
# for you. These noodles are as thin as hair and take exquisite skill to make,
# so please watch every move I make carefully."
text2 = "亲爱的观众们,今天我要为大家做一道传说中的龙须面,这道面条细如发丝,需要极其精湛的手艺才能制作成功,请大家仔细观看我的每一个动作。"
# "An enthusiastic food-show host with a lively tone, full of love for food
# and professionalism."
instruction2 = "热情的美食节目主持人,语调生动活泼,充满对美食的热爱和专业精神。"

text3 = "Hey there, stranger! What brings you to our humble town? Looking for a good drink or a tall tale?"
instruction3 = "Hearty, jovial tavern owner's voice, loud and welcoming with a slightly gruff, friendly tone in American English, radiating warmth and hospitality."

text4 = "The quick brown fox jumps over the lazy dog."
instruction4 = "Clear, neutral voice for phonetic practice, even tempo and precise articulation in standard American English, emphasizing clarity of each word."

# Each conversation is a single user message pairing text with a voice instruction.
conversations = [
    [processor.build_user_message(text=text1, instruction=instruction1)],
    [processor.build_user_message(text=text2, instruction=instruction2)],
    [processor.build_user_message(text=text3, instruction=instruction3)],
    [processor.build_user_message(text=text4, instruction=instruction4)],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )

        # Write one WAV file per generated message.
        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)

```
|
| 132 |
+
|
| 133 |
+
### Input Types
|
| 134 |
+
|
| 135 |
+
**UserMessage**
|
| 136 |
+
|
| 137 |
+
| Field | Type | Required | Description |
|
| 138 |
+
|---|---|---:|---|
|
| 139 |
+
| `text` | `str` | Yes | Text to synthesize. Supports Chinese and English. |
|
| 140 |
+
| `instruction` | `str` | Yes | Specifies the style of the synthesized speech. Users can provide detailed speech style instructions, such as emotion, speed, pitch, and voice characteristics. |
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
### Generation Hyperparameters
|
| 144 |
+
|
| 145 |
+
| Parameter | Type | Default | Description |
|
| 146 |
+
|---|---|---:|---|
|
| 147 |
+
| `audio_temperature` | `float` | 1.5 | Higher values increase variation; lower values stabilize prosody. |
|
| 148 |
+
| `audio_top_p` | `float` | 0.6 | Nucleus sampling cutoff. Lower values are more conservative. |
|
| 149 |
+
| `audio_top_k` | `int` | 50 | Top-K sampling. Lower values tighten sampling space. |
|
| 150 |
+
| `audio_repetition_penalty` | `float` | 1.1 | >1.0 discourages repeating patterns. |
|
| 151 |
+
|
| 152 |
+
> Note: MOSS-VoiceGenerator is **sensitive to decoding hyperparameters**. See **1.3 Released Model** for recommended defaults.
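To make the four knobs in the table concrete, here is a minimal, framework-free sketch of how repetition penalty, temperature, top-k, and top-p are commonly applied to one step's logits. The function names, the order of operations (which follows a widespread convention), and the toy logits are assumptions for illustration; the source does not describe how the model applies them internally.

```python
import math
import random


def filter_logits(logits, history, temperature=1.5, top_p=0.6,
                  top_k=50, repetition_penalty=1.1):
    """Return {token_id: probability} after applying the four controls."""
    logits = list(logits)
    # Repetition penalty: push already-emitted tokens toward lower scores.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Temperature: values above 1 flatten the distribution, below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (shifted by the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, w in zip(ranked, weights):
        p = w / total
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # toy scores for a 5-token vocabulary
dist = filter_logits(logits, history=[0])
token = sample(dist, random.Random(0))
sharp = filter_logits(logits, history=[], temperature=0.1)
```

With the recommended `audio_temperature` of 1.5 the distribution is flattened before the relatively tight `audio_top_p` of 0.6 trims the tail, which is one way to read the combination of defaults in the table.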
|
| 153 |
+
|
| 154 |
+
|
| 155 |
+
---
|
| 156 |
+
|
| 157 |
+
|
| 158 |
+
## 3. Performance
|
| 159 |
+
|
| 160 |
+
MOSS-VoiceGenerator demonstrates significant advantages in subjective evaluation. Using 160 internal test samples covering diverse voice styles, we evaluated three independent dimensions: (1) **Overall Preference**: which voice would you choose? (2) **Instruction Following**: which audio best follows the instructions (gender, age, tone, emotion, accent, speed)? (3) **Naturalness**: which audio sounds most like real human speech? On all three dimensions, **MOSS-VoiceGenerator outperforms the other compared TTS systems** that require no predefined voices and accept custom preview text.
|
| 161 |
+

|
moss_tts_delay/README.md
ADDED
|
@@ -0,0 +1,90 @@
| 1 |
+
# Architecture: Global Transformer + Delay-Pattern (MossTTSDelay)
|
| 2 |
+
|
| 3 |
+
This document details the **MossTTSDelay** architecture, the production-grade variant of the MOSS-TTS family. It employs a **Single Transformer** backbone with **Multi-Head Parallel Prediction** and a **Delay-Pattern** scheduling mechanism to achieve high-speed, stable, and long-form speech synthesis. The architecture diagram is shown in the figure below.
|
| 4 |
+
|
| 5 |
+
<p align="center">
|
| 6 |
+
<img src="../assets/archi_delay.png" width="60%" />
|
| 7 |
+
</p>
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## 1. Overview: Parallel Heads + Delay Pattern
|
| 12 |
+
|
| 13 |
+
Unlike the **MossTTSLocal** architecture which uses a hierarchical "Temporal + Depth" approach, **MossTTSDelay** integrates all modeling into a single large-scale Transformer. It achieves efficient multi-codebook modeling by shifting the RVQ layers in the time domain, allowing the model to predict all codebook layers for a given step simultaneously through multiple linear heads.
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
### Key Components
|
| 17 |
+
* **Unified Transformer Backbone:** A large-scale language model (based on the **Qwen-8B** scale) that handles text encoding, prosody modeling, and audio token prediction in a single forward pass.
|
| 18 |
+
* **Multi-Head Output Layer:** The backbone is equipped with **$1 + N_q$** prediction heads, where $N_q = 32$. One head manages the primary sequence logic, while the other 32 heads predict the RVQ codebook layers in parallel.
|
| 19 |
+
* **Delay-Pattern Scheduling:** A specialized data formatting technique that introduces a 1-step offset between successive RVQ layers. This enables causal dependency modeling across codebook depths without requiring an additional "Depth Transformer."
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## 2. Technical Specifications
|
| 24 |
+
|
| 25 |
+
| Feature | Specification |
|
| 26 |
+
| :--- | :--- |
|
| 27 |
+
| **Backbone Model** | Initialized from **Qwen-8B** scale |
|
| 28 |
+
| **Prediction Heads** | **33 LM Heads** (1 Main + 32 RVQ Heads) |
|
| 29 |
+
| **Audio Tokenizer** | **Cat** (Causal Audio Tokenizer) |
|
| 30 |
+
| **Sampling Rate** | 24,000 Hz |
|
| 31 |
+
| **Frame Rate** | 12.5 Hz (1 s of audio ≈ 12.5 frames) |
|
| 32 |
+
| **Codebooks** | 32 RVQ layers (10 bits each, i.e. 1,024 entries per codebook) |
|
| 33 |
+
| **Generation Mode** | Parallel Autoregressive (Delay-Pattern) |
|
| 34 |
+
| **Primary Advantage** | Inference speed & Long-context stability |
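The figures in the table imply a few derived quantities that are useful when budgeting context length. The short calculation below uses only the values listed above; the constant and function names are chosen here for illustration.

```python
import math

# Values taken from the specification table.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
NUM_CODEBOOKS = 32
BITS_PER_CODE = 10

# Each frame summarizes this many waveform samples.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # 1920.0

# A 10-bit code indexes one of 2**10 codebook entries.
codebook_size = 2 ** BITS_PER_CODE  # 1024

# Raw bitrate of the discrete audio representation.
bitrate_bps = FRAME_RATE_HZ * NUM_CODEBOOKS * BITS_PER_CODE  # 4000.0


def frames_for(seconds: float) -> int:
    """Number of audio frames needed to cover a duration."""
    return math.ceil(seconds * FRAME_RATE_HZ)


one_hour_frames = frames_for(3600)                # 45000
one_hour_codes = one_hour_frames * NUM_CODEBOOKS  # 1440000
```

One hour of audio therefore corresponds to 45,000 frames, or 1.44 million individual codes across the 32 layers.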
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
## 3. Core Mechanism: Multi-Head Parallel Prediction
|
| 39 |
+
|
| 40 |
+
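The defining characteristic of MossTTSDelay is its **computational efficiency**. By attaching 32 independent linear heads to the final hidden state of the Transformer backbone, the model emits one token for every RVQ layer in a **single forward step**. Under the delay pattern described in Section 4, those 32 tokens belong to 32 consecutive frames rather than to a single frame, but once the pipeline is full the throughput is still one complete frame per step.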
|
| 41 |
+
|
| 42 |
+
### Why this is faster than MossTTSLocal:
|
| 43 |
+
* **No Nested Loops:** While the Local architecture requires a secondary "Local Transformer" to iterate through each RVQ layer within one time step, MossTTSDelay computes all layers in parallel.
|
| 44 |
+
* **Direct Projection:** The relationship between codebook layers is captured by the backbone's internal representations and the delay-pattern, removing the latency overhead of a dedicated depth-modeling module.
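The parallel-head idea can be sketched without any tensor library: one shared hidden state is projected by every head, and each head picks its own token. The sizes below are toy values, the weights are random, and all names are hypothetical; in particular, the main head is given the same toy vocabulary as the RVQ heads purely to keep the example small.

```python
import random


def make_head(rng, vocab_size, hidden_size):
    """A linear head: one weight row per vocabulary entry (no bias)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
            for _ in range(vocab_size)]


def project(head, hidden):
    """Logits = W @ hidden for a single hidden-state vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]


def predict_all_heads(heads, hidden):
    """One shared hidden state in, one greedy token id per head out."""
    tokens = []
    for head in heads:
        logits = project(head, hidden)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


rng = random.Random(0)
HIDDEN, NUM_RVQ, CODEBOOK = 8, 32, 16  # toy sizes, not the real model's
heads = [make_head(rng, CODEBOOK, HIDDEN) for _ in range(1 + NUM_RVQ)]
hidden_state = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]
tokens = predict_all_heads(heads, hidden_state)  # 33 token ids from one state
```

The Python loop over heads is only for readability. The projections do not depend on each other, so a tensor framework can compute them as one batched matrix multiplication, which is where the speed advantage over a per-layer inner loop comes from.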
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
## 4. Prediction Topology: Delay-Pattern
|
| 49 |
+
|
| 50 |
+
To maintain the hierarchical dependency of RVQ (where Layer $k$ should ideally "see" the information from Layer $k-1$), MossTTSDelay uses **Delay-Pattern Scheduling**.
|
| 51 |
+
|
| 52 |
+
**The Pattern:**
|
| 53 |
+
At each training or inference step $t$, the input sequence is structured such that:
|
| 54 |
+
* Head 1 predicts Layer 1 of Frame $t$.
|
| 55 |
+
* Head 2 predicts Layer 2 of Frame $t-1$.
|
| 56 |
+
* Head 3 predicts Layer 3 of Frame $t-2$.
|
| 57 |
+
* ... and so on.
|
| 58 |
+
|
| 59 |
+
**Dependency Modeling:**
|
| 60 |
+
Because the Transformer is causal, every prediction at step $t$ can attend to all tokens emitted before step $t$. With the 1-step shift, Layer $k-1$ of a given frame is emitted one step earlier than Layer $k$ of that same frame, so the coarser code is already in the history when the finer one is predicted. This "diagonal" dependency effectively models the coarse-to-fine structure of the audio tokenizer.
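The shift itself is easy to state in code. The sketch below delays layer `k` by `k` steps and then undoes the shift; it uses 0-based layer indices (the text above counts from 1), and the `PAD` value used for empty slots is an assumption, since the source does not say how those positions are filled.

```python
PAD = -1  # placeholder for slots with no code yet (assumed, not from the source)


def apply_delay(codes, pad=PAD):
    """Shift layer k right by k steps.

    codes[t][k] is the code of frame t at layer k (0-based).
    The result satisfies delayed[s][k] == codes[s - k][k] when that frame exists.
    """
    num_frames, num_layers = len(codes), len(codes[0])
    steps = num_frames + num_layers - 1
    return [
        [codes[s - k][k] if 0 <= s - k < num_frames else pad
         for k in range(num_layers)]
        for s in range(steps)
    ]


def undo_delay(delayed, num_layers):
    """Invert apply_delay to recover frame-aligned codes."""
    num_frames = len(delayed) - num_layers + 1
    return [[delayed[t + k][k] for k in range(num_layers)]
            for t in range(num_frames)]


# Toy example: 4 frames, 3 layers; the code value encodes (frame, layer).
codes = [[10 * t + k for k in range(3)] for t in range(4)]
delayed = apply_delay(codes)
restored = undo_delay(delayed, num_layers=3)
```

In this toy case, step 2 of the delayed sequence holds layer 0 of frame 2, layer 1 of frame 1, and layer 2 of frame 0, matching the pattern listed above. A sequence of `T` frames with `N_q` layers takes `T + N_q - 1` steps, so the delay adds only `N_q - 1` extra steps regardless of length.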
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
## 5. Evaluation & Performance
|
| 65 |
+
|
| 66 |
+
According to `moss_tts_model_card.md`, **MossTTSDelay-8B** is the recommended model for production use and long-form stability:
|
| 67 |
+
|
| 68 |
+
| Metric | Result (Seed-TTS-Eval) |
|
| 69 |
+
| :--- | :--- |
|
| 70 |
+
| **EN SIM (Speaker Similarity)** | **0.7146** |
|
| 71 |
+
| **ZH SIM (Speaker Similarity)** | **0.7705** |
|
| 72 |
+
| **EN WER (Word Error Rate)** | **1.79%** |
|
| 73 |
+
| **ZH CER (Char Error Rate)** | **1.32%** |
|
| 74 |
+
|
| 75 |
+
**Conclusion:** MossTTSDelay offers superior long-context stability and faster inference speeds compared to the Local variant. Its 8B parameter scale provides the capacity needed for complex prosody and ultra-long (up to 1 hour) speech generation.
|
| 76 |
+
|
| 77 |
+
---
|
| 78 |
+
|
| 79 |
+
## 6. Architecture Comparison
|
| 80 |
+
|
| 81 |
+
| Aspect | MossTTSDelay (Architecture A) | MossTTSLocal (Architecture B) |
|
| 82 |
+
| :--- | :--- | :--- |
|
| 83 |
+
| **Structure** | Single Transformer (8B) | Temporal + Depth Transformers (1.7B) |
|
| 84 |
+
| **Scheduling** | **Delay-Pattern (Diagonal Shift)** | Per-step Synchronous Blocks |
|
| 85 |
+
| **Prediction Heads** | **33 Parallel Heads** | Single Latent Head + Local Module |
|
| 86 |
+
| **Inference Speed** | **High** (Parallel RVQ prediction) | Moderate (Sequential RVQ prediction) |
|
| 87 |
+
| **Stability** | Excellent for long-form (1h+) | Optimized for short-segment metrics |
|
| 88 |
+
| **Best For** | Production, Scalable Apps, Narration | Research, Quality Benchmarks |
|
| 89 |
+
|
| 90 |
+
---
|