# XTTS-v2 Romanian v2
Fine-tuned XTTS-v2 for high-quality Romanian text-to-speech with voice cloning. Achieves 12.4% WER (measured by Whisper large-v3) across 15 distinct voices trained on ~471K clips from ~470 speakers.
This is the successor to XTTS-v2 Romanian v1, with 3x more voices and dramatically more speaker diversity.
## Audio Samples
All samples below were generated with 0% WER (perfectly transcribed by Whisper large-v3).
### Original Voices (from v1 dataset)
**Costel** (male, literary narration) — 9.9% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Mărioara** (female, expressive storytelling) — 4.7% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Georgel** (male, solemn delivery) — 8.4% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Lăcrămioara** (female, clear broadcast style) — 12.0% avg WER
> "Un vultur stă pe pisc cu un pix în plisc."

**Dorel** (male, conversational) — 13.9% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."
### DS1 Voices (from romanian-speech-v2)
**Adrian** (male) — 16.0% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Ciprian** (male) — 9.5% avg WER
> "Bună ziua, mă numesc Alexandru și sunt din București."

**Mihai** (male) — 13.4% avg WER
> "Bună ziua, mă numesc Alexandra și sunt din Cluj-Napoca."

**Raluca** (female) — 14.0% avg WER
> "Fișierele și rețelele informatice sunt esențiale în științele moderne."

**Vasile** (male) — 12.4% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."
### DS2 Voices (from TTS-Romanian)
**Elena** (female) — 18.6% avg WER
> "Bună ziua, mă numesc Alexandru și sunt din București."

**Ioana** (female) — 17.6% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Ana** (female) — 14.5% avg WER
> "Ce-ntâmplare întâmplăreață s-a-ntâmplat în tâmplărie, un tâmplar din întâmplare s-a lovit cu tâmpla-n cap."

**Cristina** (female) — 11.0% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Ion** (male) — 10.9% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."
## Comparison with v1
| | v1 (Run 4, epoch 45) | v2 (Run 5, epoch 15) |
|---|---|---|
| Overall WER | 6.3% | 12.4% |
| Voices | 5 | 15 |
| Training speakers | 5 | ~470 |
| Training clips | ~62K | ~471K |
| Training hours | ~150h | ~700h+ |
| Test sentences | 10 (standard) | 9 (incl. tongue twisters) |
| Normal sentence WER | ~6% | 0-4% |
**Why is v2's overall WER higher?** The test set includes 4 tongue twisters (s05-s08) that are deliberately challenging. On normal conversational sentences (s01, s03, s04), v2 achieves 0-3% WER — comparable to or better than v1. The tongue twisters (s06 "Un vultur stă pe pisc...", s08 "Cărămidarul cărămidărește...") push the average up to 12.4%.
v2's key advantage is speaker diversity — trained on ~470 speakers vs 5, it generalizes much better to unseen voices during voice cloning.
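To see how much the tongue twisters dominate the average, the overall figure can be recomputed from the per-sentence results in the Evaluation section (a quick sketch; the numbers are the table values, and the simple mean is an assumption about how the average was taken):

```python
# Per-sentence WERs (%) from the evaluation table (Run 5, epoch 15)
wers = {
    "s01": 1.2,   # historical
    "s02": 15.6,  # technical
    "s03": 0.0,   # conversational
    "s04": 2.7,   # conversational
    "s05": 9.3,   # tongue twister
    "s06": 26.0,  # tongue twister
    "s07": 11.0,  # tongue twister
    "s08": 24.8,  # tongue twister
    "s09": 21.6,  # long passage
}
overall = sum(wers.values()) / len(wers)                           # ~12.5%
conversational = sum(wers[s] for s in ("s01", "s03", "s04")) / 3   # ~1.3%
```

The simple mean (~12.5%) lands within rounding of the reported 12.4% overall WER, while the three conversational sentences average ~1.3%.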
## Quick Start
### Installation
```bash
pip install TTS==0.22.0
```
> **Note:** TTS 0.22.0 requires patches for PyTorch 2.x compatibility and Romanian tokenizer support. See `scripts/setup_runpod.sh` for the exact patches needed.
### Download Model
```bash
git lfs install
git clone https://huggingface.co/eduardem/xtts-v2-romanian-v2
cd xtts-v2-romanian-v2
```
Or download individual files:

```python
from huggingface_hub import hf_hub_download

for fname in ["config.json", "model.pth", "dvae.pth", "mel_stats.pth", "vocab.json", "speakers_xtts.pth"]:
    hf_hub_download(repo_id="eduardem/xtts-v2-romanian-v2", filename=fname, local_dir="xtts-v2-romanian-v2")
```
### Basic Inference
```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# ---------------------------------------------------------------
# REQUIRED: Normalize cedilla -> comma-below before every inference.
# Without this, diacritics will be silently mispronounced or skipped.
# ---------------------------------------------------------------
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș (lowercase s)
    "\u0163": "\u021b",  # ţ -> ț (lowercase t)
    "\u015e": "\u0218",  # Ş -> Ș (uppercase S)
    "\u0162": "\u021a",  # Ţ -> Ț (uppercase T)
})

# Load model
config = XttsConfig()
config.load_json("xtts-v2-romanian-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts-v2-romanian-v2", use_deepspeed=False)
model.cuda()

# REQUIRED: Increase Romanian character limit (default is too low)
model.tokenizer.char_limits["ro"] = 250

# Prepare text -- always normalize!
text = "Bună ziua, mă numesc Alexandru și sunt din București."
text = text.translate(CEDILLA_TO_COMMA)

# Clone a voice from a ~6s reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["xtts-v2-romanian-v2/reference_voices/costel.wav"]
)

# Calculate max_new_tokens to prevent hallucination
word_count = len(text.split())
max_gen_tokens = max(min(word_count * 50, 500), 150)

# Generate speech
out = model.inference(
    text=text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
    enable_text_splitting=True,
    max_new_tokens=max_gen_tokens,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
### Long Text Generation

For texts longer than ~30 words, always use `enable_text_splitting=True`:
```python
long_text = """
România este o țară situată în sud-estul Europei. Capitala și cel mai mare
oraș este București. Țara are o populație de aproximativ nouăsprezece
milioane de locuitori și o suprafață de două sute patruzeci de mii de
kilometri pătrați.
""".strip().translate(CEDILLA_TO_COMMA)

out = model.inference(
    text=long_text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
    enable_text_splitting=True,
)
```
### Trimming Trailing Silence
XTTS-v2 sometimes generates trailing silence. Use this helper to trim it:
```python
import numpy as np
import torch

def trim_trailing_silence(wav, sr=24000, threshold_db=-40, window_ms=25, margin_ms=50):
    """Trim trailing silence from generated audio."""
    if isinstance(wav, torch.Tensor):
        wav_np = wav.cpu().numpy()
    else:
        wav_np = np.array(wav)
    if wav_np.ndim > 1:
        wav_np = wav_np.squeeze()
    window = int(sr * window_ms / 1000)
    margin = int(sr * margin_ms / 1000)
    threshold = 10 ** (threshold_db / 20)
    # Scan backwards to find the last non-silent frame
    for i in range(len(wav_np) - window, 0, -window):
        rms = np.sqrt(np.mean(wav_np[i:i+window] ** 2))
        if rms > threshold:
            end = min(i + window + margin, len(wav_np))
            return torch.tensor(wav_np[:end]) if isinstance(wav, torch.Tensor) else wav_np[:end]
    return wav
```
### Voice Cloning with Your Own Voice
```python
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice.wav"]  # WAV, ~6 seconds, clear speech
)
out = model.inference(
    text="Aceasta este o propoziție de test în limba română.".translate(CEDILLA_TO_COMMA),
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)
```
## Known Issue: Stop Token Behavior
XTTS-v2 has no explicit stop token for Romanian (it wasn't in the original 16-language set). This causes two potential issues:
- **Hallucination** — without `max_new_tokens`, the model may generate extra speech after the intended text ends (repeating phrases, adding unrelated words, or producing babble).
- **Truncation** — with a too-aggressive `max_new_tokens`, long texts may be cut short.
**Recommended mitigations:**
| Strategy | How |
|---|---|
| Dynamic token limit | `max_new_tokens = max(min(word_count * 50, 500), 150)` |
| Text splitting | `enable_text_splitting=True` for texts > 30 words |
| Silence trimming | Use `trim_trailing_silence()` on output |
| Duration check | Flag outputs > 2x expected duration for re-generation |
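The first and last mitigations can be sketched as two small helpers (hypothetical names; the ~2.5 words-per-second rate used to estimate expected duration is an assumption, not a measured value):

```python
def max_gen_tokens(text: str) -> int:
    """Dynamic token limit from the table above."""
    word_count = len(text.split())
    return max(min(word_count * 50, 500), 150)

def needs_regeneration(duration_s: float, text: str, wps: float = 2.5, factor: float = 2.0) -> bool:
    """Flag outputs longer than `factor` x the expected duration (assumed wps rate)."""
    expected_s = len(text.split()) / wps
    return duration_s > factor * expected_s
```

For a 10-word sentence the token cap stays at the 500 ceiling only past 10 words x 50 = 500 tokens, and a 30-second output would be flagged against an expected ~4 seconds.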
## Voices

Fifteen voices are included with reference audio clips in the `reference_voices/` directory.
### Original Dataset Voices
| Voice | Speaker ID | Gender | Description | WER |
|---|---|---|---|---|
| Costel | `speaker_male_literature` | M | Literary narration | 9.9% |
| Mărioara | `speaker_female_hp` | F | Expressive storytelling | 4.7% |
| Lăcrămioara | `speaker_female_adr` | F | Clear broadcast style | 12.0% |
| Georgel | `speaker_male_bible` | M | Solemn delivery | 8.4% |
| Dorel | `speaker_male_4` | M | Conversational | 13.9% |
### DS1 Voices (romanian-speech-v2)
| Voice | Speaker ID | Gender | WER |
|---|---|---|---|
| Adrian | `ds1_Adrian` | M | 16.0% |
| Ciprian | `ds1_Ciprian` | M | 9.5% |
| Mihai | `ds1_Mihai` | M | 13.4% |
| Raluca | `ds1_Raluca` | F | 14.0% |
| Vasile | `ds1_Vasile` | M | 12.4% |
### DS2 Voices (TTS-Romanian / datadriven-company)
| Voice | Speaker ID | Gender | WER |
|---|---|---|---|
| Elena | `ds2_cartia_100` | F | 18.6% |
| Ioana | `ds2_cartia_142` | F | 17.6% |
| Ana | `ds2_cartia_254` | F | 14.5% |
| Cristina | `ds2_cartia_486` | F | 11.0% |
| Ion | `ds2_cartia_490` | M | 10.9% |
WER is measured per-voice across 9 test sentences (including 4 tongue twisters) using Whisper large-v3.
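For reference, word-level WER is the word edit distance divided by the reference word count. A minimal sketch (the actual evaluation pipeline may additionally normalize case and punctuation before scoring, which is not shown here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that a word with dropped diacritics ("manastiri" for "mănăstiri") counts as a full substitution, which is why that error pattern weighs heavily on the scores.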
## Training

### Datasets
| Dataset | Speakers | Clips | Hours | Source |
|---|---|---|---|---|
| eduardem/romanian-speech-v2 | ~21 | ~247K | ~350h | Audiobooks, broadcasts |
| datadriven-company/TTS-Romanian | ~456 | ~264K | ~350h | Read speech corpus |
| **Combined** | ~470 | ~471K | ~700h | |
### Infrastructure

- GPU: NVIDIA RTX 5000 Ada (32 GB) on RunPod
- Training time: ~15 epochs of Phase 2
- Framework: `TTS==0.22.0` + `trainer==0.0.36`
### Two-Phase Training Strategy

**Phase 1 — Embedding Warmup (1 epoch)**

- Freeze all GPT layers; train only `text_embedding` and `text_pos_embedding` (1.4% of parameters)
- Learning rate: 1e-4
- Purpose: bring the 6 new Romanian token embeddings (ă, â, î, ș, ț, `[ro]`) from random initialization to a meaningful representation
**Phase 2 — Full GPT Fine-Tune (15 epochs)**

- Unfreeze all GPT layers
- Learning rate: 5e-6
- Batch size: 4, gradient accumulation: 63 (effective batch size = 252)
- Optimizer: AdamW (betas=0.9, 0.96)
- `text_ce_weight=0.01` (auxiliary regularizer, not primary objective)
- `gpt_use_masking_gt_prompt_approach=True` (prevents reference audio parroting)
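The effective batch size follows directly from the batch size and accumulation settings; a pure-bookkeeping sketch of how gradient accumulation reaches it (no actual training involved):

```python
batch_size = 4    # samples per forward/backward pass
accum_steps = 63  # micro-batches accumulated per optimizer step

samples_seen = 0
optimizer_steps = 0
for micro_batch in range(accum_steps * 10):  # simulate 10 optimizer steps' worth
    samples_seen += batch_size
    if (micro_batch + 1) % accum_steps == 0:
        optimizer_steps += 1                 # optimizer.step() would fire here

effective_batch = samples_seen // optimizer_steps  # 252 samples per weight update
```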
### Critical Training Parameters

These values are non-negotiable for XTTS-v2 fine-tuning:

- **Effective batch size >= 252** — the official Coqui recipe minimum. Smaller batches cause mode collapse.
- **`text_ce_weight = 0.01`** — increasing this breaks the mel/text loss balance and causes text-prediction shortcuts.
- **`gpt_use_masking_gt_prompt_approach = True`** — without this, the model learns to copy the reference audio instead of conditioning on input text.
## Evaluation

### Per-Sentence WER (run5_epoch15, averaged across 15 voices)
| Sentence | WER | Type | Text |
|---|---|---|---|
| s03 | 0.0% | Conversational | Bună ziua, mă numesc Alexandru și sunt din București. |
| s01 | 1.2% | Historical | Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă. |
| s04 | 2.7% | Conversational | Bună ziua, mă numesc Alexandra și sunt din Cluj-Napoca. |
| s05 | 9.3% | Tongue twister | S-a suit capra pe piatră... |
| s07 | 11.0% | Tongue twister | Ce-ntâmplare întâmplăreață... |
| s02 | 15.6% | Technical | Fișierele și rețelele informatice sunt esențiale în științele moderne. |
| s09 | 21.6% | Long passage | Ne trebuie împăcarea generațiilor... |
| s08 | 24.8% | Tongue twister | Cărămidarul cărămidărește cu cărămida cărămidarului... |
| s06 | 26.0% | Tongue twister | Un vultur stă pe pisc cu un pix în plisc. |
### Error Patterns
The dominant error type is diacritics dropping — the model occasionally produces the correct word but without diacritics (e.g., "manastiri" instead of "mănăstiri"). This accounts for the majority of WER errors, especially on diacritics-heavy sentences. Truncation, hallucination, and repetition are rare (<10 occurrences each across 135 samples).
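One way to classify this error type automatically is a diacritics-insensitive comparison via Unicode decomposition (a sketch with hypothetical helper names; the actual error-classification script is not shown in this card):

```python
import unicodedata

def strip_diacritics(s: str) -> str:
    """Decompose to NFD and drop combining marks: 'mănăstiri' -> 'manastiri'."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

def is_diacritics_drop(ref_word: str, hyp_word: str) -> bool:
    """True when two words differ only in their diacritics."""
    return ref_word != hyp_word and strip_diacritics(ref_word) == strip_diacritics(hyp_word)
```

For example, `is_diacritics_drop("mănăstiri", "manastiri")` is true, while a genuinely different word like "cetate" vs "cetăți" is not classified as a diacritics error.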
### Checkpoint Comparison
| Checkpoint | Avg WER | Notes |
|---|---|---|
| Base model (no fine-tuning) | 84.1% | Romanian not supported |
| Run 5, epoch 5 | 20.4% | Early training |
| Run 5, epoch 10 | 14.2% | Improving |
| Run 4, epoch 45 (v1) | 13.8% | Previous best, 5 voices |
| Run 5, epoch 15 (v2) | 12.4% | Best overall, 15 voices |
## Inference Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `temperature` | 0.3 | Sampling temperature. Lower = more deterministic. |
| `top_p` | 0.7 | Nucleus sampling threshold. |
| `top_k` | 30 | Top-k sampling. |
| `length_penalty` | 0.8 | Values <1.0 discourage overly long output. |
| `repetition_penalty` | 10.0 | Penalizes repeated tokens. |
| `enable_text_splitting` | True | Split long texts at sentence boundaries. |
| `max_new_tokens` | dynamic | `max(min(word_count * 50, 500), 150)` |
## Model Files
| File | Size | Description |
|---|---|---|
| `config.json` | 4 KB | XTTS-v2 configuration |
| `model.pth` | 2.1 GB | Fine-tuned model weights (Run 5, epoch 15) |
| `dvae.pth` | 211 MB | Discrete VAE for mel-spectrogram tokenization |
| `mel_stats.pth` | 1 KB | Mel-spectrogram normalization statistics |
| `vocab.json` | 264 KB | Extended vocabulary with Romanian diacritics + `[ro]` |
| `speakers_xtts.pth` | 8 MB | Speaker embedding defaults |
| `reference_voices/` | ~5 MB | ~6s WAV clips for each of the 15 voices |
| `samples/` | ~4 MB | One generated sample per voice |
| `scripts/` | ~80 KB | Training, evaluation, and setup scripts |
## Limitations
- **Stop token issue** — Romanian was not in XTTS-v2's original language set, so there is no explicit stop token. Use `max_new_tokens` and `enable_text_splitting` to mitigate hallucination/truncation (see the Known Issue section above).
- **Cedilla normalization is mandatory** — forgetting to normalize input text will silently degrade output quality for any word containing ș or ț.
- **Tongue twisters remain challenging** — rapid repetitive syllables cause truncation or diacritics dropping (24-26% WER on the hardest test sentences).
- **Whisper-based evaluation** — WER is measured by Whisper large-v3, which may have its own biases on Romanian. No human evaluation has been conducted.
- **Library patches required** — TTS 0.22.0 needs patches for PyTorch 2.x compatibility and Romanian tokenizer support. See `scripts/setup_runpod.sh`.
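A cheap guard against the normalization pitfall is to scan input for legacy cedilla codepoints before inference (a sketch; the four codepoints mirror the `CEDILLA_TO_COMMA` table in Quick Start):

```python
# Legacy cedilla forms (ş ţ Ş Ţ) that must be mapped to comma-below characters
LEGACY_CEDILLAS = {"\u015f", "\u0163", "\u015e", "\u0162"}

def has_legacy_cedillas(text: str) -> bool:
    """True if the text still contains un-normalized cedilla characters."""
    return any(ch in LEGACY_CEDILLAS for ch in text)
```

Running this check (or simply applying `text.translate(CEDILLA_TO_COMMA)` unconditionally) before every call avoids the silent quality degradation described above.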
## License
This model is released under the Coqui Public Model License (CPML), inherited from the base XTTS-v2 model.
## Attribution
- XTTS-v2 base model: Coqui AI / TTS
- Training data: eduardem/romanian-speech-v2, datadriven-company/TTS-Romanian
- Whisper evaluation: OpenAI Whisper
- v1 model: eduardem/xtts-v2-romanian
## Citation

```bibtex
@misc{musat2026xttsromanian2,
  title={Fine-tuning XTTS-v2 for Romanian v2: Multi-Speaker Training with 470 Speakers},
  author={Musat, Eduard},
  year={2026},
  url={https://huggingface.co/eduardem/xtts-v2-romanian-v2}
}
```
## Links
- v1 model — 5 voices, 6.3% WER, 150 hours
- Training dataset (private) — ~247K clips, ~21 speakers
- TTS-Romanian dataset (public) — ~264K clips, ~456 speakers