
XTTS-v2 Romanian v2

Fine-tuned XTTS-v2 for high-quality Romanian text-to-speech with voice cloning. The model was trained on ~471K clips from ~470 speakers and achieves 12.4% average WER (measured with Whisper large-v3) across 15 distinct voices.

This is the successor to XTTS-v2 Romanian v1, with 3x more voices and dramatically more speaker diversity.

Audio Samples

All samples below were transcribed with 0% WER by Whisper large-v3; the per-voice percentages are average WER across the full test set.

Original Voices (from v1 dataset)

Costel (male, literary narration) — 9.9% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

Mărioara (female, expressive storytelling) — 4.7% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

Georgel (male, solemn delivery) — 8.4% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

Lăcrămioara (female, clear broadcast style) — 12.0% avg WER

"Un vultur stă pe pisc cu un pix în plisc."

Dorel (male, conversational) — 13.9% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

DS1 Voices (from romanian-speech-v2)

Adrian (male) — 16.0% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

Ciprian (male) — 9.5% avg WER

"Bună ziua, mă numesc Alexandru și sunt din București."

Mihai (male) — 13.4% avg WER

"Bună ziua, mă numesc Alexandra și sunt din Cluj-Napoca."

Raluca (female) — 14.0% avg WER

"Fișierele și rețelele informatice sunt esențiale în științele moderne."

Vasile (male) — 12.4% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

DS2 Voices (from TTS-Romanian)

Elena (female) — 18.6% avg WER

"Bună ziua, mă numesc Alexandru și sunt din București."

Ioana (female) — 17.6% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

Ana (female) — 14.5% avg WER

"Ce-ntâmplare întâmplăreață s-a-ntâmplat în tâmplărie, un tâmplar din întâmplare s-a lovit cu tâmpla-n cap."

Cristina (female) — 11.0% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

Ion (male) — 10.9% avg WER

"Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

Comparison with v1

| | v1 (Run 4, epoch 45) | v2 (Run 5, epoch 15) |
|---|---|---|
| Overall WER | 6.3% | 12.4% |
| Voices | 5 | 15 |
| Training speakers | 5 | ~470 |
| Training clips | ~62K | ~471K |
| Training hours | ~150h | ~700h+ |
| Test sentences | 10 (standard) | 9 (incl. tongue twisters) |
| Normal-sentence WER | ~6% | 0-4% |

Why is v2's overall WER higher? The test set includes four tongue twisters (s05-s08) that are deliberately challenging. On the conversational and historical sentences (s01, s03, s04), v2 achieves 0-3% WER, comparable to or better than v1. The tongue twisters (s06 "Un vultur stă pe pisc...", s08 "Cărămidarul cărămidărește...") push the average up to 12.4%.

v2's key advantage is speaker diversity — trained on ~470 speakers vs 5, it generalizes much better to unseen voices during voice cloning.

Quick Start

Installation

pip install TTS==0.22.0

Note: TTS 0.22.0 requires patches for PyTorch 2.x compatibility and Romanian tokenizer support. See scripts/setup_runpod.sh for the exact patches needed.

Download Model

git lfs install
git clone https://huggingface.co/eduardem/xtts-v2-romanian-v2
cd xtts-v2-romanian-v2

Or download individual files:

from huggingface_hub import hf_hub_download

for fname in ["config.json", "model.pth", "dvae.pth", "mel_stats.pth", "vocab.json", "speakers_xtts.pth"]:
    hf_hub_download(repo_id="eduardem/xtts-v2-romanian-v2", filename=fname, local_dir="xtts-v2-romanian-v2")

Basic Inference

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# ---------------------------------------------------------------
# REQUIRED: Normalize cedilla -> comma-below before every inference
# Without this, diacritics will be silently mispronounced or skipped.
# ---------------------------------------------------------------
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș  (lowercase s)
    "\u0163": "\u021b",  # ţ -> ț  (lowercase t)
    "\u015e": "\u0218",  # Ş -> Ș  (uppercase S)
    "\u0162": "\u021a",  # Ţ -> Ț  (uppercase T)
})

# Load model
config = XttsConfig()
config.load_json("xtts-v2-romanian-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts-v2-romanian-v2", use_deepspeed=False)
model.cuda()

# REQUIRED: Increase Romanian character limit (default is too low)
model.tokenizer.char_limits["ro"] = 250

# Prepare text -- always normalize!
text = "Bună ziua, mă numesc Alexandru și sunt din București."
text = text.translate(CEDILLA_TO_COMMA)

# Clone a voice from a ~6s reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["xtts-v2-romanian-v2/reference_voices/costel.wav"]
)

# Calculate max_new_tokens to prevent hallucination
word_count = len(text.split())
max_gen_tokens = max(min(word_count * 50, 500), 150)

# Generate speech
out = model.inference(
    text=text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
    enable_text_splitting=True,
    max_new_tokens=max_gen_tokens,
)

torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)

Long Text Generation

For texts longer than ~30 words, always use enable_text_splitting=True:

long_text = """
România este o țară situată în sud-estul Europei. Capitala și cel mai mare
oraș este București. Țara are o populație de aproximativ nouăsprezece
milioane de locuitori și o suprafață de două sute patruzeci de mii de
kilometri pătrați.
""".strip().translate(CEDILLA_TO_COMMA)

out = model.inference(
    text=long_text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
    enable_text_splitting=True,
)

Trimming Trailing Silence

XTTS-v2 sometimes generates trailing silence. Use this helper to trim it:

import numpy as np

def trim_trailing_silence(wav, sr=24000, threshold_db=-40, window_ms=25, margin_ms=50):
    """Trim trailing silence from generated audio."""
    if isinstance(wav, torch.Tensor):
        wav_np = wav.cpu().numpy()
    else:
        wav_np = np.array(wav)
    if wav_np.ndim > 1:
        wav_np = wav_np.squeeze()
    window = int(sr * window_ms / 1000)
    margin = int(sr * margin_ms / 1000)
    threshold = 10 ** (threshold_db / 20)
    # Scan backwards to find last non-silent frame
    for i in range(len(wav_np) - window, 0, -window):
        rms = np.sqrt(np.mean(wav_np[i:i+window] ** 2))
        if rms > threshold:
            end = min(i + window + margin, len(wav_np))
            return torch.tensor(wav_np[:end]) if isinstance(wav, torch.Tensor) else wav_np[:end]
    return wav

Voice Cloning with Your Own Voice

gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice.wav"]  # WAV, ~6 seconds, clear speech
)

out = model.inference(
    text="Aceasta este o propoziție de test în limba română.".translate(CEDILLA_TO_COMMA),
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)

Known Issue: Stop Token Behavior

XTTS-v2 has no explicit stop token for Romanian (it wasn't in the original 16-language set). This causes two potential issues:

  1. Hallucination — Without max_new_tokens, the model may generate extra speech after the intended text ends (repeating phrases, adding unrelated words, or producing babble).
  2. Truncation — With too-aggressive max_new_tokens, long texts may be cut short.

Recommended mitigations:

| Strategy | How |
|---|---|
| Dynamic token limit | max_new_tokens = max(min(word_count * 50, 500), 150) |
| Text splitting | enable_text_splitting=True for texts > 30 words |
| Silence trimming | Use trim_trailing_silence() on output |
| Duration check | Flag outputs > 2x expected duration for re-generation |
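The duration check can be sketched as follows. The speaking-rate constant (~2.5 words/second) is an assumption for illustration, not a measured property of this model; tune it for your own voices.

```python
# Hypothetical sanity check for hallucinated (overlong) outputs.
# ASSUMPTION: ~2.5 words/second is a typical Romanian speaking rate.
WORDS_PER_SECOND = 2.5

def is_suspect_duration(text: str, num_samples: int, sr: int = 24000,
                        words_per_second: float = WORDS_PER_SECOND,
                        factor: float = 2.0) -> bool:
    """Return True when the audio is more than `factor` times the
    duration we would expect for `text`, suggesting hallucination."""
    expected_s = len(text.split()) / words_per_second
    actual_s = num_samples / sr
    return actual_s > factor * expected_s
```

Flagged outputs can simply be re-generated with a different seed or a tighter max_new_tokens.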

Voices

Fifteen voices are included with reference audio clips in the reference_voices/ directory.

Original Dataset Voices

| Voice | Speaker ID | Gender | Description | WER |
|---|---|---|---|---|
| Costel | speaker_male_literature | M | Literary narration | 9.9% |
| Mărioara | speaker_female_hp | F | Expressive storytelling | 4.7% |
| Lăcrămioara | speaker_female_adr | F | Clear broadcast style | 12.0% |
| Georgel | speaker_male_bible | M | Solemn delivery | 8.4% |
| Dorel | speaker_male_4 | M | Conversational | 13.9% |

DS1 Voices (romanian-speech-v2)

| Voice | Speaker ID | Gender | WER |
|---|---|---|---|
| Adrian | ds1_Adrian | M | 16.0% |
| Ciprian | ds1_Ciprian | M | 9.5% |
| Mihai | ds1_Mihai | M | 13.4% |
| Raluca | ds1_Raluca | F | 14.0% |
| Vasile | ds1_Vasile | M | 12.4% |

DS2 Voices (TTS-Romanian / datadriven-company)

| Voice | Speaker ID | Gender | WER |
|---|---|---|---|
| Elena | ds2_cartia_100 | F | 18.6% |
| Ioana | ds2_cartia_142 | F | 17.6% |
| Ana | ds2_cartia_254 | F | 14.5% |
| Cristina | ds2_cartia_486 | F | 11.0% |
| Ion | ds2_cartia_490 | M | 10.9% |

WER is measured per-voice across 9 test sentences (including 4 tongue twisters) using Whisper large-v3.
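WER here is word-level edit distance divided by the reference length. A minimal self-contained implementation (not the project's evaluation script, which compares Whisper transcripts against the prompt text) might look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```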

Training

Datasets

| Dataset | Speakers | Clips | Hours | Source |
|---|---|---|---|---|
| eduardem/romanian-speech-v2 | ~21 | ~247K | ~350h | Audiobooks, broadcasts |
| datadriven-company/TTS-Romanian | ~456 | ~264K | ~350h | Read speech corpus |
| Combined | ~470 | ~471K | ~700h | |

Infrastructure

  • GPU: NVIDIA RTX 5000 Ada (32 GB) on RunPod
  • Training time: ~15 epochs of Phase 2
  • Framework: TTS==0.22.0 + trainer==0.0.36

Two-Phase Training Strategy

Phase 1 — Embedding Warmup (1 epoch)

  • Freeze all GPT layers; train only text_embedding and text_pos_embedding (1.4% of parameters)
  • Learning rate: 1e-4
  • Purpose: bring the 6 new Romanian token embeddings (ă, â, î, ș, ț, [ro]) from random initialization to a meaningful representation
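The selective freeze in Phase 1 can be sketched with plain PyTorch. The module names (text_embedding, text_pos_embedding) match the XTTS GPT checkpoint, but the tiny model below is a stand-in for illustration, not the real architecture:

```python
import torch.nn as nn

# Stand-in model: only the embedding tables share names with XTTS.
class TinyGPT(nn.Module):
    def __init__(self, vocab=6681, dim=64):
        super().__init__()
        self.text_embedding = nn.Embedding(vocab, dim)
        self.text_pos_embedding = nn.Embedding(512, dim)
        self.blocks = nn.Linear(dim, dim)  # placeholder for the GPT layers

model = TinyGPT()

# Phase 1: freeze everything, then re-enable only the two embeddings.
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if name.startswith(("text_embedding", "text_pos_embedding")):
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
# In the real XTTS GPT this fraction comes out to ~1.4% of parameters.
```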

Phase 2 — Full GPT Fine-Tune (15 epochs)

  • Unfreeze all GPT layers
  • Learning rate: 5e-6
  • Batch size: 4, gradient accumulation: 63 (effective batch size = 252)
  • Optimizer: AdamW (betas=0.9, 0.96)
  • text_ce_weight=0.01 (auxiliary regularizer, not primary objective)
  • gpt_use_masking_gt_prompt_approach=True (prevents reference audio parroting)
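Gradient accumulation trades memory for effective batch size: gradients from 63 micro-batches of 4 are summed before each optimizer step. A generic sketch of that schedule on synthetic data (XTTS training itself is driven by the coqui trainer package, not a hand-written loop):

```python
import torch

MICRO_BATCH, GRAD_ACCUM = 4, 63  # effective batch = 252

# Toy model standing in for the GPT; optimizer settings from the recipe.
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=5e-6, betas=(0.9, 0.96))

steps = 0
opt.zero_grad()
for i in range(GRAD_ACCUM * 2):  # two optimizer steps' worth of data
    x = torch.randn(MICRO_BATCH, 8)
    y = torch.randn(MICRO_BATCH, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / GRAD_ACCUM).backward()  # scale so summed grads average correctly
    if (i + 1) % GRAD_ACCUM == 0:
        opt.step()       # one weight update per 252 effective samples
        opt.zero_grad()
        steps += 1
```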

Critical Training Parameters

These values are non-negotiable for XTTS-v2 fine-tuning:

  • Effective batch size >= 252 — the official Coqui recipe minimum. Smaller batches cause mode collapse.
  • text_ce_weight = 0.01 — increasing this breaks the mel/text loss balance and causes text-prediction shortcuts.
  • gpt_use_masking_gt_prompt_approach = True — without this, the model learns to copy the reference audio instead of conditioning on input text.

Evaluation

Per-Sentence WER (run5_epoch15, averaged across 15 voices)

| Sentence | WER | Type | Text |
|---|---|---|---|
| s03 | 0.0% | Conversational | Bună ziua, mă numesc Alexandru și sunt din București. |
| s01 | 1.2% | Historical | Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă. |
| s04 | 2.7% | Conversational | Bună ziua, mă numesc Alexandra și sunt din Cluj-Napoca. |
| s05 | 9.3% | Tongue twister | S-a suit capra pe piatră... |
| s07 | 11.0% | Tongue twister | Ce-ntâmplare întâmplăreață... |
| s02 | 15.6% | Technical | Fișierele și rețelele informatice sunt esențiale în științele moderne. |
| s09 | 21.6% | Long passage | Ne trebuie împăcarea generațiilor... |
| s08 | 24.8% | Tongue twister | Cărămidarul cărămidărește cu cărămida cărămidarului... |
| s06 | 26.0% | Tongue twister | Un vultur stă pe pisc cu un pix în plisc. |

Error Patterns

The dominant error type is diacritic dropping: the model occasionally produces the correct word but without diacritics (e.g., "manastiri" instead of "mănăstiri"). This accounts for the majority of WER errors, especially on diacritics-heavy sentences. Truncation, hallucination, and repetition are each rare (fewer than 10 occurrences across 135 samples).
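To separate diacritic-dropping errors from other mistakes, a diacritic-insensitive comparison is useful. This sketch (not part of the released scripts) folds Romanian letters to ASCII via Unicode decomposition before comparing:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Fold diacritics (ă, â, î, ș, ț and cedilla variants) to their
    ASCII base letters by dropping combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def is_diacritic_only_error(reference: str, hypothesis: str) -> bool:
    """True when two words differ only in diacritics, e.g.
    'mănăstiri' transcribed as 'manastiri'."""
    return (reference != hypothesis
            and strip_diacritics(reference) == strip_diacritics(hypothesis))
```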

Checkpoint Comparison

| Checkpoint | Avg WER | Notes |
|---|---|---|
| Base model (no fine-tuning) | 84.1% | Romanian not supported |
| Run 5, epoch 5 | 20.4% | Early training |
| Run 5, epoch 10 | 14.2% | Improving |
| Run 4, epoch 45 (v1) | 13.8% | Previous best, 5 voices |
| Run 5, epoch 15 (v2) | 12.4% | Best overall, 15 voices |

Inference Parameters

| Parameter | Recommended | Description |
|---|---|---|
| temperature | 0.3 | Sampling temperature; lower = more deterministic. |
| top_p | 0.7 | Nucleus sampling threshold. |
| top_k | 30 | Top-k sampling. |
| length_penalty | 0.8 | Values <1.0 discourage overly long output. |
| repetition_penalty | 10.0 | Penalizes repeated tokens. |
| enable_text_splitting | True | Split long texts at sentence boundaries. |
| max_new_tokens | dynamic | max(min(word_count * 50, 500), 150) |

Model Files

| File | Size | Description |
|---|---|---|
| config.json | 4 KB | XTTS-v2 configuration |
| model.pth | 2.1 GB | Fine-tuned model weights (Run 5, epoch 15) |
| dvae.pth | 211 MB | Discrete VAE for mel-spectrogram tokenization |
| mel_stats.pth | 1 KB | Mel-spectrogram normalization statistics |
| vocab.json | 264 KB | Extended vocabulary with Romanian diacritics + [ro] |
| speakers_xtts.pth | 8 MB | Speaker embedding defaults |
| reference_voices/ | ~5 MB | ~6s WAV clips for each of the 15 voices |
| samples/ | ~4 MB | One generated sample per voice |
| scripts/ | ~80 KB | Training, evaluation, and setup scripts |

Limitations

  • Stop token issue — Romanian was not in XTTS-v2's original language set, so there is no explicit stop token. Use max_new_tokens and enable_text_splitting to mitigate hallucination/truncation (see Known Issue section above).
  • Cedilla normalization is mandatory — forgetting to normalize input text will silently degrade output quality for any word containing ș or ț.
  • Tongue twisters remain challenging — rapid repetitive syllables cause truncation or diacritics dropping (24-26% WER on the hardest test sentences).
  • Whisper-based evaluation — WER is measured by Whisper large-v3, which may have its own biases on Romanian. No human evaluation has been conducted.
  • Library patches required — TTS 0.22.0 needs patches for PyTorch 2.x compatibility and Romanian tokenizer support. See scripts/setup_runpod.sh.

License

This model is released under the Coqui Public Model License (CPML), inherited from the base XTTS-v2 model.

Citation

@misc{musat2026xttsromanian2,
  title={Fine-tuning XTTS-v2 for Romanian v2: Multi-Speaker Training with 470 Speakers},
  author={Musat, Eduard},
  year={2026},
  url={https://huggingface.co/eduardem/xtts-v2-romanian-v2}
}
