# XTTS-v2 Romanian v2
Fine-tuned XTTS-v2 for high-quality Romanian text-to-speech with voice cloning. Achieves 12.4% WER (measured by Whisper large-v3) across 15 distinct voices trained on ~471K clips from ~470 speakers.
This is the successor to XTTS-v2 Romanian v1, with 3x more voices and dramatically more speaker diversity.
## Audio Samples
All samples below were generated with 0% WER (perfectly transcribed by Whisper large-v3).
### Original Voices (from v1 dataset)
**Costel** (male, literary narration) — 9.9% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Mărioara** (female, expressive storytelling) — 4.7% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Georgel** (male, solemn delivery) — 8.4% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Lăcrămioara** (female, clear broadcast style) — 12.0% avg WER
> "Un vultur stă pe pisc cu un pix în plisc."

**Dorel** (male, conversational) — 13.9% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."
### DS1 Voices (from romanian-speech-v2)
**Adrian** (male) — 16.0% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Ciprian** (male) — 9.5% avg WER
> "Bună ziua, mă numesc Alexandru și sunt din București."

**Mihai** (male) — 13.4% avg WER
> "Bună ziua, mă numesc Alexandra și sunt din Cluj-Napoca."

**Raluca** (female) — 14.0% avg WER
> "Fișierele și rețelele informatice sunt esențiale în științele moderne."

**Vasile** (male) — 12.4% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."
### DS2 Voices (from TTS-Romanian)
**Elena** (female) — 18.6% avg WER
> "Bună ziua, mă numesc Alexandru și sunt din București."

**Ioana** (female) — 17.6% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Ana** (female) — 14.5% avg WER
> "Ce-ntâmplare întâmplăreață s-a-ntâmplat în tâmplărie, un tâmplar din întâmplare s-a lovit cu tâmpla-n cap."

**Cristina** (female) — 11.0% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."

**Ion** (male) — 10.9% avg WER
> "Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă."
## Comparison with v1
| | v1 (Run 4, epoch 45) | v2 (Run 5, epoch 15) |
|---|---|---|
| Overall WER | 6.3% | 12.4% |
| Voices | 5 | 15 |
| Training speakers | 5 | ~470 |
| Training clips | ~62K | ~471K |
| Training hours | ~150h | ~700h+ |
| Test sentences | 10 (standard) | 9 (incl. tongue twisters) |
| Normal sentence WER | ~6% | 0-4% |
**Why is v2's overall WER higher?** The test set includes 4 tongue twisters (s05-s08) that are deliberately challenging. On normal conversational sentences (s01, s03, s04), v2 achieves 0-3% WER — comparable to or better than v1. The tongue twisters (s06 "Un vultur stă pe pisc...", s08 "Cărămidarul cărămidărește...") push the average up to 12.4%.
v2's key advantage is speaker diversity — trained on ~470 speakers vs 5, it generalizes much better to unseen voices during voice cloning.
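To see how much the tongue twisters dominate the average, the overall figure can be recomputed from the per-sentence results in the Evaluation section (a quick sketch; the numbers are the table values, and the simple mean is an assumption about how the average was taken):

```python
# Per-sentence WERs (%) from the evaluation table (Run 5, epoch 15)
wers = {
    "s01": 1.2,   # historical
    "s02": 15.6,  # technical
    "s03": 0.0,   # conversational
    "s04": 2.7,   # conversational
    "s05": 9.3,   # tongue twister
    "s06": 26.0,  # tongue twister
    "s07": 11.0,  # tongue twister
    "s08": 24.8,  # tongue twister
    "s09": 21.6,  # long passage
}
overall = sum(wers.values()) / len(wers)                           # ~12.5%
conversational = sum(wers[s] for s in ("s01", "s03", "s04")) / 3   # ~1.3%
```

The simple mean (~12.5%) lands within rounding of the reported 12.4% overall WER, while the three conversational sentences average ~1.3%.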
## Quick Start
### Installation
```bash
pip install TTS==0.22.0
```
> **Note:** TTS 0.22.0 requires patches for PyTorch 2.x compatibility and Romanian tokenizer support. See `scripts/setup_runpod.sh` for the exact patches needed.
### Download Model
```bash
git lfs install
git clone https://huggingface.co/eduardem/xtts-v2-romanian-v2
cd xtts-v2-romanian-v2
```
Or download individual files:

```python
from huggingface_hub import hf_hub_download

for fname in ["config.json", "model.pth", "dvae.pth", "mel_stats.pth", "vocab.json", "speakers_xtts.pth"]:
    hf_hub_download(repo_id="eduardem/xtts-v2-romanian-v2", filename=fname, local_dir="xtts-v2-romanian-v2")
```
### Basic Inference
```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# ---------------------------------------------------------------
# REQUIRED: Normalize cedilla -> comma-below before every inference.
# Without this, diacritics will be silently mispronounced or skipped.
# ---------------------------------------------------------------
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș (lowercase s)
    "\u0163": "\u021b",  # ţ -> ț (lowercase t)
    "\u015e": "\u0218",  # Ş -> Ș (uppercase S)
    "\u0162": "\u021a",  # Ţ -> Ț (uppercase T)
})

# Load model
config = XttsConfig()
config.load_json("xtts-v2-romanian-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts-v2-romanian-v2", use_deepspeed=False)
model.cuda()

# REQUIRED: Increase Romanian character limit (default is too low)
model.tokenizer.char_limits["ro"] = 250

# Prepare text -- always normalize!
text = "Bună ziua, mă numesc Alexandru și sunt din București."
text = text.translate(CEDILLA_TO_COMMA)

# Clone a voice from a ~6s reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["xtts-v2-romanian-v2/reference_voices/costel.wav"]
)

# Calculate max_new_tokens to prevent hallucination
word_count = len(text.split())
max_gen_tokens = max(min(word_count * 50, 500), 150)

# Generate speech
out = model.inference(
    text=text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
    enable_text_splitting=True,
    max_new_tokens=max_gen_tokens,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
### Long Text Generation

For texts longer than ~30 words, always use `enable_text_splitting=True`:
```python
long_text = """
România este o țară situată în sud-estul Europei. Capitala și cel mai mare
oraș este București. Țara are o populație de aproximativ nouăsprezece
milioane de locuitori și o suprafață de două sute patruzeci de mii de
kilometri pătrați.
""".strip().translate(CEDILLA_TO_COMMA)

out = model.inference(
    text=long_text,
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.3,
    top_p=0.7,
    top_k=30,
    length_penalty=0.8,
    repetition_penalty=10.0,
    enable_text_splitting=True,
)
```
### Trimming Trailing Silence
XTTS-v2 sometimes generates trailing silence. Use this helper to trim it:
```python
import numpy as np
import torch

def trim_trailing_silence(wav, sr=24000, threshold_db=-40, window_ms=25, margin_ms=50):
    """Trim trailing silence from generated audio."""
    if isinstance(wav, torch.Tensor):
        wav_np = wav.cpu().numpy()
    else:
        wav_np = np.array(wav)
    if wav_np.ndim > 1:
        wav_np = wav_np.squeeze()
    window = int(sr * window_ms / 1000)
    margin = int(sr * margin_ms / 1000)
    threshold = 10 ** (threshold_db / 20)
    # Scan backwards to find the last non-silent frame
    for i in range(len(wav_np) - window, 0, -window):
        rms = np.sqrt(np.mean(wav_np[i:i+window] ** 2))
        if rms > threshold:
            end = min(i + window + margin, len(wav_np))
            return torch.tensor(wav_np[:end]) if isinstance(wav, torch.Tensor) else wav_np[:end]
    return wav
```
### Voice Cloning with Your Own Voice
```python
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice.wav"]  # WAV, ~6 seconds, clear speech
)
out = model.inference(
    text="Aceasta este o propoziție de test în limba română.".translate(CEDILLA_TO_COMMA),
    language="ro",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
)
```
## Known Issue: Stop Token Behavior
XTTS-v2 has no explicit stop token for Romanian (it wasn't in the original 16-language set). This causes two potential issues:
- **Hallucination** — without `max_new_tokens`, the model may generate extra speech after the intended text ends (repeating phrases, adding unrelated words, or producing babble).
- **Truncation** — with a too-aggressive `max_new_tokens`, long texts may be cut short.
**Recommended mitigations:**
| Strategy | How |
|---|---|
| Dynamic token limit | `max_new_tokens = max(min(word_count * 50, 500), 150)` |
| Text splitting | `enable_text_splitting=True` for texts > 30 words |
| Silence trimming | Use `trim_trailing_silence()` on output |
| Duration check | Flag outputs > 2x expected duration for re-generation |
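The first and last mitigations can be sketched as two small helpers (hypothetical names; the ~2.5 words-per-second rate used to estimate expected duration is an assumption, not a measured value):

```python
def max_gen_tokens(text: str) -> int:
    """Dynamic token limit from the table above."""
    word_count = len(text.split())
    return max(min(word_count * 50, 500), 150)

def needs_regeneration(duration_s: float, text: str, wps: float = 2.5, factor: float = 2.0) -> bool:
    """Flag outputs longer than `factor` x the expected duration (assumed wps rate)."""
    expected_s = len(text.split()) / wps
    return duration_s > factor * expected_s
```

For a 10-word sentence the token cap stays at the 500 ceiling only past 10 words x 50 = 500 tokens, and a 30-second output would be flagged against an expected ~4 seconds.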
## Voices

Fifteen voices are included with reference audio clips in the `reference_voices/` directory.
### Original Dataset Voices
| Voice | Speaker ID | Gender | Description | WER |
|---|---|---|---|---|
| Costel | `speaker_male_literature` | M | Literary narration | 9.9% |
| Mărioara | `speaker_female_hp` | F | Expressive storytelling | 4.7% |
| Lăcrămioara | `speaker_female_adr` | F | Clear broadcast style | 12.0% |
| Georgel | `speaker_male_bible` | M | Solemn delivery | 8.4% |
| Dorel | `speaker_male_4` | M | Conversational | 13.9% |
### DS1 Voices (romanian-speech-v2)
| Voice | Speaker ID | Gender | WER |
|---|---|---|---|
| Adrian | `ds1_Adrian` | M | 16.0% |
| Ciprian | `ds1_Ciprian` | M | 9.5% |
| Mihai | `ds1_Mihai` | M | 13.4% |
| Raluca | `ds1_Raluca` | F | 14.0% |
| Vasile | `ds1_Vasile` | M | 12.4% |
### DS2 Voices (TTS-Romanian / datadriven-company)
| Voice | Speaker ID | Gender | WER |
|---|---|---|---|
| Elena | `ds2_cartia_100` | F | 18.6% |
| Ioana | `ds2_cartia_142` | F | 17.6% |
| Ana | `ds2_cartia_254` | F | 14.5% |
| Cristina | `ds2_cartia_486` | F | 11.0% |
| Ion | `ds2_cartia_490` | M | 10.9% |
WER is measured per-voice across 9 test sentences (including 4 tongue twisters) using Whisper large-v3.
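For reference, word-level WER is the word edit distance divided by the reference word count. A minimal sketch (the actual evaluation pipeline may additionally normalize case and punctuation before scoring, which is not shown here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that a word with dropped diacritics ("manastiri" for "mănăstiri") counts as a full substitution, which is why that error pattern weighs heavily on the scores.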
## Training

### Datasets
| Dataset | Speakers | Clips | Hours | Source |
|---|---|---|---|---|
| eduardem/romanian-speech-v2 | ~21 | ~247K | ~350h | Audiobooks, broadcasts |
| datadriven-company/TTS-Romanian | ~456 | ~264K | ~350h | Read speech corpus |
| **Combined** | ~470 | ~471K | ~700h | |
### Infrastructure

- GPU: NVIDIA RTX 5000 Ada (32 GB) on RunPod
- Training time: ~15 epochs of Phase 2
- Framework: `TTS==0.22.0` + `trainer==0.0.36`
### Two-Phase Training Strategy

**Phase 1 — Embedding Warmup (1 epoch)**

- Freeze all GPT layers; train only `text_embedding` and `text_pos_embedding` (1.4% of parameters)
- Learning rate: 1e-4
- Purpose: bring the 6 new Romanian token embeddings (ă, â, î, ș, ț, `[ro]`) from random initialization to a meaningful representation
**Phase 2 — Full GPT Fine-Tune (15 epochs)**

- Unfreeze all GPT layers
- Learning rate: 5e-6
- Batch size: 4, gradient accumulation: 63 (effective batch size = 252)
- Optimizer: AdamW (betas=0.9, 0.96)
- `text_ce_weight=0.01` (auxiliary regularizer, not primary objective)
- `gpt_use_masking_gt_prompt_approach=True` (prevents reference audio parroting)
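The effective batch size follows directly from the batch size and accumulation settings; a pure-bookkeeping sketch of how gradient accumulation reaches it (no actual training involved):

```python
batch_size = 4    # samples per forward/backward pass
accum_steps = 63  # micro-batches accumulated per optimizer step

samples_seen = 0
optimizer_steps = 0
for micro_batch in range(accum_steps * 10):  # simulate 10 optimizer steps' worth
    samples_seen += batch_size
    if (micro_batch + 1) % accum_steps == 0:
        optimizer_steps += 1                 # optimizer.step() would fire here

effective_batch = samples_seen // optimizer_steps  # 252 samples per weight update
```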
### Critical Training Parameters

These values are non-negotiable for XTTS-v2 fine-tuning:

- **Effective batch size >= 252** — the official Coqui recipe minimum. Smaller batches cause mode collapse.
- **`text_ce_weight = 0.01`** — increasing this breaks the mel/text loss balance and causes text-prediction shortcuts.
- **`gpt_use_masking_gt_prompt_approach = True`** — without this, the model learns to copy the reference audio instead of conditioning on input text.
## Evaluation

### Per-Sentence WER (run5_epoch15, averaged across 15 voices)
| Sentence | WER | Type | Text |
|---|---|---|---|
| s03 | 0.0% | Conversational | Bună ziua, mă numesc Alexandru și sunt din București. |
| s01 | 1.2% | Historical | Ștefan cel Mare a construit mănăstiri și cetăți în întreaga Moldovă. |
| s04 | 2.7% | Conversational | Bună ziua, mă numesc Alexandra și sunt din Cluj-Napoca. |
| s05 | 9.3% | Tongue twister | S-a suit capra pe piatră... |
| s07 | 11.0% | Tongue twister | Ce-ntâmplare întâmplăreață... |
| s02 | 15.6% | Technical | Fișierele și rețelele informatice sunt esențiale în științele moderne. |
| s09 | 21.6% | Long passage | Ne trebuie împăcarea generațiilor... |
| s08 | 24.8% | Tongue twister | Cărămidarul cărămidărește cu cărămida cărămidarului... |
| s06 | 26.0% | Tongue twister | Un vultur stă pe pisc cu un pix în plisc. |
### Error Patterns
The dominant error type is diacritics dropping — the model occasionally produces the correct word but without diacritics (e.g., "manastiri" instead of "mănăstiri"). This accounts for the majority of WER errors, especially on diacritics-heavy sentences. Truncation, hallucination, and repetition are rare (<10 occurrences each across 135 samples).
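One way to classify this error type automatically is a diacritics-insensitive comparison via Unicode decomposition (a sketch with hypothetical helper names; the actual error-classification script is not shown in this card):

```python
import unicodedata

def strip_diacritics(s: str) -> str:
    """Decompose to NFD and drop combining marks: 'mănăstiri' -> 'manastiri'."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

def is_diacritics_drop(ref_word: str, hyp_word: str) -> bool:
    """True when two words differ only in their diacritics."""
    return ref_word != hyp_word and strip_diacritics(ref_word) == strip_diacritics(hyp_word)
```

For example, `is_diacritics_drop("mănăstiri", "manastiri")` is true, while a genuinely different word like "cetate" vs "cetăți" is not classified as a diacritics error.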
### Checkpoint Comparison
| Checkpoint | Avg WER | Notes |
|---|---|---|
| Base model (no fine-tuning) | 84.1% | Romanian not supported |
| Run 5, epoch 5 | 20.4% | Early training |
| Run 5, epoch 10 | 14.2% | Improving |
| Run 4, epoch 45 (v1) | 13.8% | Previous best, 5 voices |
| Run 5, epoch 15 (v2) | 12.4% | Best overall, 15 voices |
## Inference Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `temperature` | 0.3 | Sampling temperature. Lower = more deterministic. |
| `top_p` | 0.7 | Nucleus sampling threshold. |
| `top_k` | 30 | Top-k sampling. |
| `length_penalty` | 0.8 | Values <1.0 discourage overly long output. |
| `repetition_penalty` | 10.0 | Penalizes repeated tokens. |
| `enable_text_splitting` | True | Split long texts at sentence boundaries. |
| `max_new_tokens` | dynamic | `max(min(word_count * 50, 500), 150)` |
## Model Files
| File | Size | Description |
|---|---|---|
| `config.json` | 4 KB | XTTS-v2 configuration |
| `model.pth` | 2.1 GB | Fine-tuned model weights (Run 5, epoch 15) |
| `dvae.pth` | 211 MB | Discrete VAE for mel-spectrogram tokenization |
| `mel_stats.pth` | 1 KB | Mel-spectrogram normalization statistics |
| `vocab.json` | 264 KB | Extended vocabulary with Romanian diacritics + `[ro]` |
| `speakers_xtts.pth` | 8 MB | Speaker embedding defaults |
| `reference_voices/` | ~5 MB | ~6s WAV clips for each of the 15 voices |
| `samples/` | ~4 MB | One generated sample per voice |
| `scripts/` | ~80 KB | Training, evaluation, and setup scripts |
## Limitations
- **Stop token issue** — Romanian was not in XTTS-v2's original language set, so there is no explicit stop token. Use `max_new_tokens` and `enable_text_splitting` to mitigate hallucination/truncation (see the Known Issue section above).
- **Cedilla normalization is mandatory** — forgetting to normalize input text will silently degrade output quality for any word containing ș or ț.
- **Tongue twisters remain challenging** — rapid repetitive syllables cause truncation or diacritics dropping (24-26% WER on the hardest test sentences).
- **Whisper-based evaluation** — WER is measured by Whisper large-v3, which may have its own biases on Romanian. No human evaluation has been conducted.
- **Library patches required** — TTS 0.22.0 needs patches for PyTorch 2.x compatibility and Romanian tokenizer support. See `scripts/setup_runpod.sh`.
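A cheap guard against the normalization pitfall is to scan input for legacy cedilla codepoints before inference (a sketch; the four codepoints mirror the `CEDILLA_TO_COMMA` table in Quick Start):

```python
# Legacy cedilla forms (ş ţ Ş Ţ) that must be mapped to comma-below characters
LEGACY_CEDILLAS = {"\u015f", "\u0163", "\u015e", "\u0162"}

def has_legacy_cedillas(text: str) -> bool:
    """True if the text still contains un-normalized cedilla characters."""
    return any(ch in LEGACY_CEDILLAS for ch in text)
```

Running this check (or simply applying `text.translate(CEDILLA_TO_COMMA)` unconditionally) before every call avoids the silent quality degradation described above.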
## License
This model is released under the Coqui Public Model License (CPML), inherited from the base XTTS-v2 model.
## Attribution
- XTTS-v2 base model: Coqui AI / TTS
- Training data: eduardem/romanian-speech-v2, datadriven-company/TTS-Romanian
- Whisper evaluation: OpenAI Whisper
- v1 model: eduardem/xtts-v2-romanian
## Citation

```bibtex
@misc{musat2026xttsromanian2,
  title={Fine-tuning XTTS-v2 for Romanian v2: Multi-Speaker Training with 470 Speakers},
  author={Musat, Eduard},
  year={2026},
  url={https://huggingface.co/eduardem/xtts-v2-romanian-v2}
}
```
## Links
- v1 model — 5 voices, 6.3% WER, 150 hours
- Training dataset (private) — ~247K clips, ~21 speakers
- TTS-Romanian dataset (public) — ~264K clips, ~456 speakers