Whisper-Tiny Dutch - Mixed Synthetic Data (Mid-High Quality Filtered)

This model is a fine-tuned version of openai/whisper-tiny for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with WAVe-filtered synthetic speech data (quality threshold q ≥ 0.5).

Introduction

How the Data Was Created

The training data combines real speech from Common Voice 17.0 with synthetic speech produced and filtered in a three-stage pipeline:

  1. Transcript Generation: We used GPT-4o-mini to generate Dutch transcripts that match the word count distribution observed in Common Voice, ensuring realistic utterance lengths and diverse linguistic content.

  2. Speech Synthesis: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.

  3. Quality Filtering with WAVe: Raw synthetic speech often contains defects such as mispronunciations, omitted words, or prosodic anomalies. To address this, we applied WAVe (Word-Aligned Verification), a model that assesses audio-text alignment at the word level rather than the sentence level. WAVe uses multi-head attention to align each word to its corresponding audio frames and assigns per-word confidence scores via a GLU-based scorer. Samples scoring below the threshold (q < 0.5) were removed, retaining 30,182 high-quality synthetic samples (see the filtering sketch below).
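The dataset card does not document the name of the score field, so the following is only a minimal sketch of the threshold step, assuming per-sample WAVe scores live in a hypothetical `wave_score` column:

```python
from datasets import load_dataset

# Minimal sketch of the q >= 0.5 threshold step. The column name
# "wave_score" is hypothetical; the actual field in
# yuriyvnv/synthetic_transcript_nl may differ.
QUALITY_THRESHOLD = 0.5

synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")
filtered = synthetic.filter(lambda ex: ex["wave_score"] >= QUALITY_THRESHOLD)
print(f"Kept {len(filtered)} of {len(synthetic)} samples")
```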

How the Model Was Created

The model was fine-tuned from openai/whisper-tiny using the Hugging Face Transformers library with the following approach:

  1. Mixed Training: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with 30,182 WAVe-filtered synthetic samples (65,134 total); see the mixing sketch after this list.

  2. Optimization: Trained for 5 epochs with a learning rate of 5e-5, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

  3. Checkpoint Selection: The best checkpoint was selected based on validation loss, occurring at step 1000 with a validation loss of 0.3292.
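As a rough sketch of the mixing step (the exact preprocessing script is not published here), the two sources can be concatenated with the `datasets` library. Note that Common Voice 17.0 is gated on the Hub, and the synthetic dataset's column names are assumed to mirror Common Voice:

```python
from datasets import Audio, concatenate_datasets, load_dataset

# Column names ("audio", "sentence") follow Common Voice; the synthetic
# dataset's schema is assumed to match.
cv = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")
synth = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

# Decode both sources at the 16 kHz rate Whisper expects.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
synth = synth.cast_column("audio", Audio(sampling_rate=16_000))

mixed = concatenate_datasets(
    [cv.select_columns(["audio", "sentence"]),
     synth.select_columns(["audio", "sentence"])]
).shuffle(seed=42)
```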

This approach yields a 3.7% relative improvement on the Common Voice test set (25.05% vs 26.00% WER) over training on real data alone, while also improving cross-domain generalization on the Multilingual LibriSpeech (MLS) benchmark.

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-tiny |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 39M |
| Training Data | Common Voice 17.0 + Mid-High Quality Synthetic |
| Total Training Samples | 65,134 |
| Sampling Rate | 16 kHz |

Evaluation Results

This Model (whisper-tiny-mixed-nl)

| Metric | Value |
|---|---|
| Validation Loss | 0.3292 |
| Validation WER | 19.36% |
| Test WER (Common Voice) | 25.05% |
| Test WER (MLS) | 43.11% |
| Best Checkpoint | Step 1000 |
| Max Training Steps | 1,270 |
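For reference, WER values like those above can be computed with the `evaluate` library. This is a generic sketch, not the exact evaluation script used here:

```python
import evaluate

# Generic WER computation; the exact text normalization and split
# handling used for the reported numbers are not reproduced here.
wer_metric = evaluate.load("wer")

# In practice, `predictions` holds the model's decoded transcripts for a
# test split and `references` the ground-truth sentences.
predictions = ["dit is een voorbeeld"]
references = ["dit is een voorbeeld"]

print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.2%}")
```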

Comparison with Other Training Configurations

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 680 | 0.3382 | 19.77% | 26.00% | 44.85% |
| High-Quality Filtered + CV | 890 | 0.3323 | 19.59% | 25.51% | 43.76% |
| Mid-High Quality Filtered + CV | 1,270 | 0.3292 | 19.36% | 25.05% | 43.11% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.3207 | 19.61% | 24.93% | 43.12% |

Key Performance Highlights

  • Best Validation WER (19.36%) among all Whisper-Tiny Dutch configurations
  • Best cross-domain generalization on MLS benchmark (43.11% WER)
  • 3.7% relative improvement on Common Voice test set vs baseline (25.05% vs 26.00%)
  • 7% fewer training steps than the unfiltered configuration (1,270 vs 1,365) while achieving better cross-domain generalization

Training Data

Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript NL (q ≥ 0.5) | 30,182 | WAVe-filtered TTS audio from GPT-4o-mini transcripts |
| Total | 65,134 | |

Synthetic Data Generation Pipeline

The synthetic dataset (yuriyvnv/synthetic_transcript_nl) was generated using:

  1. Transcript Generation: GPT-4o-mini, matching Common Voice word count distribution
  2. Speech Synthesis: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
  3. Quality Filtering: WAVe model filtering at threshold q ≥ 0.5

WAVe Quality Distribution (Dutch Synthetic Data)

| Quality Level | Samples | Percentage |
|---|---|---|
| High (q ≥ 0.8) | 10,555 | 30.2% |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% |
| Low (q < 0.5), removed | 4,716 | 13.5% |

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
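These settings map onto Hugging Face `Seq2SeqTrainingArguments` roughly as sketched below. The split of the global batch size of 256 into per-device batch size and gradient accumulation is an assumption, since only the global size is reported:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the reported hyperparameters, not the original script.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-mixed-nl",
    learning_rate=5e-5,
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,   # 128 * 2 = 256 global (assumed split)
    num_train_epochs=5,
    warmup_steps=200,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)
```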

Training Infrastructure

  • GPU: NVIDIA H200 (141GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Training Curve

```
Step  100: val_loss = 0.5209
Step  250: val_loss = 0.3979
Step  500: val_loss = 0.3454
Step  750: val_loss = 0.3300
Step 1000: val_loss = 0.3292 ← Best checkpoint
Step 1250: val_loss = 0.3334
```

Usage

Transcription Pipeline

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-mixed-nl",
    device="cuda",
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
```

Direct Model Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-tiny-mixed-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-tiny-mixed-nl")
model.to("cuda")

audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

Specifying Language

```python
# Pin decoding to Dutch transcription (applies to the model loaded in
# "Direct Model Usage" above).
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
```
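When using the pipeline API from the first example instead, the same settings can be passed per call via `generate_kwargs`:

```python
result = transcriber(
    "path/to/dutch_audio.wav",
    generate_kwargs={"language": "nl", "task": "transcribe"},
)
```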

Methodology

This model leverages WAVe (Word-Aligned Verification), a word-level quality assessment method for filtering synthetic speech data. Unlike sentence-level filtering approaches, WAVe:

  • Aligns each word to its corresponding audio frames using multi-head attention
  • Assigns per-word confidence scores via a GLU-based scorer
  • Detects localized synthesis errors (mispronunciations, omitted words, prosodic anomalies)
  • Achieves 6.5% improvement over sentence-level filtering methods
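As a rough illustration only (this is not the released WAVe implementation; dimensions and layer choices are assumptions), the word-to-frame alignment and GLU-based scoring described above could be sketched in PyTorch like this:

```python
import torch
import torch.nn as nn

class WordAlignedScorer(nn.Module):
    """Illustrative sketch of the WAVe idea: attend each word embedding
    over the audio frames, then score it with a GLU-gated head."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.glu_scorer = nn.Sequential(
            nn.Linear(dim, 2 * dim),
            nn.GLU(dim=-1),   # gated linear unit: halves the last dim
            nn.Linear(dim, 1),
            nn.Sigmoid(),     # per-word confidence in [0, 1]
        )

    def forward(self, word_embs, audio_frames):
        # word_embs: (batch, n_words, dim); audio_frames: (batch, n_frames, dim)
        aligned, _ = self.cross_attn(word_embs, audio_frames, audio_frames)
        word_scores = self.glu_scorer(aligned).squeeze(-1)   # (batch, n_words)
        return word_scores, word_scores.mean(dim=-1)         # per-word, utterance q

scorer = WordAlignedScorer()
per_word_q, utterance_q = scorer(torch.randn(2, 7, 256), torch.randn(2, 300, 256))
```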

For full methodology details, see the citation below.

Limitations

  • Model capacity: Whisper-Tiny (39M params) has limited representational power
  • Domain specificity: Optimized for general Dutch; may underperform on technical domains
  • Acoustic conditions: Trained on clean speech; noise robustness not guaranteed
  • Dialect coverage: Performance may vary across Dutch regional variants

Citation

```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```

License

Apache 2.0
