Whisper-Tiny Dutch - Common Voice Only (Baseline)

This model is a fine-tuned version of openai/whisper-tiny for Dutch automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Dutch without any synthetic data augmentation, serving as a baseline for comparison with synthetic-augmented models.

Introduction

Purpose

This model serves as the baseline for evaluating the effectiveness of synthetic data augmentation in Dutch ASR. By training only on real speech data from Common Voice 17.0, we establish reference performance metrics against which synthetic-augmented models can be compared.

How the Model Was Created

The model was fine-tuned from openai/whisper-tiny using the Hugging Face Transformers library:

  1. Training Data: 34,952 real speech samples from Common Voice 17.0 Dutch (train split).

  2. Optimization: Trained for 5 epochs with a learning rate of 5e-5, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

  3. Checkpoint Selection: The checkpoint with the lowest validation loss (0.3365, reached at step 550) was selected as the final model.

This baseline achieves 26.00% WER on the Common Voice test set, which synthetic-augmented models improve upon by up to 4.1% relative.
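As a rough cross-check of the figures above, the number of optimizer steps can be derived from the dataset size and the global batch size. The sketch below assumes that only full batches count as optimizer steps within an epoch; the card does not state this, so treat it as an assumption. Under that assumption the total comes to 680, which agrees with the Max Training Steps value reported in the evaluation results.

```python
# Figures taken from the model card.
train_samples = 34_952
global_batch_size = 256
epochs = 5
best_step = 550

# Assumption: a trailing partial batch does not produce an optimizer step.
steps_per_epoch = train_samples // global_batch_size
total_steps = steps_per_epoch * epochs

# Approximate position of the best checkpoint, measured in epochs.
best_epoch = best_step / steps_per_epoch

print(steps_per_epoch, total_steps, round(best_epoch, 2))  # 136 680 4.04
```

Read this way, the best checkpoint falls just after the start of the fifth epoch.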

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-tiny |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 39M |
| Training Data | Common Voice 17.0 Dutch only |
| Total Training Samples | 34,952 |
| Sampling Rate | 16 kHz |

Evaluation Results

This Model (whisper-tiny-cv-only-nl)

| Metric | Value |
|---|---|
| Validation Loss | 0.3382 |
| Validation WER | 19.77% |
| Test WER (Common Voice) | 26.00% |
| Test WER (MLS) | 44.85% |
| Best Checkpoint Step | 550 |
| Max Training Steps | 680 |
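For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The function below is a minimal sketch of that standard definition. The card does not say which WER implementation or text normalization produced the reported numbers, so the function name and the Dutch example sentences are purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six reference words is dropped.
example_wer = word_error_rate("de kat zit op de mat", "de kat zit op mat")
print(f"{100 * example_wer:.2f}%")  # 16.67%
```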

Comparison with Synthetic-Augmented Models

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 680 | 0.3382 | 19.77% | 26.00% | 44.85% |
| High-Quality Filtered + CV | 890 | 0.3323 | 19.59% | 25.51% | 43.76% |
| Mid-High Quality Filtered + CV | 1,270 | 0.3292 | 19.36% | 25.05% | 43.11% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.3207 | 19.61% | 24.93% | 43.12% |

Key Observations

  • Baseline performance: 26.00% Test WER on Common Voice, 44.85% on Multilingual LibriSpeech (MLS)
  • Fastest training: Only 680 max steps (smallest dataset)
  • Room for improvement: Synthetic augmentation reduces Common Voice Test WER by up to 1.07 percentage points (4.1% relative)
  • Cross-domain gap: The 18.85-point difference between Common Voice and MLS Test WER highlights domain mismatch
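The absolute and relative figures quoted in these observations follow directly from the comparison table. The short calculation below reproduces them from the reported Test WER values; it is only an arithmetic check and uses no data beyond what the table lists.

```python
# Common Voice Test WER (%) for each training configuration, from the table.
test_wer_cv = {
    "Common Voice Only": 26.00,
    "High-Quality Filtered + CV": 25.51,
    "Mid-High Quality Filtered + CV": 25.05,
    "All Synthetic + CV (Unfiltered)": 24.93,
}

baseline = test_wer_cv["Common Voice Only"]
best = min(test_wer_cv.values())

absolute_gain = baseline - best                 # percentage points
relative_gain = 100 * absolute_gain / baseline  # percent of the baseline WER

# Cross-domain gap of the baseline: MLS Test WER minus Common Voice Test WER.
cross_domain_gap = 44.85 - baseline

print(f"{absolute_gain:.2f} {relative_gain:.1f} {cross_domain_gap:.2f}")
# 1.07 4.1 18.85
```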

Training Data

Dataset

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real speech from Mozilla's crowdsourced dataset |

Common Voice 17.0 Dutch contains crowdsourced voice recordings from volunteer contributors reading text prompts. The dataset provides diverse speaker demographics but is limited in acoustic conditions and speaking styles.

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
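For anyone trying to reproduce a similar run, the hyperparameters above map roughly onto keyword arguments of `Seq2SeqTrainingArguments` in Hugging Face Transformers. The dictionary below is a sketch of that mapping, not the authors' actual training script. The split of the global batch size into a per-device batch and gradient accumulation steps, the `save_steps` value, and the use of `load_best_model_at_end` are all assumptions made for illustration.

```python
# Assumption: the card only reports a global batch size of 256; this
# particular per-device / accumulation split is hypothetical.
per_device_batch = 64
grad_accum_steps = 4
assert per_device_batch * grad_accum_steps == 256

# Keys are Seq2SeqTrainingArguments keyword names.
training_kwargs = {
    "learning_rate": 5e-5,
    "warmup_steps": 200,
    "num_train_epochs": 5,
    "per_device_train_batch_size": per_device_batch,
    "gradient_accumulation_steps": grad_accum_steps,
    "bf16": True,
    "optim": "adamw_torch_fused",
    "eval_strategy": "steps",  # called evaluation_strategy in older releases
    "eval_steps": 50,
    "save_steps": 50,  # assumption: save whenever the model is evaluated
    "load_best_model_at_end": True,  # assumption: one way to keep the best checkpoint
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,  # lower validation loss is better
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

These could then be passed as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`, where the output directory is left to the reader.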

Training Infrastructure

  • GPU: NVIDIA H200 (141GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Training Curve

Step  100: val_loss = 0.4509
Step  200: val_loss = 0.3778
Step  300: val_loss = 0.3544
Step  400: val_loss = 0.3383
Step  550: val_loss = 0.3365 ← Best checkpoint
Step  650: val_loss = 0.3397
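Selecting the best checkpoint by validation loss amounts to taking the minimum over the logged evaluations. The snippet below illustrates this with the values listed above. It covers only the steps shown in this card, not the full evaluation log.

```python
# Validation loss at the evaluation steps listed in the training curve.
val_loss_by_step = {
    100: 0.4509,
    200: 0.3778,
    300: 0.3544,
    400: 0.3383,
    550: 0.3365,
    650: 0.3397,
}

# The best checkpoint is the step with the lowest validation loss.
best_step = min(val_loss_by_step, key=val_loss_by_step.get)
print(best_step, val_loss_by_step[best_step])  # 550 0.3365
```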

Usage

Transcription Pipeline

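import torch
from transformers import pipeline

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-cv-only-nl",
    device=device,
)

# The pipeline loads the audio file and returns a dict with a "text" key.
result = transcriber("path/to/dutch_audio.wav")
print(result["text"])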

Direct Model Usage

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "yuriyvnv/whisper-tiny-cv-only-nl"
# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# The model expects 16 kHz audio; librosa resamples to that rate on load.
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to(device)

# Generate token IDs, then decode them to text.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
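Because the model expects 16 kHz input, it can be useful to check a file's sample rate before passing it to a loader that does not resample automatically. The sketch below uses only Python's standard `wave` module, so it applies to uncompressed PCM WAV files only. The helper names and the temporary demo file are hypothetical and not part of the model's tooling.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sampling rate listed in the model details


def wav_sample_rate(path: str) -> int:
    """Read the sample rate from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str) -> bool:
    """Return True when the file is not already at the expected rate."""
    return wav_sample_rate(path) != EXPECTED_RATE


# Demo: write 10 ms of 44.1 kHz mono silence and check it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    with wave.open(demo_path, "wb") as demo:
        demo.setnchannels(1)
        demo.setsampwidth(2)  # 16-bit samples
        demo.setframerate(44_100)
        demo.writeframes(b"\x00\x00" * 441)
    demo_needs_resampling = needs_resampling(demo_path)

print(demo_needs_resampling)  # True
```

With `librosa.load(..., sr=16000)` as used above, resampling happens on load, so this check matters mainly when audio is read by other means.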

Specifying Language

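# Force Dutch transcription instead of relying on Whisper's language detection.
# `model` is the WhisperForConditionalGeneration instance loaded above.
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"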

When to Use This Model

This model is ideal for:

  • Baseline comparisons: Evaluating the impact of synthetic data augmentation
  • Real-data-only requirements: When synthetic data usage is not permitted
  • Minimal training: Fewest training steps (680) among the configurations compared

For better performance, consider the synthetic-augmented variants listed in the comparison table above.

Limitations

  • Model capacity: Whisper-Tiny (39M params) has limited representational power
  • No synthetic augmentation: Does not benefit from additional acoustic diversity
  • Domain specificity: Trained only on Common Voice; limited generalization to other domains
  • Cross-domain performance: Significant performance drop on MLS benchmark (44.85% vs 26.00%)
  • Dialect coverage: Performance may vary across Dutch regional variants

Citation

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
