Whisper-Small Portuguese - Common Voice Only (Baseline)
This model is a fine-tuned version of openai/whisper-small for Portuguese automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Portuguese without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech in ASR training.
Purpose
This baseline model demonstrates the performance achievable with the smaller Whisper-Small architecture (244M parameters) using only real, crowdsourced speech data. It serves as a reference point to evaluate:
- The effectiveness of synthetic data augmentation for smaller model architectures
- The limitations of compact models in leveraging synthetic speech
- Comparison with Large-v3 models to understand scaling effects
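Key Finding: Unlike the Large-v3 models, which show significant improvements with synthetic data, the Small model gains little from synthetic augmentation. It achieves its best in-domain results (validation loss, validation WER, and Common Voice test WER) with Common Voice data alone; only the MLS test WER is marginally lower with the High-Quality + CV mix (30.40% vs 30.69%).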
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-small |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 244M |
| Training Data | Common Voice 17.0 Portuguese (Real Speech Only) |
| Total Training Samples | 21,866 |
| Sampling Rate | 16 kHz |
Evaluation Results
This Model (whisper-small-cv-only-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.2000 |
| Validation WER | 12.68% |
| Test WER (Common Voice) | 13.87% |
| Test WER (MLS) | 30.69% |
| Best Checkpoint | Step 250 |
| Max Training Steps | 430 |
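WER is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The card does not say which WER implementation or text normalization produced the numbers above, so the following is only a minimal sketch of the standard definition; the function name and the example sentences are illustrative, not taken from the evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from the test set): one word is dropped out of five.
example_wer = word_error_rate("o gato dorme no sofá", "o gato dorme sofá")
print(f"{example_wer:.2%}")
```

Reported WER values usually depend on lowercasing and punctuation handling before scoring, which this sketch deliberately leaves out.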
Comparison with Synthetic Data Augmentation (Whisper-Small Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only (Baseline) | 430 | 0.2000 | 12.68% | 13.87% | 30.69% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.2100 | 12.98% | 14.28% | 30.40% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.2100 | 12.97% | 14.08% | 30.54% |
| All Synthetic + CV | 860 | 0.2100 | 12.94% | 14.22% | 30.85% |
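To make the size of these differences easier to read, the sketch below computes each configuration's relative change in test WER against the baseline. The WER values are copied from the table above; the dictionary layout and the helper name are illustrative.

```python
# Test WER values (%) copied from the comparison table above.
results = {
    "Common Voice Only (Baseline)": {"cv": 13.87, "mls": 30.69},
    "High-Quality (q >= 0.8) + CV": {"cv": 14.28, "mls": 30.40},
    "Mid-High (q >= 0.5) + CV": {"cv": 14.08, "mls": 30.54},
    "All Synthetic + CV": {"cv": 14.22, "mls": 30.85},
}
baseline = results["Common Voice Only (Baseline)"]


def relative_change(value: float, reference: float) -> float:
    """Signed relative change in percent; negative means lower (better) WER."""
    return (value - reference) / reference * 100


deltas = {
    name: {
        split: round(relative_change(wer[split], baseline[split]), 2)
        for split in ("cv", "mls")
    }
    for name, wer in results.items()
}

for name, d in deltas.items():
    print(f"{name:32s} CV {d['cv']:+.2f}%  MLS {d['mls']:+.2f}%")
```

On these numbers, every synthetic mix raises Common Voice test WER by roughly 1.5% to 3% relative, while the MLS changes stay within about 1% relative in either direction.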
Key Performance Characteristics
- Best in-domain performance: 13.87% Test WER (best among all Small configurations)
- Best validation loss (0.2000) - optimal convergence
- Fastest training: Fewest steps (430) among all configurations
- Demonstrates architectural limitation: Synthetic data provides marginal or negative benefit for smaller models
Why Synthetic Data Doesn't Help Small Models
The paper's findings reveal a fundamental architectural limitation:
"The Tiny and Small variants of Whisper exhibit only marginal benefits from synthetic data augmentation, revealing the limitations imposed by reduced model capacity."
Key Insights:
- Limited capacity: Compact models (244M params) struggle to disentangle subtle acoustic differences between natural and synthetic speech
- Conflicting signals: Unlike Large-v3 (1550M params), smaller models become overwhelmed by increased acoustic variability
- Diminishing returns: Synthetic data introduces complexity that these models cannot effectively reconcile
- Best strategy: For Small models, focus on high-quality real data rather than synthetic augmentation
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real crowdsourced speech |
| Synthetic Data | 0 | No synthetic augmentation |
| Total | 21,866 | |
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
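The reported maximum of 430 training steps is consistent with these hyperparameters if each epoch runs ceil(21,866 / 256) optimizer steps. The card does not state how the step count was derived, so the sketch below is a sanity check under that assumption (the last partial batch of each epoch counts as a step), not a description of the actual training script.

```python
import math

train_samples = 21_866       # training samples, from the dataset table
global_batch_size = 256      # from the hyperparameter table
max_epochs = 5
warmup_steps = 200
eval_every = 50
best_checkpoint_step = 250   # from the evaluation table

# Assumption: the final partial batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(train_samples / global_batch_size)
total_steps = steps_per_epoch * max_epochs

warmup_fraction = warmup_steps / total_steps
best_checkpoint_epoch = best_checkpoint_step / steps_per_epoch
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"warmup covers {warmup_fraction:.1%} of training")
print(f"best checkpoint at about epoch {best_checkpoint_epoch:.2f}")
print(f"evaluations at steps: {eval_points}")
```

Under this assumption warmup spans a little under half of the run, and the best checkpoint (step 250) falls shortly before the end of the third epoch.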
Training Infrastructure
- GPU: NVIDIA H200 (140GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
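from transformers import pipeline

# Build an ASR pipeline around the fine-tuned checkpoint.
# Use device="cpu" if no CUDA GPU is available.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-only-pt",
    device="cuda",
)

# Transcribe a single audio file and print the recognized text
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])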
Direct Model Usage
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-cv-only-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-cv-only-pt")
model.to("cuda")

# Load the audio and resample it to the 16 kHz rate the model expects
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)

# Convert the waveform to input features and move them to the GPU
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
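The code above resamples on load through librosa (sr=16000). When audio reaches the model by another route, it can help to confirm the file already matches the 16 kHz rate listed in the model details. The following is a minimal sketch using only the standard-library wave module, so it handles uncompressed PCM WAV files only; the helper names and the generated silence files are illustrative.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # Hz, the sampling rate listed in the model details


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True if the file is not already mono audio at the expected rate."""
    rate, channels, _ = describe_wav(path)
    return rate != expected_rate or channels != 1


def write_silence(path, rate, seconds=1.0, channels=1):
    """Write 16-bit PCM silence; only here to make the example self-contained."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds) * channels)


with tempfile.TemporaryDirectory() as tmp:
    ok_path = os.path.join(tmp, "ok_16k.wav")
    hi_path = os.path.join(tmp, "cd_44k.wav")
    write_silence(ok_path, 16_000)
    write_silence(hi_path, 44_100)
    ok_info = describe_wav(ok_path)
    ok_needs = needs_resampling(ok_path)
    hi_needs = needs_resampling(hi_path)

print(ok_info, ok_needs, hi_needs)
```

Files flagged by needs_resampling can be passed through librosa.load with sr=16000 as in the example above.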
Specifying Language
# Force Portuguese transcription instead of relying on language auto-detection
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
When to Use This Model
This model is ideal when:
- Resource-constrained deployment: Need smaller model size (244M vs 1550M params)
- Portuguese ASR with limited compute: Best Small model performance without synthetic data overhead
- Baseline comparison: Reference for evaluating synthetic data impact on smaller architectures
- Production efficiency: Faster inference than the Large-v3 Portuguese models, owing to the much smaller parameter count
Consider Large-v3 models for better accuracy:
- whisper-large-v3-high-mixed-pt: 7.94% Test WER on Common Voice (43% relative reduction vs this model)
- whisper-large-v3-cv-capes-filtered-pt: 6.89% Test WER on MLS (78% relative reduction cross-domain)
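The percentages in parentheses appear to be relative WER reductions measured against this model's test WER on the same test set. Assuming that definition, the arithmetic works out as follows:

```latex
\[
\frac{13.87 - 7.94}{13.87} \approx 0.428 \;(\approx 43\%),
\qquad
\frac{30.69 - 6.89}{30.69} \approx 0.775 \;(\approx 78\%)
\]
```

The first ratio uses the Common Voice test WER and the second uses the MLS test WER.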
Small vs Large: Architectural Impact
| Model | Params | Best Config | Test WER (CV) | Test WER (MLS) | Synthetic Benefit |
|---|---|---|---|---|---|
| Whisper-Small | 244M | CV Only | 13.87% | 30.69% | Minimal/Negative |
| Whisper-Large-v3 | 1550M | High-Quality + CV | 7.94% | 12.41% | Significant (+32.6%) |
Limitations
- Lower accuracy than Large models: 13.87% vs 7.94% WER (Large-v3)
- Limited synthetic data benefit: Adding synthetic data does not improve in-domain WER, and MLS WER changes by at most about 0.3 percentage points in either direction
- Domain specificity: Optimized for Common Voice-style speech
- Dialect coverage: Performance may vary across Portuguese regional variants
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
@article{perezhohin2024enhancing,
title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
journal={IEEE Access},
year={2024},
publisher={IEEE}
}
References
- Base Model: openai/whisper-small
- Training Data: mozilla-foundation/common_voice_17_0
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0