whisper-small-swh-finetuned

This model is a fine-tuned version of OpenAI's Whisper-small specifically for Swahili Automatic Speech Recognition (ASR). It was fine-tuned on the Swahili portion of the FLEURS-SLU dataset to significantly improve transcription accuracy for the Swahili language.

  • Developed by: Daniel Amemba Odhiambo
  • Model type: Whisper-small for Automatic Speech Recognition
  • Language: Swahili (sw, swh)
  • License: MIT

Performance Highlights

The fine-tuned model achieves a 68.9% relative reduction in Word Error Rate (WER) over the base model on the Swahili test set, from 103.10% to 32.07%.

| Model | WER (%) |
| --- | --- |
| openai/whisper-small (baseline) | 103.10 |
| adoamesh/whisper-small-swh-finetuned (this model) | 32.07 |
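
The relative figure follows directly from the two WER values above. A minimal sketch of the arithmetic, with illustrative variable names that are not taken from the training code:

```python
# WER values reported for the Swahili test set (percent).
baseline_wer = 103.10
finetuned_wer = 32.07

# Relative WER reduction = (baseline - fine-tuned) / baseline.
relative_reduction = (baseline_wer - finetuned_wer) / baseline_wer
print(f"Relative WER reduction: {relative_reduction:.1%}")
```

A baseline WER above 100% is not a typo: insertions count as errors, so a hypothesis with many spurious words can contain more errors than the reference has words.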

Model Description

This model is based on the openai/whisper-small architecture, a Transformer encoder-decoder. It has been adapted for Swahili by supervised fine-tuning on transcribed Swahili audio.

  • Base Model: openai/whisper-small
  • Fine-tuned for: 1000 steps
  • Training Data: Swahili split of the FLEURS-SLU dataset (3.62 hours training, 0.60 hours testing)

Uses

Direct Use

This model is intended for transcribing Swahili speech to text. Primary use cases include:

  • Transcription of Swahili audio and video content.
  • Building speech-to-text applications for Swahili.
  • Research in low-resource language speech recognition.

Downstream Use

The model can be fine-tuned further for specific domains or accents within Swahili.

Out-of-Scope Use

  • Not recommended for real-time, production-critical systems without further evaluation and testing.
  • Not recommended for transcribing other languages.
  • Must not be used for surveillance, discriminatory purposes, or any application that violates human rights.

How to Get Started with the Model

Usage - Comparison with the Base OpenAI Whisper-small Model

Tested on Windows 11. The following script transcribes the same audio file with the base Whisper-small model and with the fine-tuned model:
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# --- 1️⃣ Load processor (shared between models) ---
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# --- 2️⃣ Load models ---
# Fine-tuned Swahili Whisper-Small from Hugging Face Hub
finetuned_model = WhisperForConditionalGeneration.from_pretrained("adoamesh/whisper-small-swh-finetuned")
finetuned_model.generation_config.language = "swahili"
finetuned_model.generation_config.task = "transcribe"
finetuned_model.eval()

# Original OpenAI Whisper-Small
original_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
original_model.generation_config.language = "swahili"
original_model.generation_config.task = "transcribe"
original_model.eval()

# --- 3️⃣ Load audio file using librosa (resampled to 16 kHz mono) ---
audio_path = "Recording.wav"
try:
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
except Exception as e:
    print(f"Error loading audio with librosa: {e}")
    raise SystemExit(1)

# Prepare input features directly from the NumPy waveform
input_features = processor(
    waveform,
    sampling_rate=sample_rate,
    return_tensors="pt"
).input_features

# --- 4️⃣ Transcribe ---
with torch.no_grad():
    # Fine-tuned
    predicted_ids_finetuned = finetuned_model.generate(input_features)
    transcription_finetuned = processor.batch_decode(predicted_ids_finetuned, skip_special_tokens=True)[0]

    # Original
    predicted_ids_original = original_model.generate(input_features)
    transcription_original = processor.batch_decode(predicted_ids_original, skip_special_tokens=True)[0]

# --- 5️⃣ Print results ---
print("\n=== Transcriptions ===")
print(f"Fine-tuned Swahili Whisper-Small: {transcription_finetuned}")
print(f"Original Whisper-Small (OpenAI): {transcription_original}")
Example output from the script above:
Fine-tuned Swahili Whisper-Small: Ni limwambia yulebindi kwamba na mpenda.
Original Whisper-Small (OpenAI):  Nili mwabi ayule bindi kwa mba nampe dha.
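
If only the fine-tuned transcription is needed, the comparison script can likely be shortened with the Transformers `pipeline` API. This is a sketch under assumptions rather than a tested recipe for this repository: `transcribe_swahili` is a hypothetical helper, the tokenizer and feature extractor are taken from openai/whisper-small as in the script above, and decoding a file path inside the pipeline requires ffmpeg to be installed.

```python
MODEL_ID = "adoamesh/whisper-small-swh-finetuned"
BASE_ID = "openai/whisper-small"  # processor source, as in the comparison script
GENERATE_KWARGS = {"language": "swahili", "task": "transcribe"}


def transcribe_swahili(audio_path: str) -> str:
    """Transcribe one audio file with the fine-tuned model (hypothetical helper)."""
    # Imported lazily so that defining the helper does not load any weights.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        tokenizer=BASE_ID,
        feature_extractor=BASE_ID,
    )
    result = asr(audio_path, generate_kwargs=GENERATE_KWARGS)
    return result["text"]

# Example call (downloads the model on first use):
# print(transcribe_swahili("Recording.wav"))
```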

Training Details

Training Data

The model was fine-tuned on the Swahili (sw) split of the FLEURS-SLU dataset.

| Dataset Split | Duration (hours) |
| --- | --- |
| Training | 3.62 |
| Test | 0.60 |
| Total | 4.22 |

Training Procedure

  • Steps: 1000
  • Learning Rate: 3e-5
  • Batch Size: 16
  • Warmup Steps: 200
  • Gradient Checkpointing: Enabled
  • Precision: FP16
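
For readers who want to set up a similar run, the hyperparameters above can be written as keyword arguments using the parameter names of `Seq2SeqTrainingArguments` in Hugging Face Transformers. This is a sketch, not the author's training script: mapping "Batch Size" to `per_device_train_batch_size` and the linear warmup-then-decay schedule (the Trainer default) are assumptions, and `lr_at_step` is a hypothetical helper for illustration.

```python
# Hyperparameters from the list above, keyed by the corresponding
# Seq2SeqTrainingArguments parameter names (assumed mapping).
training_kwargs = {
    "max_steps": 1000,
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 16,  # assumes one GPU, no accumulation
    "warmup_steps": 200,
    "gradient_checkpointing": True,
    "fp16": True,
}


def lr_at_step(step: int, kwargs: dict = training_kwargs) -> float:
    """Learning rate under linear warmup then linear decay (assumed schedule)."""
    peak = kwargs["learning_rate"]
    warmup = kwargs["warmup_steps"]
    total = kwargs["max_steps"]
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


# Print the learning rate at a few points of the 1000-step run.
for step in (0, 100, 200, 600, 1000):
    print(step, f"{lr_at_step(step):.2e}")
```

Under these assumptions the rate climbs to its 3e-5 peak at step 200 and decays to zero at step 1000.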

Training Results

| Metric | Value |
| --- | --- |
| Final Training Loss | 0.0007 |
| Validation Loss | 0.8038 |
| Final WER | 32.07% |
| Training Runtime | ~2.3 hours |
| Total Epochs | 19.6 |
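
The epoch count can be cross-checked against the step count and batch size. The sketch below assumes an effective batch size of 16 (single GPU, no gradient accumulation), which the card does not state explicitly; the derived example count and clip length are estimates, not reported figures.

```python
steps = 1000
batch_size = 16      # assumed to be the effective batch size
epochs = 19.6
train_hours = 3.62

# Total training examples processed, then examples in one pass over the data.
examples_seen = steps * batch_size
examples_per_epoch = examples_seen / epochs

# Average clip length implied by the training-set duration.
avg_clip_seconds = train_hours * 3600 / examples_per_epoch

print(f"Examples per epoch: {examples_per_epoch:.0f}")
print(f"Average clip length: {avg_clip_seconds:.1f} s")
```

Roughly 800 training utterances averaging about 16 seconds each would be consistent with the 3.62 hours of training audio.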

Evaluation

Results vs. Baseline

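The primary evaluation metric is Word Error Rate (WER) on the Swahili test set. Fine-tuning reduces WER from 103.10% to 32.07%. A WER above 100% is possible because inserted words count as errors, so the base model's output contained more word errors than the references had words.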

| Model | WER (%) | Relative WER Reduction |
| --- | --- | --- |
| openai/whisper-small | 103.10 | - |
| adoamesh/whisper-small-swh-finetuned | 32.07 | 68.9% |

Example Transcription

Audio: "Nilimwambia yule bibi kwamba nampenda." (I told that lady that I love her.)

| Model | Transcription |
| --- | --- |
| Fine-tuned Model | Ni limwambia yulebindi kwamba na mpenda. |
| Base Model | Nili mwabi aiyule bindi kwa mba nampe dha. |

The fine-tuned model's transcription is closer to the reference than the base model's, although it still splits and merges words incorrectly and renders "bibi" as "bindi".
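
To make the comparison concrete, WER can be computed for this one sentence as the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python illustration; the normalization (lowercasing and stripping ASCII punctuation) is an assumption, since the card does not say how the reported test-set WER was normalized or which library computed it.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Nilimwambia yule bibi kwamba nampenda."
finetuned = "Ni limwambia yulebindi kwamba na mpenda."
base = "Nili mwabi aiyule bindi kwa mba nampe dha."

print(f"Fine-tuned WER: {word_error_rate(reference, finetuned):.0%}")
print(f"Base WER: {word_error_rate(reference, base):.0%}")
```

On this sentence the sketch gives 100% for the fine-tuned output and 160% for the base output. Most of the fine-tuned model's errors here are word-boundary errors, which word-level WER penalizes heavily, so a single sentence is not representative of the test-set figure.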

Limitations and Bias

  • Dialectal Bias: Performance may vary across different Swahili dialects and accents not well-represented in the training data.
  • Domain Specificity: The model may struggle with technical terminology, slang, or domain-specific vocabulary (e.g., medical, legal).
  • Audio Quality: Performance will degrade with poor-quality, noisy, or low-fidelity audio recordings.
  • Data Scale: The model was trained on a relatively small dataset (4.22 hours). Performance can likely be improved with more diverse and extensive Swahili audio data.

Environmental Impact

  • Hardware Type: 1 x NVIDIA GPU (exact type not specified)
  • Hours used: ~1.3 hours

Citation

If you use this model, please cite both the original Whisper paper and this model.

@misc{odhiambo2025whispersmallswahili,
  author = {Odhiambo, Daniel Amemba},
  title = {Whisper Small Swahili Fine-tuned},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/adoamesh/whisper-small-swh-finetuned}}
}
@misc{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year = {2022},
  eprint = {2212.04356},
  archivePrefix = {arXiv},
  primaryClass = {eess.AS}
}

Acknowledgements


This model card was authored by Daniel Amemba Odhiambo and redesigned with the help of an AI assistant.
