whisper-small-swh-finetuned

This model is a fine-tuned version of OpenAI's Whisper-small specifically for Swahili Automatic Speech Recognition (ASR). It was fine-tuned on the Swahili portion of the FLEURS-SLU dataset to significantly improve transcription accuracy for the Swahili language.

Developed by: Daniel Amemba Odhiambo
Model type: Whisper-small for Automatic Speech Recognition
Language: Swahili (swh,sw,Aswh)
License: MIT

Performance Highlights

The fine-tuned model reduces Word Error Rate (WER) on the Swahili test set from 103.10% to 32.07%, a relative improvement of about 69% over the base model.

Model	WER (%)
`openai/whisper-small` (baseline)	103.10
`adoamesh/whisper-small-swh-finetuned` (this model)	32.07

Model Description

This model is based on the openai/whisper-small architecture, a Transformer-based encoder-decoder. It was adapted to Swahili by fine-tuning on transcribed Swahili audio.

Base Model: openai/whisper-small
Fine-tuned for: 1000 steps
Training Data: Swahili split of the FLEURS-SLU dataset (3.62 hours training, 0.60 hours testing)

Uses

Direct Use

This model is intended for transcribing Swahili speech to text. Primary use cases include:

Transcription of Swahili audio and video content.
Building speech-to-text applications for Swahili.
Research in low-resource language speech recognition.

Downstream Use

The model can be fine-tuned further for specific domains or accents within Swahili.

Out-of-Scope Use

Not recommended for real-time, production-critical systems without further evaluation and testing.
Not recommended for transcribing other languages.
Must not be used for surveillance, discriminatory purposes, or any application that violates human rights.

How to Get Started with the Model

Usage - Comparison with OpenAI Whisper Small

Tested on Windows 11. The following script compares the base Whisper-small model with the fine-tuned model:

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# --- 1️⃣ Load processor (shared between models) ---
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# --- 2️⃣ Load models ---
# Fine-tuned Swahili Whisper-Small from Hugging Face Hub
finetuned_model = WhisperForConditionalGeneration.from_pretrained("adoamesh/whisper-small-swh-finetuned")
finetuned_model.generation_config.language = "swahili"
finetuned_model.generation_config.task = "transcribe"
finetuned_model.eval()

# Original OpenAI Whisper-Small
original_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
original_model.generation_config.language = "swahili"
original_model.generation_config.task = "transcribe"
original_model.eval()

# --- 3️⃣ Load audio file using librosa (mono, resampled to 16 kHz) ---
audio_path = "Recording.wav"
try:
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
except Exception as e:
    print(f"Error loading audio with librosa: {e}")
    raise SystemExit(1)

# Prepare input features (the processor accepts the 1-D NumPy array directly)
input_features = processor(
    waveform,
    sampling_rate=sample_rate,
    return_tensors="pt"
).input_features

# --- 4️⃣ Transcribe ---
with torch.no_grad():
    # Fine-tuned
    predicted_ids_finetuned = finetuned_model.generate(input_features)
    transcription_finetuned = processor.batch_decode(predicted_ids_finetuned, skip_special_tokens=True)[0]

    # Original
    predicted_ids_original = original_model.generate(input_features)
    transcription_original = processor.batch_decode(predicted_ids_original, skip_special_tokens=True)[0]

# --- 5️⃣ Print results ---
print("\n=== Transcriptions ===")
print(f"Fine-tuned Swahili Whisper-Small: {transcription_finetuned}")
print(f"Original Whisper-Small (OpenAI): {transcription_original}")

Transcription output for the base Whisper-small model and the fine-tuned model:

Fine-tuned Swahili Whisper-Small: Ni limwambia yulebindi kwamba na mpenda.
Original Whisper-Small (OpenAI):  Nili mwabi ayule bindi kwa mba nampe dha.
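The script above passes the whole recording to the processor in one call. Whisper works on 30-second input windows, so a longer recording would need to be split before transcription. The following is a minimal sketch of fixed-length splitting, assuming 16 kHz audio as in the script; `split_into_chunks` is a hypothetical helper, not part of any library, and cutting at fixed boundaries can split a word in two.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000   # matches sr=16000 in the script above
CHUNK_SECONDS = 30     # Whisper's input window length


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of at most chunk_seconds of audio."""
    chunk_len = sample_rate * chunk_seconds
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# 70 seconds of silence as stand-in audio (a NumPy array slices the same way).
dummy_audio = [0.0] * (70 * SAMPLE_RATE)
durations = [len(chunk) / SAMPLE_RATE for chunk in split_into_chunks(dummy_audio)]
print(durations)  # [30.0, 30.0, 10.0]
```

Each chunk can then go through `processor(...)` and `generate(...)` as in the script, and the decoded strings can be joined. The `transformers` ASR pipeline also accepts a `chunk_length_s` argument that handles long audio with overlapping windows, which may be preferable in practice.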

Training Details

Training Data

The model was fine-tuned on the Swahili (sw) split of the FLEURS-SLU dataset.

Dataset Split	Duration (hours)
Training	3.62
Test	0.60
Total	4.22

Training Procedure

Steps: 1000
Learning Rate: 3e-5
Batch Size: 16
Warmup Steps: 200
Gradient Checkpointing: Enabled
Precision: FP16
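The card does not state which training API was used. As a sketch only, the hyperparameters listed above map onto keyword names of `transformers.Seq2SeqTrainingArguments` as follows; treating "Batch Size: 16" as the per-device batch size, and assuming a single GPU with no gradient accumulation, are assumptions made for illustration.

```python
# Hyperparameters from the list above, keyed by the matching
# transformers.Seq2SeqTrainingArguments keyword names (assumed mapping).
training_kwargs = {
    "max_steps": 1000,
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 16,  # assumption: 16 is the per-device size
    "warmup_steps": 200,
    "gradient_checkpointing": True,
    "fp16": True,
}

# Assumption: one GPU and no gradient accumulation, so one step = one batch.
examples_seen = (
    training_kwargs["max_steps"] * training_kwargs["per_device_train_batch_size"]
)
warmup_fraction = training_kwargs["warmup_steps"] / training_kwargs["max_steps"]

print(examples_seen)    # 16000
print(warmup_fraction)  # 0.2
```

Under these assumptions the learning rate warms up over the first 20% of training. The dictionary could be unpacked into `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`, where the output directory is left to the reader.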

Training Results

Metric	Value
Final Training Loss	0.0007
Validation Loss	0.8038
Final WER	32.07%
Training Runtime	~2.3 hours
Total Epochs	19.6

Evaluation

Results vs. Baseline

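The primary evaluation metric is Word Error Rate (WER) on the Swahili test set. Fine-tuning reduces WER from 103.10% to 32.07%. The baseline exceeds 100% because WER counts inserted words as errors, so a hypothesis with many extra words can contain more errors than the reference has words.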

Model	WER (%)	Relative Improvement
`openai/whisper-small`	103.10	-
`adoamesh/whisper-small-swh-finetuned`	32.07	68.9%
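
The relative improvement follows directly from the two WER values in the table. A short sketch of the arithmetic (the function name is illustrative):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a percentage of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0


baseline = 103.10   # openai/whisper-small
finetuned = 32.07   # adoamesh/whisper-small-swh-finetuned

print(f"Absolute reduction: {baseline - finetuned:.2f} points")                    # 71.03 points
print(f"Relative reduction: {relative_wer_reduction(baseline, finetuned):.1f}%")  # 68.9%
```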

Example Transcription

Audio: "Nilimwambia yule bibi kwamba nampenda." (I told that lady that I love her.)

Model	Transcription
Fine-tuned Model	Ni limwambia yulebindi kwamba na mpenda.
Base Model	Nili mwabi aiyule bindi kwa mba nampe dha.

The fine-tuned output is closer to the reference, but it is not error-free: it splits "Nilimwambia" and "nampenda" into two words each and renders "yule bibi" as "yulebindi".
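
To show how WER is computed, the sketch below scores the two transcriptions in the table against the reference sentence. It uses a plain word-level edit distance with a simple normalization (lowercasing and stripping ASCII punctuation). This normalization is an assumption; the evaluation behind the reported 32.07% may have used a different text normalizer.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "Nilimwambia yule bibi kwamba nampenda."
finetuned = "Ni limwambia yulebindi kwamba na mpenda."
base = "Nili mwabi aiyule bindi kwa mba nampe dha."

print(f"Fine-tuned: {word_error_rate(reference, finetuned):.0%}")  # 100%
print(f"Base:       {word_error_rate(reference, base):.0%}")       # 160%
```

On this single sentence only "kwamba" matches exactly in the fine-tuned output, so every mis-split word counts as a full error. The base output has eight words against a five-word reference with no matches, which is how a WER above 100% arises.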

Limitations and Bias

Dialectal Bias: Performance may vary across different Swahili dialects and accents not well-represented in the training data.
Domain Specificity: The model may struggle with technical terminology, slang, or domain-specific vocabulary (e.g., medical, legal).
Audio Quality: Performance will degrade with poor-quality, noisy, or low-fidelity audio recordings.
Data Scale: The model was trained on a relatively small dataset (4.22 hours). Performance can likely be improved with more diverse and extensive Swahili audio data.

Environmental Impact

Hardware Type: 1 x NVIDIA GPU (exact type not specified)
Hours used: ~1.3 hours

Citation

If you use this model, please cite both the original Whisper paper and this model.

@misc{odhiambo2025whispersmallswahili,
  author = {Odhiambo, Daniel Amemba},
  title = {Whisper Small Swahili Fine-tuned},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/adoamesh/whisper-small-swh-finetuned}}
}

@misc{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year = {2022},
  eprint = {2212.04356},
  archivePrefix = {arXiv},
  primaryClass = {eess.AS}
}