SpecDox: Faster Whisper Urdu-to-English Model (CTranslate2)

This is the CTranslate2 / Faster Whisper version of the highly optimized SpecDox Whisper Medium model. It performs Automatic Speech Recognition (ASR) and Audio Translation, taking spoken Urdu (اردو) and converting it into written English text.

This repository contains model weights designed for production environments. By utilizing the CTranslate2 engine, this model runs up to 4x faster and uses significantly less VRAM than the standard Hugging Face Transformers implementation, making it ideal for real-time and edge deployment.


🚀 Key Features

  • Blazing Fast Inference: Powered by CTranslate2 for real-time translation on both CPU and GPU.
  • Low VRAM Footprint: Fits on consumer-grade GPUs or lightweight cloud instances.
  • Massive Training Data: Trained on 127 hours of Urdu-to-English speech, expanded to 172 hours via data augmentation.
  • PEFT / LoRA Optimized: Fine-tuned with LoRA adapters, then merged and converted to CTranslate2 format.
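The CPU and GPU support mentioned above maps to two faster-whisper settings, `device` and `compute_type`. The following is a minimal sketch of choosing them at runtime, assuming the pairing this card suggests (float16 on CUDA, int8 on CPU). The helper names `pick_runtime` and `detect_cuda_device_count` are hypothetical and exist only for illustration; `ctranslate2.get_cuda_device_count()` is the CTranslate2 call used to count visible GPUs.

```python
def pick_runtime(cuda_device_count: int) -> dict:
    """Return WhisperModel keyword arguments for the available hardware.

    Hypothetical helper: float16 on GPU, int8 on CPU, following the
    settings suggested in this card's usage section.
    """
    if cuda_device_count > 0:
        return {"device": "cuda", "compute_type": "float16"}
    return {"device": "cpu", "compute_type": "int8"}


def detect_cuda_device_count() -> int:
    """Count CUDA devices visible to CTranslate2, or 0 if it is not installed."""
    try:
        import ctranslate2
    except ImportError:
        return 0
    return ctranslate2.get_cuda_device_count()


if __name__ == "__main__":
    print(pick_runtime(detect_cuda_device_count()))
```

The returned dictionary can be unpacked into the constructor, for example `WhisperModel(model_path, **pick_runtime(detect_cuda_device_count()))`. Other compute types may suit specific hardware better; this sketch only covers the two named in this card.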

📊 Evaluation & Performance

The table below compares SpecDox models against standard baselines. Converting to Faster Whisper keeps accuracy nearly unchanged (WER 36.25 vs. 36.28, BLEU 53.30 vs. 53.24) while improving throughput.

| Model | WER % ↓ | BLEU ↑ | METEOR ↑ | BERTScore F1 ↑ | Rank |
|---|---|---|---|---|---|
| SpecDox-Whisper-Medium (Standard) | 36.25 | 53.30 | 0.7804 | 0.9405 | #1 |
| SpecDox-faster-medium (Faster Whisper) | 36.28 | 53.24 | 0.7811 | 0.9402 | #2 |
| Whisper Large-v3 | 42.88 | 46.86 | 0.7105 | 0.9270 | #3 |
| Whisper Medium (Baseline) | 45.33 | 44.16 | 0.6882 | 0.9226 | #4 |
| SeamlessM4T Medium | 72.04 | 18.84 | 0.3697 | 0.8429 | #5 |
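For readers unfamiliar with the WER column, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions for illustration. The evaluation script and text normalization behind the table are not described in this card and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Assumed normalization: lowercase and whitespace tokenization only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution in a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.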

Engineering Takeaway: The Faster Whisper version of SpecDox lowers WER by 6.6 percentage points (absolute) relative to OpenAI's Whisper Large-v3 (36.28% vs. 42.88%), at a fraction of the computational cost.
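The arithmetic behind this comparison can be checked directly from the WER column of the table. The sketch below computes both the absolute reduction (in percentage points) and the relative reduction (as a percentage of the baseline WER). The helper name `wer_reduction` is hypothetical; the input values are taken from the table.

```python
def wer_reduction(baseline_wer: float, candidate_wer: float) -> tuple:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = baseline_wer - candidate_wer
    relative = 100.0 * absolute / baseline_wer
    return round(absolute, 2), round(relative, 1)


if __name__ == "__main__":
    specdox_faster = 36.28  # SpecDox-faster-medium WER from the table
    baselines = {
        "Whisper Large-v3": 42.88,
        "Whisper Medium (Baseline)": 45.33,
    }
    for name, wer in baselines.items():
        absolute, relative = wer_reduction(wer, specdox_faster)
        print(f"{name}: -{absolute} points absolute, -{relative}% relative")
```

Against Whisper Large-v3 this gives 6.6 points absolute, or about 15.4% relative. Keeping the two figures apart avoids reading "6.6%" as a relative improvement.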


💻 Usage

This model uses the CTranslate2 engine. Use the faster-whisper library instead of transformers.

1. Install

pip install faster-whisper

2. Run Inference

from faster_whisper import WhisperModel

# Load model from HuggingFace Hub or a local path
model_path = "Shzaib/SpecDox-Faster-Whisper"

# GPU with FP16; use device="cpu" and compute_type="int8" if no GPU is available
model = WhisperModel(model_path, device="cuda", compute_type="float16")

# Translate Urdu audio to English
audio_file = "path/to/your/urdu_audio.wav"

# task="translate" produces English output; language="ur" skips language detection
segments, info = model.transcribe(audio_file, task="translate", language="ur")

# language="ur" is forced above, so this reports the configured language, not a detection result
print(f"Language '{info.language}' (probability {info.language_probability:.2f})")
print("--- Translation ---")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
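If the translated segments need to be saved as subtitles, they can be written in SRT format. The sketch below is one possible approach, not part of this model or of faster-whisper: `DemoSegment`, `format_timestamp`, and `segments_to_srt` are hypothetical names, and `DemoSegment` only mimics the `start`, `end`, and `text` attributes used in the loop above. Note that faster-whisper returns `segments` as a generator, so it can be consumed only once; pass it either to the print loop or to a helper like this one, or materialize it first with `list(segments)`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DemoSegment:
    """Stand-in with the attributes read from faster-whisper segments."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable) -> str:
    """Build SRT text from objects exposing start, end, and text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_range}\n{seg.text.strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        DemoSegment(0.0, 2.5, " Hello there."),
        DemoSegment(2.5, 4.0, " How are you?"),
    ]
    print(segments_to_srt(demo))
```

With real output, `segments_to_srt(segments)` can be called in place of the print loop and the result written to a `.srt` file.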

📄 License

This model is released under the Apache 2.0 License.
