---
library_name: transformers
language:
- ar
license: apache-2.0
base_model: openai/whisper-medium
tags:
- generated_from_trainer
datasets:
- ymoslem/MediaSpeech
- google/fleurs
- UBC-NLP/Casablanca
- fixie-ai/common_voice_17_0
- deepdml/Tunisian_MSA
metrics:
- wer
model-index:
- name: Whisper Medium ar
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0
      type: fixie-ai/common_voice_17_0
    metrics:
    - name: Wer
      type: wer
      value: 20.467857733056682
---

# Whisper Medium ar

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the Common Voice 17.0 dataset.
It achieves the following results on the evaluation set:
- Loss: 0.2149
- Wer: 20.4679
- Cer: 5.6352
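
The WER and CER figures above appear to be percentages. As a minimal sketch of how such scores are commonly computed (edit distance divided by reference length), the helper names below are illustrative; the exact text normalization and metric implementation used for this evaluation are not stated in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent (words split on whitespace)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (spaces counted as characters, an assumption)."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Both functions assume a non-empty reference; corpus-level scores are usually obtained by summing edits and reference lengths over all utterances rather than averaging per-utterance rates.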

## Model description

More information needed

## Intended uses & limitations

More information needed
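
A minimal inference sketch, assuming the checkpoint is published under the repository ID given in the citation URL (`deepdml/whisper-medium-ar-mix-norm`) and that `transformers` and an audio backend such as `ffmpeg` are installed. The `transcribe` helper is a hypothetical name, not part of any library.

```python
MODEL_ID = "deepdml/whisper-medium-ar-mix-norm"  # assumed from the citation URL


def transcribe(audio_path):
    """Transcribe one audio file to Arabic text."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(
        audio_path,
        # Force Arabic transcription instead of relying on language detection.
        generate_kwargs={"language": "arabic", "task": "transcribe"},
    )
    return result["text"]
```

Calling `transcribe("sample.wav")` with a local recording should return the transcript as a string; the file name here is only a placeholder.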

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.04
- training_steps: 18000
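
To make the schedule concrete, here is a small sketch of a linear schedule with warmup using the values listed above. It assumes the warmup length is derived as `ceil(training_steps * warmup_ratio)` and that the rate decays linearly to zero at the final step; `linear_lr` is an illustrative helper, not the Trainer's own code.

```python
import math

PEAK_LR = 1e-5          # learning_rate
TOTAL_STEPS = 18000     # training_steps
WARMUP_RATIO = 0.04     # lr_scheduler_warmup_ratio
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)  # assumed derivation


def linear_lr(step):
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from the peak rate down to 0 at the last step.
    remaining = TOTAL_STEPS - step
    return PEAK_LR * max(0.0, remaining / (TOTAL_STEPS - WARMUP_STEPS))
```

Under these assumptions the warmup covers the first 720 steps, so the first logged checkpoint at step 1000 is already in the decay phase.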

### Training results

| Training Loss | Epoch  | Step  | Validation Loss | Wer     | Cer    |
|:-------------:|:------:|:-----:|:---------------:|:-------:|:------:|
| 0.4929        | 0.0556 | 1000  | 0.3300          | 28.9234 | 9.0009 |
| 0.2883        | 0.1111 | 2000  | 0.2984          | 27.7612 | 7.8800 |
| 0.1420        | 0.1667 | 3000  | 0.2847          | 25.8332 | 7.5636 |
| 0.0746        | 0.2222 | 4000  | 0.2812          | 25.1152 | 7.3684 |
| 0.0501        | 0.2778 | 5000  | 0.2702          | 24.9463 | 7.1645 |
| 0.0421        | 0.3333 | 6000  | 0.2640          | 24.9610 | 7.1298 |
| 0.0292        | 0.3889 | 7000  | 0.2574          | 23.3984 | 6.6850 |
| 0.0291        | 0.4444 | 8000  | 0.2575          | 23.1523 | 6.5031 |
| 0.0216        | 0.5000 | 9000  | 0.2555          | 24.4983 | 6.7680 |
| 0.0179        | 0.5556 | 10000 | 0.2440          | 22.4142 | 6.1291 |
| 0.0166        | 0.6111 | 11000 | 0.2416          | 21.7183 | 6.0801 |
| 0.0104        | 0.6667 | 12000 | 0.2405          | 22.0525 | 6.1413 |
| 0.0107        | 0.7222 | 13000 | 0.2457          | 22.5336 | 6.1634 |
| 0.0100        | 0.7778 | 14000 | 0.2374          | 21.2758 | 5.8735 |
| 0.0155        | 0.8333 | 15000 | 0.2317          | 22.0727 | 5.9926 |
| 0.0081        | 0.8889 | 16000 | 0.2285          | 20.8296 | 5.7606 |
| 0.0051        | 0.9444 | 17000 | 0.2250          | 20.7121 | 5.6673 |
| 0.0067        | 1.0000 | 18000 | 0.2149          | 20.4679 | 5.6352 |
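
As a quick reading aid for the table, the sketch below computes the relative error reduction between the first and last logged checkpoints. The numbers are copied from the rows for steps 1000 and 18000; `relative_reduction` is an illustrative helper.

```python
# (step, WER, CER) for the first and last logged checkpoints in the table.
FIRST = (1000, 28.9234, 9.0009)
LAST = (18000, 20.4679, 5.6352)


def relative_reduction(before, after):
    """Relative reduction in percent going from `before` to `after`."""
    return 100.0 * (before - after) / before


wer_gain = relative_reduction(FIRST[1], LAST[1])
cer_gain = relative_reduction(FIRST[2], LAST[2])
print(f"WER reduced by {wer_gain:.1f}%, CER reduced by {cer_gain:.1f}%")
```

This works out to roughly a 29% relative WER reduction and a 37% relative CER reduction over the logged portion of training.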


### Framework versions

- Transformers 4.48.0.dev0
- Pytorch 2.5.1+cu121
- Datasets 3.6.0
- Tokenizers 0.21.0

## Citation

Please cite the model using the following BibTeX entry:

```bibtex
@misc{deepdml/whisper-medium-ar-mix-norm,
  title={Fine-tuned Whisper medium ASR model for speech recognition in Arabic},
  author={Jimenez, David},
  howpublished={\url{https://huggingface.co/deepdml/whisper-medium-ar-mix-norm}},
  year={2026}
}
```