---
library_name: transformers
tags:
- asr
license: cc-by-nc-4.0
language:
- ar
pipeline_tag: automatic-speech-recognition
---
# ArTST-v3 (ASR task)

ArTST model fine-tuned for automatic speech recognition (speech-to-text) on QASR. This checkpoint is best suited to dialectal Arabic varieties.

### Model Description

- **Developed by:** Speech Lab, MBZUAI
- **Model type:** SpeechT5
- **Language:** Arabic

## How to Get Started with the Model

```python
import soundfile as sf
import torch
from transformers import (
    SpeechT5ForSpeechToText,
    SpeechT5Processor,
    SpeechT5Tokenizer,
)

# Run on GPU when available; PyTorch device strings are lowercase
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "mbzuai/artst_asr_v3_qasr"

tokenizer = SpeechT5Tokenizer.from_pretrained(model_id)
processor = SpeechT5Processor.from_pretrained(model_id, tokenizer=tokenizer)
model = SpeechT5ForSpeechToText.from_pretrained(model_id).to(device)

# sf.read returns the waveform and the file's native sampling rate
audio, sr = sf.read("audio.wav")

inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
# Beam search with 10 beams; the output is capped at 150 tokens
predicted_ids = model.generate(**inputs.to(device), max_length=150, num_beams=10)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```

Alternatively, use the `pipeline` API:

```python
import librosa
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float32  # full precision, matching the model loaded below

model_id = "mbzuai/artst_asr_v3"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Load the audio resampled to 16 kHz
wav, sr = librosa.load("audio.wav", sr=16000)
result = pipe(wav, generate_kwargs={"num_beams": 10, "early_stopping": True})
print(result["text"])
```
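
The two examples above treat the sampling rate differently: the first passes whatever rate `sf.read` reports, while the pipeline example resamples to 16 kHz on load. If a recording is stereo or was not recorded at 16 kHz, it may help to normalize it before calling the processor. The following is a minimal sketch of such a step, not part of the ArTST codebase. The 16 kHz target is taken from the pipeline example, `to_mono_16k` and `TARGET_SR` are hypothetical names, and linear interpolation is used only to keep the sketch dependency-light; a dedicated resampler such as `librosa.resample` will generally give better audio quality.

```python
import numpy as np

# Assumption: 16 kHz target, taken from librosa.load(..., sr=16000) in the pipeline example
TARGET_SR = 16_000


def to_mono_16k(audio, sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Downmix to mono and linearly resample to target_sr (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # soundfile returns multi-channel audio as (frames, channels)
        audio = audio.mean(axis=1)
    if sr == target_sr or audio.size == 0:
        return audio
    n_in = audio.shape[0]
    n_out = int(round(n_in * target_sr / sr))
    # Position of each output sample, measured in input-sample units
    positions = np.arange(n_out) * (sr / target_sr)
    # np.interp clamps positions past the last input sample to its value
    return np.interp(positions, np.arange(n_in), audio).astype(np.float32)
```

In the first example, this would slot in right after `sf.read`: call `audio = to_mono_16k(audio, sr)` and then pass `sampling_rate=16000` to the processor.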

### Model Sources

- **Repository:** [GitHub](https://github.com/mbzuai-nlp/ArTST)
- **Paper:** [arXiv](https://arxiv.org/pdf/2411.05872)

## Citation

**BibTeX:**
```
@misc{djanibekov2024dialectalcoveragegeneralizationarabic,
    title={Dialectal Coverage And Generalization in Arabic Speech Recognition},
    author={Amirbek Djanibekov and Hawau Olamide Toyin and Raghad Alshalan and Abdullah Alitr and Hanan Aldarmaki},
    year={2024},
    eprint={2411.05872},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2411.05872},
}

@inproceedings{toyin-etal-2023-artst,
    title = "{A}r{TST}: {A}rabic Text and Speech Transformer",
    author = "Toyin, Hawau and
      Djanibekov, Amirbek and
      Kulkarni, Ajinkya and
      Aldarmaki, Hanan",
    booktitle = "Proceedings of ArabicNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.arabicnlp-1.5",
    doi = "10.18653/v1/2023.arabicnlp-1.5",
    pages = "41--51",
}
```