---
base_model: openai/whisper-tiny
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- whisper
- hf-asr-leaderboard
---

Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce **LiteASR**, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency.

For more technical details, see our [GitHub repository](https://github.com/efeslab/LiteASR) and [paper](https://arxiv.org/abs/2502.20583).
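
As an illustration of the core idea, here is a minimal, hypothetical sketch (not the authors' implementation) of approximating a linear layer with a chain of two low-rank matrix multiplications, using PCA over the layer's outputs on a small calibration set:

```python
import torch

def low_rank_factorize(weight: torch.Tensor, calib_inputs: torch.Tensor, rank: int):
    """Factor y = x @ weight.T into y ~ (x @ w1.T) @ w2.T (bias omitted).

    weight: (out_dim, in_dim); calib_inputs: (n_samples, in_dim).
    """
    # Collect the layer's outputs on calibration data and run PCA on them.
    outputs = calib_inputs @ weight.T            # (n_samples, out_dim)
    outputs = outputs - outputs.mean(dim=0)      # center before PCA
    _, _, vh = torch.linalg.svd(outputs, full_matrices=False)
    proj = vh[:rank]                             # (rank, out_dim): top principal components
    # Chain of two low-rank matmuls: in_dim -> rank -> out_dim.
    w1 = proj @ weight                           # (rank, in_dim)
    w2 = proj.T                                  # (out_dim, rank)
    return w1, w2

# Example: a rank-64 approximation of a 1280x1280 layer cuts per-token
# multiply-accumulates from 1280 * 1280 to 64 * (1280 + 1280).
w = torch.randn(1280, 1280)
calib = torch.randn(256, 1280)
w1, w2 = low_rank_factorize(w, calib, rank=64)
y_approx = (torch.randn(1, 1280) @ w1.T) @ w2.T
```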

## Sample Usage

The easiest way to run our model is through our integration with the Hugging Face Transformers library.
We provide model weights for compressed versions of the OpenAI Whisper series [here](https://huggingface.co/efficient-speech).

```python
import librosa
import torch
from transformers import AutoProcessor, AutoModel

device = "cuda:0"
dtype = torch.float16

# load the compressed Whisper model
model = AutoModel.from_pretrained(
    "efficient-speech/lite-whisper-large-v3-turbo",
    trust_remote_code=True,
)
model.to(dtype).to(device)

# we use the same processor as the original model
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

# set the path to your audio file
path = "path/to/audio.wav"
audio, _ = librosa.load(path, sr=16000)

input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(dtype).to(device)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True,
)[0]

print(transcription)
```
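
As a quick sanity check, you can count the loaded model's parameters and compare against the encoder and decoder sizes reported below (a generic PyTorch snippet, not part of the LiteASR API):

```python
# Total parameter count of the loaded model, in millions.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e6:.2f}M")
```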

## Benchmark Results

The following table reports the average word error rate (WER) evaluated on the [ESB datasets](https://huggingface.co/datasets/hf-audio/esb-datasets-test-only-sorted). For each model size, the `-acc` variant applies milder compression (larger encoder, typically lower WER) and the `-fast` variant applies more aggressive compression (smaller encoder):

| Model | Average WER (↓) | Encoder Size | Decoder Size |
|-------|-----------------|--------------|--------------|
| [whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 22.01 | 7.63M | 29.55M |
| [lite-whisper-tiny-acc](https://huggingface.co/efficient-speech/lite-whisper-tiny-acc) | 22.97 | 7.41M | 29.55M |
| [lite-whisper-tiny](https://huggingface.co/efficient-speech/lite-whisper-tiny) | 23.95 | 7.00M | 29.55M |
| [lite-whisper-tiny-fast](https://huggingface.co/efficient-speech/lite-whisper-tiny-fast) | 27.09 | 6.48M | 29.55M |
| | | | |
| [whisper-base](https://huggingface.co/openai/whisper-base) | 17.67 | 19.82M | 52.00M |
| [lite-whisper-base-acc](https://huggingface.co/efficient-speech/lite-whisper-base-acc) | 19.07 | 18.64M | 52.00M |
| [lite-whisper-base](https://huggingface.co/efficient-speech/lite-whisper-base) | 19.71 | 17.44M | 52.00M |
| [lite-whisper-base-fast](https://huggingface.co/efficient-speech/lite-whisper-base-fast) | 23.05 | 16.07M | 52.00M |
| | | | |
| [whisper-small](https://huggingface.co/openai/whisper-small) | 15.89 | 87.00M | 153.58M |
| [lite-whisper-small-acc](https://huggingface.co/efficient-speech/lite-whisper-small-acc) | 15.37 | 76.99M | 153.58M |
| [lite-whisper-small](https://huggingface.co/efficient-speech/lite-whisper-small) | 14.96 | 70.16M | 153.58M |
| [lite-whisper-small-fast](https://huggingface.co/efficient-speech/lite-whisper-small-fast) | 14.92 | 63.11M | 153.58M |
| | | | |
| [whisper-medium](https://huggingface.co/openai/whisper-medium) | 15.12 | 305.68M | 456.64M |
| [lite-whisper-medium-acc](https://huggingface.co/efficient-speech/lite-whisper-medium-acc) | 13.46 | 269.93M | 456.64M |
| [lite-whisper-medium](https://huggingface.co/efficient-speech/lite-whisper-medium) | 14.50 | 239.99M | 456.64M |
| [lite-whisper-medium-fast](https://huggingface.co/efficient-speech/lite-whisper-medium-fast) | 14.52 | 215.31M | 456.64M |
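
For reference, WER counts word-level substitutions, insertions, and deletions against the reference transcript, divided by the number of reference words. A minimal sketch using the `jiwer` library (illustrative only; the official evaluation typically also applies text normalization before scoring):

```python
import jiwer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over a lazy dog"]

# WER = (substitutions + insertions + deletions) / reference word count
wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer * 100:.2f}%")
```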

## Citation

If you use LiteASR in your research, please cite the following paper:

```bibtex
@misc{kamahori2025liteasrefficientautomaticspeech,
  title={LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation},
  author={Keisuke Kamahori and Jungo Kasai and Noriyuki Kojima and Baris Kasikci},
  year={2025},
  eprint={2502.20583},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.20583},
}
```