---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- agkphysics/AudioSet
language:
- en
---
# USAD: Universal Speech and Audio Representation via Distillation

**Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers.
Trained on 126k hours of mixed speech, sound, and music data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

[πŸ‘€ **Read Full Paper**](https://arxiv.org/abs/2506.18843)

---

## πŸ—‚οΈ Models

All USAD models are Transformer encoders operating at a **50 Hz frame rate**, i.e., one output frame per 320 samples of 16 kHz audio (see the sketch below the table). The teacher models are **WavLM Base+** (speech) and **ATST Frame** (general audio).

| Model      | Parameters | Hidden Dim | Layers | Checkpoint                                        |
| ---------- | ---------- | ---------- | ------ | ------------------------------------------------- |
| USAD Small | 24M        | 384        | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
| USAD Base  | 94M        | 768        | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
| USAD Large | 330M       | 1024       | 24     | [link](https://huggingface.co/MIT-SLS/USAD-Large) |
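
Since the frame rate is fixed at 50 Hz, the output sequence length is easy to predict from the input duration. A minimal sanity-check sketch (the exact count may differ by a frame or two depending on the model's internal padding):

```python
# Rough frame-count estimate for a 50 Hz encoder on 16 kHz audio.
SAMPLE_RATE = 16_000
FRAME_RATE = 50
HOP = SAMPLE_RATE // FRAME_RATE  # 320 samples per frame (20 ms)

def expected_frames(num_samples: int) -> int:
    """Approximate number of output frames for a waveform of num_samples."""
    return num_samples // HOP

print(expected_frames(10 * SAMPLE_RATE))  # ~500 frames for 10 s of audio
```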

---


## πŸš€ How To Use

**Installation**
```bash
pip install -U transformers
```

**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16 kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model.
# You can also load waveforms yourself, e.g. with torchaudio.load (see below).

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]:             final encoder output (batch_size, seq_len, encoder_dim)
# results["mel"]:           Mel filterbank features (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]: list of per-layer outputs, each (batch_size, seq_len, encoder_dim)
# results["ffn"]:           list of per-layer feed-forward outputs, each (batch_size, seq_len, encoder_dim)
```
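
If you would rather load and preprocess waveforms yourself (as the comment above suggests), the sketch below uses `torchaudio` to read a file, downmix to mono, and resample to the 16 kHz the model expects. The preprocessing details here are our assumption; `model.load_audio` is the reference path.

```python
import torch
import torchaudio

# Assumed manual preprocessing; model.load_audio does the equivalent for you.
wav, sr = torchaudio.load("path/to/audio")  # (channels, wav_len)
wav = wav.mean(dim=0, keepdim=True)         # downmix to mono -> (1, wav_len)
if sr != 16_000:
    wav = torchaudio.functional.resample(wav, sr, 16_000)
wav = wav.cuda()                            # match the model's device

with torch.no_grad():
    results = model(wav)
```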

See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model.
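
For downstream tasks, benchmarks such as SUPERB typically probe a learned weighted sum over all encoder layers rather than the final output alone. A minimal sketch of that pattern, reusing `results["hidden_states"]` from above (the pooling module is illustrative, not part of USAD):

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Learnable convex combination of per-layer encoder outputs."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):  # list of (batch, seq_len, dim) tensors
        stacked = torch.stack(hidden_states, dim=0)           # (layers, batch, seq, dim)
        w = torch.softmax(self.weights, dim=0)                # (layers,)
        return (w[:, None, None, None] * stacked).sum(dim=0)  # (batch, seq, dim)

pool = LayerWeightedSum(len(results["hidden_states"])).cuda()
features = pool(results["hidden_states"])  # (batch_size, seq_len, encoder_dim)
```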

---

## πŸ“– Citation

```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```

---

## πŸ™ Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.