---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- agkphysics/AudioSet
language:
- en
---
# USAD: Universal Speech and Audio Representation via Distillation

**Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers.
Trained on 126k hours of mixed speech, sound, and music data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

[πŸ‘€ **Read Full Paper**](https://arxiv.org/abs/2506.18843)

---

## πŸ—‚οΈ Models

All USAD models are Transformer encoders operating at a **50 Hz frame rate**, i.e., one output frame per 320 samples of 16 kHz audio (see the sketch below the table). The teacher models are **WavLM Base+** (speech) and **ATST Frame** (general audio).

| Model      | Parameters | Hidden Dim | Layers | Checkpoint                                        |
| ---------- | ---------- | ---------- | ------ | ------------------------------------------------- |
| USAD Small | 24M        | 384        | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
| USAD Base  | 94M        | 768        | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
| USAD Large | 330M       | 1024       | 24     | [link](https://huggingface.co/MIT-SLS/USAD-Large) |
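
Since the frame rate is fixed at 50 Hz, the output sequence length is easy to predict from the input duration. A minimal sanity-check sketch (the exact count may differ by a frame or two depending on the model's internal padding):

```python
# Rough frame-count estimate for a 50 Hz encoder on 16 kHz audio.
SAMPLE_RATE = 16_000
FRAME_RATE = 50
HOP = SAMPLE_RATE // FRAME_RATE  # 320 samples per frame (20 ms)

def expected_frames(num_samples: int) -> int:
    """Approximate number of output frames for a waveform of num_samples."""
    return num_samples // HOP

print(expected_frames(10 * SAMPLE_RATE))  # ~500 frames for 10 s of audio
```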

---


## πŸš€ How To Use

**Installation**
```bash
pip install -U transformers
```

**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16 kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model.
# You can also load waveforms yourself, e.g. with torchaudio.load (see below).

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]:             final encoder output (batch_size, seq_len, encoder_dim)
# results["mel"]:           Mel filterbank features (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]: list of per-layer outputs, each (batch_size, seq_len, encoder_dim)
# results["ffn"]:           list of per-layer feed-forward outputs, each (batch_size, seq_len, encoder_dim)
```
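
If you would rather load and preprocess waveforms yourself (as the comment above suggests), the sketch below uses `torchaudio` to read a file, downmix to mono, and resample to the 16 kHz the model expects. The preprocessing details here are our assumption; `model.load_audio` is the reference path.

```python
import torch
import torchaudio

# Assumed manual preprocessing; model.load_audio does the equivalent for you.
wav, sr = torchaudio.load("path/to/audio")  # (channels, wav_len)
wav = wav.mean(dim=0, keepdim=True)         # downmix to mono -> (1, wav_len)
if sr != 16_000:
    wav = torchaudio.functional.resample(wav, sr, 16_000)
wav = wav.cuda()                            # match the model's device

with torch.no_grad():
    results = model(wav)
```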

See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model.
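
For downstream tasks, benchmarks such as SUPERB typically probe a learned weighted sum over all encoder layers rather than the final output alone. A minimal sketch of that pattern, reusing `results["hidden_states"]` from above (the pooling module is illustrative, not part of USAD):

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Learnable convex combination of per-layer encoder outputs."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):  # list of (batch, seq_len, dim) tensors
        stacked = torch.stack(hidden_states, dim=0)           # (layers, batch, seq, dim)
        w = torch.softmax(self.weights, dim=0)                # (layers,)
        return (w[:, None, None, None] * stacked).sum(dim=0)  # (batch, seq, dim)

pool = LayerWeightedSum(len(results["hidden_states"])).cuda()
features = pool(results["hidden_states"])  # (batch_size, seq_len, encoder_dim)
```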

---

## πŸ“– Citation

```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```

---

## πŸ™ Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.