# Speech Self-Supervised Learning This directory contains example scripts to self-supervised speech models. There are two main types of supported self-supervised learning methods: - [Wav2vec-BERT](https://arxiv.org/abs/2108.06209): `speech_pre_training.py` - [NEST](https://arxiv.org/abs/2408.13106): `masked_token_pred_pretrain.py` - For downstream tasks that use NEST as multi-layer feature extractor, please refer to `./downstream/speech_classification_mfa_train.py` - For extracting multi-layer features from NEST, please refer to `/scripts/ssl/extract_features.py` - For using NEST as weight initialization for downstream tasks, please refer to the usage of [maybe_init_from_pretrained_checkpoint](https://github.com/NVIDIA/NeMo/blob/main/nemo/core/classes/modelPT.py#L1242). For their corresponding usage, please refer to the example yaml config: - Wav2vec-BERT: `examples/asr/conf/ssl/fastconformer/fast-conformer.yaml` - NEST: `examples/asr/conf/ssl/nest/nest_fast-conformer.yaml` The dataset format follows that of ASR models, but no groundtruth transcriptions are needed. For example, the jsonl file specified in `manifest_filepath` should look like: ``` {"audio_filepath": "path/to/audio1.wav", "duration": 10.0, "text": ""} {"audio_filepath": "path/to/audio2.wav", "duration": 5.0, "text": ""} ``` Please refer to the [ASR dataset documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#preparing-custom-asr-data%60) for more details. For most efficient data loading, please refer to - [lhotse dataloading](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#lhotse-dataloading) - [pre-compute bucket durations](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#pre-computing-bucket-duration-bins) - [optimizing GPU memory usage](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#pushing-gpu-utilization-to-the-limits-with-bucketing-and-oomptimizer)