# NeMo SpeechLM Examples
This folder contains examples of using NeMo for training and fine-tuning speech language models. The currently supported models are those that concatenate audio features with text embeddings and pass them through a GPT decoder, such as:
- [SALM](https://arxiv.org/abs/2310.09424)
- [VoiceTextBlender](https://arxiv.org/abs/2410.17485)
- [Qwen-Audio](https://arxiv.org/abs/2311.07919)
Please run the scripts in the latest NeMo framework container.
## Data Preparation
Two data formats are supported: **single-turn question answering** and **multi-turn multi-modal conversation**.
Below are examples of the JSONL manifests used in NeMo. Note that each line in a manifest file must be a valid JSON dictionary; the examples here are spread over multiple lines only for readability.
### Single Turn Question Answering
You'll need to prepare data in the NeMo manifest format (JSONL files), where each line is a JSON dictionary with the following keys, for example:
```
{
    "audio_filepath": "path/to/audio.wav",
    "offset": 0.0,  # offset of the audio to load, in seconds
    "duration": 10.0,  # duration of the audio in seconds; set to `null` to load the whole audio
    "context": "what is the transcription of the audio?",  # text prompt for the audio
    "answer": "the transcription of the audio"
}
```
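If you generate manifests programmatically, the sketch below writes such a single-turn manifest with Python's standard `json` module; the file name, audio path, prompt, and answer are placeholders.
```python
import json

# hypothetical example record; replace the path, prompt, and answer with your own data
records = [
    {
        "audio_filepath": "audio/sample_0001.wav",
        "offset": 0.0,
        "duration": 10.0,  # or None (serialized as JSON `null`) to load the whole audio
        "context": "what is the transcription of the audio?",
        "answer": "the transcription of the audio",
    },
]

# one JSON dictionary per line (JSONL), as required by the NeMo manifest format
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```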
For better dataloading efficiency, you can bundle individual audio files into tar shards using the script at `scripts/speech_recognition/convert_to_tarred_audio_dataset.py`.
### Multi-turn Multi-modal Conversation
For multi-turn multi-modal conversation, each line in the JSONL manifest should look like:
```
{
    "id": "convo_1",
    "conversations": [
        {"from": "User", "value": "Can you help summarize the following?", "type": "text"},
        {"from": "User", "value": "123.wav", "type": "audio", "duration": 5.73},
        {"from": "Assistant", "value": "I'm glad to assist you with your request. Here's a summary:", "type": "text"},
        {"from": "Assistant", "value": "Once upon a time..there was a racoon..end of story...", "type": "text"},
        {"from": "User", "value": "Can you further shorten it?", "type": "text"},
        {"from": "Assistant", "value": "Of course!", "type": "text"}
    ]
}
```
Here, each conversation is a list of turns, where each turn is a dictionary with:
- `value` key: the content of the turn, either a text string or a path to an audio file.
- `from` key: the speaker of the turn, either "User" or "Assistant".
- `type` key: the type of the turn, either "text" or "audio".
- `duration` key: the duration of the audio file in seconds; only needed for audio turns.
Similarly, you can tar conversation manifests using the script at `scripts/speech_llm/export_conversations_to_tar.py`.
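Since every line of a manifest must be a valid JSON dictionary, a small sanity check along the lines of the sketch below can catch formatting problems early; the required keys follow the field descriptions above, and the manifest file name is a placeholder.
```python
import json

REQUIRED_TURN_KEYS = {"from", "value", "type"}  # audio turns additionally require `duration`

# hypothetical manifest path; each line should be a conversation entry as shown above
with open("conversation_manifest.json", "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        entry = json.loads(line)  # raises json.JSONDecodeError if the line is not valid JSON
        assert "conversations" in entry, f"line {line_no}: missing 'conversations'"
        for turn in entry["conversations"]:
            missing = REQUIRED_TURN_KEYS - turn.keys()
            assert not missing, f"line {line_no}: turn is missing keys {missing}"
            if turn["type"] == "audio":
                assert "duration" in turn, f"line {line_no}: audio turn is missing 'duration'"
```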
### Creating an Input Config for the Lhotse Dataloader
You can create an input config YAML file (e.g., `input_cfg.yaml`) that mixes data formats, for example:
```
- input_cfg:
    - manifest_filepath: /path/to/multi-modal/manifest.json
      type: multimodal_conversation
      # a special string marking the audio positions in the combined prompt; any special
      # string works, as long as it does not appear in your regular context
      audio_locator: "<audio_locator>"
    - manifest_filepath: /path/to/single-turn/manifest.json
      tags:
        # default context to use if the `context` field is not found in the manifest
        default_context: Transcribe the audio into English text without punctuation and capitalization
      type: nemo
    - manifest_filepath: /path/to/single-turn/sharded_manifest/manifest__OP_1..128_CL_.json
      tarred_audio_filepath: /path/to/single-turn/audio__OP_1..128_CL_.tar
      tags:
        # default context to use if the `context` field is not found in the manifest
        default_context: Transcribe the audio into English text without punctuation and capitalization
      # only tarred single-turn data needs the `nemo_tarred` type; tarred multi-modal
      # conversation data still uses `multimodal_conversation`
      type: nemo_tarred
  type: group
```
To learn more about the dataloader configuration, please refer to the [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#enabling-lhotse-via-configuration).
## Training
An example training command using the Lhotse dataloader is shown below; placeholder variables such as `$INPUT_CFG`, `$NUM_WORKERS`, `$GLOBAL_BATCH`, `$MICRO_BATCH`, `$TP`, `$CP`, and `$GRAD_ACCUMULATION` should be set to your own values:
```bash
CONFIG_PATH="<NeMo Root>/examples/speechlm/conf/salm"
CONFIG_NAME=salm_llama3.2-1b_fc_linear_peft

export WANDB_API_KEY="xxxx"
export CUDA_VISIBLE_DEVICES="0,1"
export HF_TOKEN="xxxxxxx"
export HF_HOME="/home/xxx/.huggingface/"
export HF_HUB_CACHE="/tmp/hf_cache"
export NEMO_MODELS_CACHE="/tmp/megatron_dist_ckpts"  # where to store the base LLM's distributed checkpoints

# Notes on the overrides below:
# - manifest_filepath is set to null because input_cfg is used instead
# - validation metric `loss` skips LLM decoding during validation for faster validation
# - trainer.val_check_interval should be set to the same value as limit_train_batches
python speech_to_text_llm_train.py \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    data.common.add_boa_eoa=true \
    data.train_ds.manifest_filepath=null \
    data.validation_ds.manifest_filepath=null \
    ++data.train_ds.input_cfg=$INPUT_CFG \
    ++data.validation_ds.input_cfg=$INPUT_CFG \
    data.train_ds.num_workers=$NUM_WORKERS \
    data.validation_ds.num_workers=$NUM_WORKERS \
    data.common.global_batch_size=$GLOBAL_BATCH \
    data.common.micro_batch_size=$MICRO_BATCH \
    data.common.prompt_format='llama3' \
    strategy.tensor_model_parallel_size=$TP \
    strategy.context_parallel_size=$CP \
    ++data.train_ds.batch_size=$MICRO_BATCH \
    ++data.train_ds.defer_setup=true \
    ++data.train_ds.use_lhotse=true \
    ++data.train_ds.is_tarred=false \
    ++data.train_ds.use_bucketing=false \
    ++data.train_ds.batch_duration=null \
    ++data.train_ds.quadratic_duration=null \
    ++data.train_ds.bucket_duration_bins=null \
    ++data.train_ds.shuffle=false \
    ++data.train_ds.shuffle_buffer_size=10000 \
    ++data.train_ds.seed=10 \
    ++data.train_ds.shard_seed="randomized" \
    ++data.train_ds.force_iterable_dataset=true \
    ++data.validation_ds.batch_size=$MICRO_BATCH \
    ++data.validation_ds.defer_setup=true \
    ++data.validation_ds.use_lhotse=true \
    ++data.validation_ds.is_tarred=false \
    ++data.validation_ds.use_bucketing=false \
    ++data.validation_ds.batch_duration=null \
    ++data.validation_ds.quadratic_duration=null \
    ++data.validation_ds.bucket_duration_bins=null \
    ++data.validation_ds.shuffle_buffer_size=10000 \
    ++data.validation_ds.seed=10 \
    ++data.validation_ds.shard_seed="randomized" \
    ++data.validation_ds.shuffle=false \
    ++data.validation_ds.metric.name='loss' \
    ++model.data.validation_ds.force_finite=true \
    ++model.data.validation_ds.force_map_dataset=true \
    ++trainer.use_distributed_sampler=false \
    ++trainer.limit_train_batches=2000 \
    trainer.val_check_interval=2000 \
    trainer.devices=-1 \
    trainer.max_steps=1000000 \
    trainer.accumulate_grad_batches=$GRAD_ACCUMULATION \
    name="${CONFIG_NAME}_run1" \
    strategy.ckpt_async_save=false \
    max_time_per_run="00:00:30:00"  # automatically stop the job after 30 minutes
```
## Inference
For running inference, we use the same script as for validation, which expects a ground-truth answer in the manifest; for inference-only data, fill in a dummy ground-truth answer (see the sketch after the command below). An example inference/evaluation command is:
```bash
CONFIG_PATH="<NeMo Root>/examples/speechlm/conf/salm"
CONFIG_NAME=salm_llama3.2-1b_fc_linear_peft

export WANDB_API_KEY="xxxx"
export CUDA_VISIBLE_DEVICES="0,1"
export HF_TOKEN="xxxxxxx"
export HF_HOME="/home/xxx/.huggingface/"
export HF_HUB_CACHE="/tmp/hf_cache"
export NEMO_MODELS_CACHE="/tmp/megatron_dist_ckpts"

# Notes on the overrides below:
# - `~data.train_ds` removes the training data config
# - prompt_format must match the value used during training
# - metric.name='bleu' enables LLM decoding into text for evaluation
# - resume.resume_from_path ($CKPT_PATH) is the path to the checkpoint to load
# - output_dir ($OUTPUT_DIR) is the directory where predictions are saved
# - set model.inference_config.greedy=true to use greedy decoding instead of sampling
python speech_to_text_llm_validate.py \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    ~data.train_ds \
    data.common.add_boa_eoa=true \
    data.common.global_batch_size=$GLOBAL_BATCH \
    data.common.micro_batch_size=$MICRO_BATCH \
    data.common.prompt_format='llama3' \
    data.validation_ds.metric.name='bleu' \
    data.validation_ds.manifest_filepath=null \
    ++data.validation_ds.input_cfg=$INPUT_CFG \
    data.validation_ds.num_workers=$NUM_WORKERS \
    ++data.validation_ds.batch_size=$MICRO_BATCH \
    ++data.validation_ds.defer_setup=true \
    ++data.validation_ds.use_lhotse=true \
    ++data.validation_ds.use_bucketing=false \
    ++data.validation_ds.batch_duration=null \
    ++data.validation_ds.quadratic_duration=null \
    ++data.validation_ds.bucket_duration_bins=null \
    ++data.validation_ds.shuffle=false \
    ++model.data.validation_ds.force_finite=true \
    ++model.data.validation_ds.force_map_dataset=true \
    ++trainer.use_distributed_sampler=false \
    ++resume.resume_from_path=$CKPT_PATH \
    ++data.validation_ds.write_predictions_to_file=true \
    ++data.validation_ds.output_dir=$OUTPUT_DIR \
    name="${CONFIG_NAME}_run1_eval" \
    trainer.devices=1 \
    data.common.tokens_to_generate=256 \
    ++model.inference_config.tokens_to_generate=256 \
    ++model.inference_config.temperature=1.0 \
    ++model.inference_config.top_k=50 \
    ++model.inference_config.top_p=0.95 \
    ++model.inference_config.greedy=false \
    ++model.inference_config.repetition_penalty=1.0 \
    ~logger.wandb  # remove wandb logger
```
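If your evaluation manifest does not contain ground-truth answers, a minimal sketch for filling in a dummy `answer` field (assuming the single-turn manifest format above; the file names and placeholder text are arbitrary) is:
```python
import json

# hypothetical input/output manifest paths
with open("test_manifest.json", "r", encoding="utf-8") as fin, \
     open("test_manifest_with_dummy.json", "w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        record.setdefault("answer", "placeholder")  # dummy ground-truth answer for inference-only runs
        fout.write(json.dumps(record) + "\n")
```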
## Notes
- If you want to drop PEFT, simply add `~model.peft` to the command line arguments.
- If you want to freeze or fine-tune individual components of the model, you can set `model.freeze_language_model`, `model.freeze_speech_model`, and `model.freeze_modality_adapter` to `true` or `false` in the command line arguments.
- If you want to use other LLMs that are not in the example configs, you can look for them in `nemo/collections/llm/gpt/model`, set the corresponding `model.llm._target_` and `model.llm.config._target_`, then look up their pretrained weights on Hugging Face and set `model.llm.pretrained_model` to the corresponding model name.
- If you want to use the Whisper encoder, please note that the current SpeechLM implementation uses the native Hugging Face Whisper model, which pads or trims audio to a fixed 30-second duration.