# NeMo SpeechLM Examples

This folder contains examples of using NeMo for training and fine-tuning speech language models. The currently supported models are those that concatenate audio features with text embeddings and pass them through a GPT decoder, such as:
- SALM (https://arxiv.org/abs/2310.09424)
- VoiceTextBlender (https://arxiv.org/abs/2410.17485)
- Qwen-Audio (https://arxiv.org/abs/2311.07919)

Please run the scripts in the latest NeMo framework container.

## Data Preparation

Two data formats are supported: **single-turn question answering** and **multi-turn multi-modal conversation**. Below are examples of the jsonl manifests used in NeMo. Note that each line in a manifest file must be a valid JSON dictionary; the examples here are spread over multiple lines only for readability.

### Single Turn Question Answering

You'll need to prepare data in the NeMo manifest format (jsonl files), where each line is a Python dictionary with the following keys, for example:
```
{
    "audio_filepath": "path/to/audio.wav",
    "offset": 0.0,  # offset of the audio to load in seconds
    "duration": 10.0,  # duration of the audio in seconds, can set to `null` to load the whole audio
    "context": "what is the transcription of the audio?",  # text prompt for the audio
    "answer": "the transcription of the audio",
}
```

For better dataloading efficiency, you can tar individual audio files into tar files by following the script in `scripts/speech_recognition/convert_to_tarred_audio_dataset.py`.

### Multi-turn Multi-modal Conversation

For multi-turn multi-modal conversation, each line in the jsonl manifest should look like:
```
{
    "id": "convo_1",
    "conversations": [
        {"from": "User", "value": "Can you help summarize the following?", "type": "text"},
        {"from": "User", "value": "123.wav", "type": "audio", "duration": 5.73},
        {"from": "Assistant", "value": "I'm glad to assist you with your request. Here's a summary:", "type": "text"},
        {"from": "Assistant", "value": "Once upon a time..there was a racoon..end of story...", "type": "text"},
        {"from": "User", "value": "Can you further shorten it?", "type": "text"},
        {"from": "Assistant", "value": "Of course!", "type": "text"}
    ]
}
```

Here, each conversation is a list of turns, where each turn is a dictionary with:
- `value` key: the content of the turn, either a text string or a path to an audio file.
- `from` key: the speaker of the turn, either "User" or "Assistant".
- `type` key: the type of the turn, either "text" or "audio".
- `duration` key: the duration of the audio file in seconds, only needed for audio-type turns.

Similarly, you can tar them by using the script in `scripts/speech_llm/export_conversations_to_tar.py`.
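For reference, both manifest types are plain jsonl files with one JSON object per line. Below is a minimal sketch of how such files could be created from the shell; in practice you would typically generate them with your own preprocessing scripts, and the file names, audio paths, prompts, and durations here are placeholders only:

```bash
# Single-turn QA manifest: one JSON dict per line (placeholder values).
cat > single_turn_manifest.json << 'EOF'
{"audio_filepath": "path/to/audio1.wav", "offset": 0.0, "duration": 10.0, "context": "what is the transcription of the audio?", "answer": "the transcription of the audio"}
{"audio_filepath": "path/to/audio2.wav", "offset": 0.0, "duration": null, "context": "what is the transcription of the audio?", "answer": "another transcription"}
EOF

# Multi-turn multi-modal conversation manifest: each line holds one full conversation.
cat > multimodal_conversation_manifest.json << 'EOF'
{"id": "convo_1", "conversations": [{"from": "User", "value": "Can you help summarize the following?", "type": "text"}, {"from": "User", "value": "123.wav", "type": "audio", "duration": 5.73}, {"from": "Assistant", "value": "Sure, here is a short summary.", "type": "text"}]}
EOF
```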
### Creating Input Config for Lhotse dataloader

You can create an input config yaml file (e.g., `input_cfg.yaml`) that mixes data formats, for example:
```
- input_cfg:
    - manifest_filepath: /path/to/multi-modal/manifest.json
      type: multimodal_conversation
      # a special string to indicate the audio positions in the combined prompt; you can use an arbitrary
      # special string, but make sure it doesn't appear in your regular context
      audio_locator: ""
    - manifest_filepath: /path/to/single-turn/manifest.json
      # you can specify the default context to use if the context field isn't found in the manifest
      tags:
        default_context: Transcribe the audio into English text without punctuation and capitalization
      type: nemo
    - manifest_filepath: /path/to/single-turn/sharded_manifest/manifest__OP_1..128_CL_.json
      tarred_audio_filepath: /path/to/single-turn/audio__OP_1..128_CL_.tar
      # you can specify the default context to use if the context field isn't found in the manifest
      tags:
        default_context: Transcribe the audio into English text without punctuation and capitalization
      # only tarred single-turn data needs the `nemo_tarred` type, while `multimodal_conversation` is
      # used for tarred multi-modal conversation data as well
      type: nemo_tarred
  type: group
```

To learn more about the dataloader configuration, please refer to the [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#enabling-lhotse-via-configuration).
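The training and inference commands below reference several shell variables (e.g., `$INPUT_CFG`, `$GLOBAL_BATCH`) without defining them. The values in the sketch below are purely illustrative placeholders; set them to match your own data and hardware:

```bash
# Illustrative placeholder values; adjust to your own setup.
INPUT_CFG=/path/to/input_cfg.yaml       # the input config file created above
NUM_WORKERS=4                           # dataloader workers per process
GLOBAL_BATCH=32                         # global batch size
MICRO_BATCH=4                           # micro batch size per GPU
TP=1                                    # tensor model parallel size
CP=1                                    # context parallel size
GRAD_ACCUMULATION=1                     # gradient accumulation steps
CKPT_PATH=/path/to/trained/checkpoint   # used by the inference example
OUTPUT_DIR=/path/to/output_dir          # where inference predictions are written
```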
## Training

An example training script using the Lhotse dataloader is:

```bash
CONFIG_PATH="/examples/speechlm/conf/salm"
CONFIG_NAME=salm_llama3.2-1b_fc_linear_peft

# Notes on the overrides below:
# - NEMO_MODELS_CACHE is where the base LLM's distributed checkpoints are stored.
# - data.*.manifest_filepath is set to null because input_cfg is used instead.
# - data.validation_ds.metric.name='loss' only calculates the validation loss (no LLM decoding) for faster validation.
# - trainer.val_check_interval should be set to the same value as trainer.limit_train_batches.
export WANDB_API_KEY="xxxx" && \
export CUDA_VISIBLE_DEVICES="0,1" && \
export HF_TOKEN="xxxxxxx" && \
export HF_HOME="/home/xxx/.huggingface/" && \
export HF_HUB_CACHE="/tmp/hf_cache" && \
export NEMO_MODELS_CACHE="/tmp/megatron_dist_ckpts" && \
python speech_to_text_llm_train.py \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    data.common.add_boa_eoa=true \
    data.train_ds.manifest_filepath=null \
    data.validation_ds.manifest_filepath=null \
    ++data.train_ds.input_cfg=$INPUT_CFG \
    ++data.validation_ds.input_cfg=$INPUT_CFG \
    data.train_ds.num_workers=$NUM_WORKERS \
    data.validation_ds.num_workers=$NUM_WORKERS \
    data.common.global_batch_size=$GLOBAL_BATCH \
    data.common.micro_batch_size=$MICRO_BATCH \
    data.common.prompt_format='llama3' \
    strategy.tensor_model_parallel_size=$TP \
    strategy.context_parallel_size=$CP \
    ++data.train_ds.batch_size=$MICRO_BATCH \
    ++data.train_ds.defer_setup=true \
    ++data.train_ds.use_lhotse=true \
    ++data.train_ds.is_tarred=false \
    ++data.train_ds.use_bucketing=false \
    ++data.train_ds.batch_duration=null \
    ++data.train_ds.quadratic_duration=null \
    ++data.train_ds.bucket_duration_bins=null \
    ++data.train_ds.shuffle=false \
    ++data.train_ds.shuffle_buffer_size=10000 \
    ++data.train_ds.seed=10 \
    ++data.train_ds.shard_seed="randomized" \
    ++data.train_ds.force_iterable_dataset=true \
    ++data.validation_ds.batch_size=$MICRO_BATCH \
    ++data.validation_ds.defer_setup=true \
    ++data.validation_ds.use_lhotse=true \
    ++data.validation_ds.is_tarred=false \
    ++data.validation_ds.use_bucketing=false \
    ++data.validation_ds.batch_duration=null \
    ++data.validation_ds.quadratic_duration=null \
    ++data.validation_ds.bucket_duration_bins=null \
    ++data.validation_ds.shuffle_buffer_size=10000 \
    ++data.validation_ds.seed=10 \
    ++data.validation_ds.shard_seed="randomized" \
    ++data.validation_ds.shuffle=false \
    ++data.validation_ds.metric.name='loss' \
    ++model.data.validation_ds.force_finite=true \
    ++model.data.validation_ds.force_map_dataset=true \
    ++trainer.use_distributed_sampler=false \
    ++trainer.limit_train_batches=2000 \
    trainer.val_check_interval=2000 \
    trainer.devices=-1 \
    trainer.max_steps=1000000 \
    trainer.accumulate_grad_batches=$GRAD_ACCUMULATION \
    name="${CONFIG_NAME}_run1" \
    strategy.ckpt_async_save=false \
    max_time_per_run="00:00:30:00"  # automatically stop the job after 30 minutes
```

## Inference

For inference, we use the same script as for validation (which expects a ground-truth answer), so you need to set a dummy ground-truth answer in the manifest when running pure inference. An example inference/evaluation script is:

```bash
CONFIG_PATH="/examples/speechlm/conf/salm"
CONFIG_NAME=salm_llama3.2-1b_fc_linear_peft

# Notes on the overrides below:
# - ~data.train_ds removes the training data config.
# - data.common.prompt_format should be set to the same value as in training.
# - data.validation_ds.metric.name='bleu' enables LLM decoding into text for evaluation.
# - resume.resume_from_path is the path of the checkpoint to load.
# - data.validation_ds.output_dir is the directory where predictions are saved.
# - model.inference_config.greedy=true uses greedy decoding instead of sampling.
export WANDB_API_KEY="xxxx" && \
export CUDA_VISIBLE_DEVICES="0,1" && \
export HF_TOKEN="xxxxxxx" && \
export HF_HOME="/home/xxx/.huggingface/" && \
export HF_HUB_CACHE="/tmp/hf_cache" && \
export NEMO_MODELS_CACHE="/tmp/megatron_dist_ckpts" && \
python speech_to_text_llm_validate.py \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    ~data.train_ds \
    data.common.add_boa_eoa=true \
    data.common.global_batch_size=$GLOBAL_BATCH \
    data.common.micro_batch_size=$MICRO_BATCH \
    data.common.prompt_format='llama3' \
    data.validation_ds.metric.name='bleu' \
    data.validation_ds.manifest_filepath=null \
    ++data.validation_ds.input_cfg=$INPUT_CFG \
    data.validation_ds.num_workers=$NUM_WORKERS \
    ++data.validation_ds.batch_size=$MICRO_BATCH \
    ++data.validation_ds.defer_setup=true \
    ++data.validation_ds.use_lhotse=true \
    ++data.validation_ds.use_bucketing=false \
    ++data.validation_ds.batch_duration=null \
    ++data.validation_ds.quadratic_duration=null \
    ++data.validation_ds.bucket_duration_bins=null \
    ++data.validation_ds.shuffle=false \
    ++model.data.validation_ds.force_finite=true \
    ++model.data.validation_ds.force_map_dataset=true \
    ++trainer.use_distributed_sampler=false \
    ++resume.resume_from_path=$CKPT_PATH \
    ++data.validation_ds.write_predictions_to_file=true \
    ++data.validation_ds.output_dir=$OUTPUT_DIR \
    name="${CONFIG_NAME}_run1_eval" \
    trainer.devices=1 \
    data.common.tokens_to_generate=256 \
    ++model.inference_config.tokens_to_generate=256 \
    ++model.inference_config.temperature=1.0 \
    ++model.inference_config.top_k=50 \
    ++model.inference_config.top_p=0.95 \
    ++model.inference_config.greedy=false \
    ++model.inference_config.repetition_penalty=1.0 \
    ~logger.wandb  # remove the wandb logger
```

## Notes

- If you want to drop PEFT, simply add `~model.peft` to the command line arguments.
- If you want to freeze or finetune each of the model's components, you can set `model.freeze_language_model`, `model.freeze_speech_model` and `model.freeze_modality_adapter` to `true` or `false` in the command line arguments, as shown in the example after this list.
- If you want to use other LLMs that are not in the example configs, you can look for them in `nemo/collections/llm/gpt/model`, set the corresponding `model.llm._target_` and `model.llm.config._target_`, then look for their pretrained weights on Huggingface and set `model.llm.pretrained_model` to the corresponding model name.
- If you want to use the Whisper encoder, please note that the current implementation in SpeechLM uses the native Whisper model from Huggingface, which pads or trims audio to a fixed 30-second duration.
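For illustration, the overrides described in the notes above could be combined with the training command as in the sketch below; the chosen values are only an example (PEFT is dropped and only the modality adapter is trained), and the remaining arguments are the same as in the Training section:

```bash
# Sketch: drop PEFT and finetune only the modality adapter, keeping the
# LLM and the speech encoder frozen. Combine with the other arguments
# from the training example above.
python speech_to_text_llm_train.py \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    ~model.peft \
    model.freeze_language_model=true \
    model.freeze_speech_model=true \
    model.freeze_modality_adapter=false
```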