# NeMo SpeechLM Examples
This folder contains examples of using NeMo for training and fine-tuning speech language models. The currently supported models are those that concatenate audio features with text embeddings and pass them through a GPT decoder, such as:
- [SALM](https://arxiv.org/abs/2310.09424)
- [VoiceTextBlender](https://arxiv.org/abs/2410.17485)
- [Qwen-Audio](https://arxiv.org/abs/2311.07919)
Please run the scripts in the latest NeMo framework container.
## Data Preparation
Two data formats are supported: **single-turn question answering** and **multi-turn multi-modal conversation**.
Below are examples of the JSONL manifests used in NeMo. Note that each line in a manifest file must be a valid JSON dictionary; the examples here are spread over multiple lines only for readability.
### Single Turn Question Answering
You'll need to prepare data in the NeMo manifest format (JSONL files), where each line is a JSON dictionary with the following keys, for example:
```
{
    "audio_filepath": "path/to/audio.wav",
    "offset": 0.0,  # offset of the audio to load, in seconds
    "duration": 10.0,  # duration of the audio in seconds; set to `null` to load the whole audio
    "context": "what is the transcription of the audio?",  # text prompt for the audio
    "answer": "the transcription of the audio"
}
```
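If you generate manifests programmatically, the sketch below writes such a single-turn manifest with Python's standard `json` module; the file name, audio path, prompt, and answer are placeholders.
```python
import json

# hypothetical example record; replace the path, prompt, and answer with your own data
records = [
    {
        "audio_filepath": "audio/sample_0001.wav",
        "offset": 0.0,
        "duration": 10.0,  # or None (serialized as JSON `null`) to load the whole audio
        "context": "what is the transcription of the audio?",
        "answer": "the transcription of the audio",
    },
]

# one JSON dictionary per line (JSONL), as required by the NeMo manifest format
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```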
For better dataloading efficiency, you can bundle individual audio files into tar shards using the script at `scripts/speech_recognition/convert_to_tarred_audio_dataset.py`.
### Multi-turn Multi-modal Conversation
For multi-turn multi-modal conversation, each line in the JSONL manifest should look like:
```
{
    "id": "convo_1",
    "conversations": [
        {"from": "User", "value": "Can you help summarize the following?", "type": "text"},
        {"from": "User", "value": "123.wav", "type": "audio", "duration": 5.73},
        {"from": "Assistant", "value": "I'm glad to assist you with your request. Here's a summary:", "type": "text"},
        {"from": "Assistant", "value": "Once upon a time..there was a racoon..end of story...", "type": "text"},
        {"from": "User", "value": "Can you further shorten it?", "type": "text"},
        {"from": "Assistant", "value": "Of course!", "type": "text"}
    ]
}
```
Here, each conversation is a list of turns, where each turn is a dictionary with:
- `value` key: the content of the turn, either a text string or a path to an audio file.
- `from` key: the speaker of the turn, either "User" or "Assistant".
- `type` key: the type of the turn, either "text" or "audio".
- `duration` key: the duration of the audio file in seconds; only needed for audio turns.
Similarly, you can tar conversation manifests using the script at `scripts/speech_llm/export_conversations_to_tar.py`.
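Since every line of a manifest must be a valid JSON dictionary, a small sanity check along the lines of the sketch below can catch formatting problems early; the required keys follow the field descriptions above, and the manifest file name is a placeholder.
```python
import json

REQUIRED_TURN_KEYS = {"from", "value", "type"}  # audio turns additionally require `duration`

# hypothetical manifest path; each line should be a conversation entry as shown above
with open("conversation_manifest.json", "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        entry = json.loads(line)  # raises json.JSONDecodeError if the line is not valid JSON
        assert "conversations" in entry, f"line {line_no}: missing 'conversations'"
        for turn in entry["conversations"]:
            missing = REQUIRED_TURN_KEYS - turn.keys()
            assert not missing, f"line {line_no}: turn is missing keys {missing}"
            if turn["type"] == "audio":
                assert "duration" in turn, f"line {line_no}: audio turn is missing 'duration'"
```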
### Creating an Input Config for the Lhotse Dataloader
You can create an input config YAML file (e.g., `input_cfg.yaml`) that mixes data formats, for example:
```
- input_cfg:
    - manifest_filepath: /path/to/multi-modal/manifest.json
      type: multimodal_conversation
      # a special string marking the audio positions in the combined prompt; any special
      # string works, as long as it does not appear in your regular context
      audio_locator: "<audio_locator>"
    - manifest_filepath: /path/to/single-turn/manifest.json
      tags:
        # default context to use if the `context` field is not found in the manifest
        default_context: Transcribe the audio into English text without punctuation and capitalization
      type: nemo
    - manifest_filepath: /path/to/single-turn/sharded_manifest/manifest__OP_1..128_CL_.json
      tarred_audio_filepath: /path/to/single-turn/audio__OP_1..128_CL_.tar
      tags:
        # default context to use if the `context` field is not found in the manifest
        default_context: Transcribe the audio into English text without punctuation and capitalization
      # only tarred single-turn data needs the `nemo_tarred` type; tarred multi-modal
      # conversation data still uses `multimodal_conversation`
      type: nemo_tarred
  type: group
```
To learn more about the dataloader configuration, please refer to the [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#enabling-lhotse-via-configuration).
## Training
An example training command using the Lhotse dataloader is shown below; placeholder variables such as `$INPUT_CFG`, `$NUM_WORKERS`, `$GLOBAL_BATCH`, `$MICRO_BATCH`, `$TP`, `$CP`, and `$GRAD_ACCUMULATION` should be set to your own values:
```bash
CONFIG_PATH="<NeMo Root>/examples/speechlm/conf/salm"
CONFIG_NAME=salm_llama3.2-1b_fc_linear_peft

export WANDB_API_KEY="xxxx"
export CUDA_VISIBLE_DEVICES="0,1"
export HF_TOKEN="xxxxxxx"
export HF_HOME="/home/xxx/.huggingface/"
export HF_HUB_CACHE="/tmp/hf_cache"
export NEMO_MODELS_CACHE="/tmp/megatron_dist_ckpts"  # where to store the base LLM's distributed checkpoints

# Notes on the overrides below:
# - manifest_filepath is set to null because input_cfg is used instead
# - validation metric `loss` skips LLM decoding during validation for faster validation
# - trainer.val_check_interval should be set to the same value as limit_train_batches
python speech_to_text_llm_train.py \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    data.common.add_boa_eoa=true \
    data.train_ds.manifest_filepath=null \
    data.validation_ds.manifest_filepath=null \
    ++data.train_ds.input_cfg=$INPUT_CFG \
    ++data.validation_ds.input_cfg=$INPUT_CFG \
    data.train_ds.num_workers=$NUM_WORKERS \
    data.validation_ds.num_workers=$NUM_WORKERS \
    data.common.global_batch_size=$GLOBAL_BATCH \
    data.common.micro_batch_size=$MICRO_BATCH \
    data.common.prompt_format='llama3' \
    strategy.tensor_model_parallel_size=$TP \
    strategy.context_parallel_size=$CP \
    ++data.train_ds.batch_size=$MICRO_BATCH \
    ++data.train_ds.defer_setup=true \
    ++data.train_ds.use_lhotse=true \
    ++data.train_ds.is_tarred=false \
    ++data.train_ds.use_bucketing=false \
    ++data.train_ds.batch_duration=null \
    ++data.train_ds.quadratic_duration=null \
    ++data.train_ds.bucket_duration_bins=null \
    ++data.train_ds.shuffle=false \
    ++data.train_ds.shuffle_buffer_size=10000 \
    ++data.train_ds.seed=10 \
    ++data.train_ds.shard_seed="randomized" \
    ++data.train_ds.force_iterable_dataset=true \
    ++data.validation_ds.batch_size=$MICRO_BATCH \
    ++data.validation_ds.defer_setup=true \
    ++data.validation_ds.use_lhotse=true \
    ++data.validation_ds.is_tarred=false \
    ++data.validation_ds.use_bucketing=false \
    ++data.validation_ds.batch_duration=null \
    ++data.validation_ds.quadratic_duration=null \
    ++data.validation_ds.bucket_duration_bins=null \
    ++data.validation_ds.shuffle_buffer_size=10000 \
    ++data.validation_ds.seed=10 \
    ++data.validation_ds.shard_seed="randomized" \
    ++data.validation_ds.shuffle=false \
    ++data.validation_ds.metric.name='loss' \
    ++model.data.validation_ds.force_finite=true \
    ++model.data.validation_ds.force_map_dataset=true \
    ++trainer.use_distributed_sampler=false \
    ++trainer.limit_train_batches=2000 \
    trainer.val_check_interval=2000 \
    trainer.devices=-1 \
    trainer.max_steps=1000000 \
    trainer.accumulate_grad_batches=$GRAD_ACCUMULATION \
    name="${CONFIG_NAME}_run1" \
    strategy.ckpt_async_save=false \
    max_time_per_run="00:00:30:00"  # automatically stop the job after 30 minutes
```
## Inference
For running inference, we use the same script as for validation, which expects a ground-truth answer in the manifest; for inference-only data, fill in a dummy ground-truth answer (see the sketch after the command below). An example inference/evaluation command is:
```bash
CONFIG_PATH="<NeMo Root>/examples/speechlm/conf/salm"
CONFIG_NAME=salm_llama3.2-1b_fc_linear_peft

export WANDB_API_KEY="xxxx"
export CUDA_VISIBLE_DEVICES="0,1"
export HF_TOKEN="xxxxxxx"
export HF_HOME="/home/xxx/.huggingface/"
export HF_HUB_CACHE="/tmp/hf_cache"
export NEMO_MODELS_CACHE="/tmp/megatron_dist_ckpts"

# Notes on the overrides below:
# - `~data.train_ds` removes the training data config
# - prompt_format must match the value used during training
# - metric.name='bleu' enables LLM decoding into text for evaluation
# - resume.resume_from_path ($CKPT_PATH) is the path to the checkpoint to load
# - output_dir ($OUTPUT_DIR) is the directory where predictions are saved
# - set model.inference_config.greedy=true to use greedy decoding instead of sampling
python speech_to_text_llm_validate.py \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    ~data.train_ds \
    data.common.add_boa_eoa=true \
    data.common.global_batch_size=$GLOBAL_BATCH \
    data.common.micro_batch_size=$MICRO_BATCH \
    data.common.prompt_format='llama3' \
    data.validation_ds.metric.name='bleu' \
    data.validation_ds.manifest_filepath=null \
    ++data.validation_ds.input_cfg=$INPUT_CFG \
    data.validation_ds.num_workers=$NUM_WORKERS \
    ++data.validation_ds.batch_size=$MICRO_BATCH \
    ++data.validation_ds.defer_setup=true \
    ++data.validation_ds.use_lhotse=true \
    ++data.validation_ds.use_bucketing=false \
    ++data.validation_ds.batch_duration=null \
    ++data.validation_ds.quadratic_duration=null \
    ++data.validation_ds.bucket_duration_bins=null \
    ++data.validation_ds.shuffle=false \
    ++model.data.validation_ds.force_finite=true \
    ++model.data.validation_ds.force_map_dataset=true \
    ++trainer.use_distributed_sampler=false \
    ++resume.resume_from_path=$CKPT_PATH \
    ++data.validation_ds.write_predictions_to_file=true \
    ++data.validation_ds.output_dir=$OUTPUT_DIR \
    name="${CONFIG_NAME}_run1_eval" \
    trainer.devices=1 \
    data.common.tokens_to_generate=256 \
    ++model.inference_config.tokens_to_generate=256 \
    ++model.inference_config.temperature=1.0 \
    ++model.inference_config.top_k=50 \
    ++model.inference_config.top_p=0.95 \
    ++model.inference_config.greedy=false \
    ++model.inference_config.repetition_penalty=1.0 \
    ~logger.wandb  # remove wandb logger
```
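If your evaluation manifest does not contain ground-truth answers, a minimal sketch for filling in a dummy `answer` field (assuming the single-turn manifest format above; the file names and placeholder text are arbitrary) is:
```python
import json

# hypothetical input/output manifest paths
with open("test_manifest.json", "r", encoding="utf-8") as fin, \
     open("test_manifest_with_dummy.json", "w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        record.setdefault("answer", "placeholder")  # dummy ground-truth answer for inference-only runs
        fout.write(json.dumps(record) + "\n")
```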
## Notes
- If you want to drop PEFT, simply add `~model.peft` to the command line arguments.
- If you want to freeze or fine-tune individual components of the model, you can set `model.freeze_language_model`, `model.freeze_speech_model`, and `model.freeze_modality_adapter` to `true` or `false` in the command line arguments.
- If you want to use other LLMs that are not in the example configs, you can look for them in `nemo/collections/llm/gpt/model`, set the corresponding `model.llm._target_` and `model.llm.config._target_`, then look up their pretrained weights on Hugging Face and set `model.llm.pretrained_model` to the corresponding model name.
- If you want to use the Whisper encoder, please note that the current SpeechLM implementation uses the native Hugging Face Whisper model, which pads or trims audio to a fixed 30-second duration.