---
license: mit
language:
- zh
base_model:
- Qwen/Qwen2.5-14B-Instruct
pipeline_tag: audio-text-to-text
datasets:
- EastBrook/COIG-Kun-Aug-Audio
- ReopenAI/Zhihu-KOL-Aug-Audio
---
## Model Details

This is a speech-and-text-to-text multimodal model, built from the audio encoder of seamless-m4t-v2-large and the Qwen2.5-14B-Instruct text model.
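The overall data flow (audio encoder -> projection -> text LLM) can be sketched roughly as below. This is a toy illustration, not the released implementation: the module stand-ins, layer types, and dimensions are all hypothetical.

```python
import torch
import torch.nn as nn

class SpeechTextLM(nn.Module):
    """Toy sketch: a SeamlessM4T-v2-style audio encoder feeding a Qwen-style LLM."""
    def __init__(self, feat_dim=80, enc_dim=256, llm_dim=512, vocab=1000):
        super().__init__()
        self.audio_encoder = nn.GRU(feat_dim, enc_dim, batch_first=True)  # stand-in encoder
        self.projector = nn.Linear(enc_dim, llm_dim)  # maps audio states into the LLM embedding space
        self.embed = nn.Embedding(vocab, llm_dim)
        self.llm = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)  # stand-in LLM
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, audio_feats, text_ids):
        enc, _ = self.audio_encoder(audio_feats)     # (B, T_audio, enc_dim)
        audio_emb = self.projector(enc)              # (B, T_audio, llm_dim)
        text_emb = self.embed(text_ids)              # (B, T_text, llm_dim)
        x = torch.cat([audio_emb, text_emb], dim=1)  # prepend audio tokens to the text tokens
        return self.lm_head(self.llm(x))             # next-token logits over the full sequence

model = SpeechTextLM()
logits = model(torch.randn(1, 100, 80), torch.randint(0, 1000, (1, 5)))
print(logits.shape)  # torch.Size([1, 105, 1000])
```

The key point is that audio is turned into a (short) sequence of embeddings that the LLM consumes exactly like text embeddings.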
Training:

Stage 1: ASR training on ~6,000 hours of cleaned WenetSpeech Chinese data. Everything except the text model is trainable at this stage.

Stage 2: starting from questions in datasets such as chatgpt-corpus and moss-003-sft-data, Qwen2.5-72B-Instruct-GPTQ-Int4 first generates additional rounds of questions, then generates answers to those multi-turn questions; the questions are rendered into audio with CosyVoice. This produces roughly 620k multi-turn speech-input -> text-answer samples, which are used to train the speech-input -> text-answer QA task. Again, everything except the text model is trainable at this stage.

Part of the data: https://huggingface.co/datasets/ReopenAI/COIG-Kun-Aug-Audio
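The "everything except the text model is trainable" recipe from both stages can be sketched as follows. This is an illustrative sketch only, with small stand-in modules, not the actual training code; the attribute names are hypothetical.

```python
import torch.nn as nn

class AudioTextModel(nn.Module):
    """Stand-ins for the three components described above."""
    def __init__(self):
        super().__init__()
        self.audio_encoder = nn.Linear(80, 256)  # SeamlessM4T v2 encoder stand-in
        self.adapter = nn.Linear(256, 512)       # audio-to-LLM projection stand-in
        self.text_model = nn.Linear(512, 512)    # Qwen2.5-14B-Instruct stand-in

def freeze_text_model(model):
    """Freeze the text LLM; return the parameters that remain trainable."""
    for p in model.text_model.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

model = AudioTextModel()
trainable = freeze_text_model(model)  # pass only these to the optimizer
```

Only the encoder and adapter parameters end up in the optimizer, which is what preserves the text model's original capabilities.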
Advantages: the text model is kept fully frozen during training, so its original capabilities are preserved; and the seamless-m4t-v2-large encoder produces on average only 6-7 tokens per second of audio, far fewer than Whisper's ~50 tokens per second.
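A quick back-of-the-envelope comparison of the audio token budgets, using the rates stated above (the 6.5 tokens/s figure is just the midpoint of the stated 6-7 range):

```python
def audio_tokens(seconds, tokens_per_second):
    """Approximate number of encoder tokens for a clip of the given length."""
    return round(seconds * tokens_per_second)

minute = 60
seamless_tokens = audio_tokens(minute, 6.5)  # midpoint of the stated 6-7 tokens/s
whisper_tokens = audio_tokens(minute, 50)
print(seamless_tokens, whisper_tokens)  # 390 3000
```

For a one-minute clip this is roughly an 8x reduction in the context length consumed by audio, leaving more room for multi-turn text history.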
### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** Speech-and-text-to-text multimodal model
- **Language(s) (NLP):** Chinese
- **License:** MIT
- **Finetuned from model [optional]:** Qwen/Qwen2.5-14B-Instruct
## Uses
```python
import torch
import librosa
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "EastBrook/Qwen2.5-14B-SeamlessV2"
#model_path = "./"
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16).to("cuda")
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
print("model_path: ", model_path)

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "audio"},
            #{"type": "text", "text": "请详细介绍一下强化学习中的GRPO。"},
        ],
    },
]

# Preparation for inference: render the chat template as text
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Load the input audio at 16 kHz, the sample rate the encoder expects
audios = []
audio_paths = [
    "/mnt/diskhd/Backup/Dataset/WenetSpeech/audio/train/podcast/B00022/X0000005821_5113963_S01270.mp3",
]
for path in audio_paths:
    audio, sr = librosa.load(path, sr=16000)
    audios.append(audio)

inputs = processor(
    text=[text],
    images=None,
    videos=None,
    audios=audios,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Greedy decoding; strip the prompt tokens from the output before decoding
generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("output_text: ", output_text)
```
### Out-of-Scope Use

The model is trained mainly on Chinese audio; its English capability is comparatively weak.