---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---

# Phi-4-Audio

**Phi-4-Audio** is a streamlined, audio-only adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, optimized exclusively for audio-text interactions (e.g., automatic speech recognition).

By removing the vision processing components (the image encoder, the vision projection layers, and the associated processing logic), the model has a smaller footprint and lower memory usage while retaining the original model's audio understanding capabilities.

## Usage & Performance

This model is ideal for scenarios where audio is the sole input modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well suited for researchers who want to fine-tune the model for audio tasks without the overhead of unused vision parameters.

### Key Improvements

Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU:

* **Reduced Footprint:** Parameter count reduced by approximately **454 million** (see the parameter table below).
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains the base model's audio-understanding capabilities while running lighter.

## Uses

### Intended Use

* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation.
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.
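
For all of these tasks the chat format is the same; only the instruction text after the audio placeholder changes. The strings below are illustrative: the transcription prompt is the one used in the usage example later in this card, while the translation and summarization wordings are assumptions rather than official prompts.

```python
# Illustrative per-task instruction strings (only the ASR prompt is taken from
# the usage example below; the other wordings are assumptions).
task_prompts = {
    "asr": "Transcribe the audio clip into text.",
    "translation": "Translate the audio clip into English.",
    "summarization": "Summarize the audio clip in a few sentences.",
}

task = "asr"
prompt = f"<|user|><|audio|>{task_prompts[task]}<|end|><|assistant|>"
```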

### Out of Scope

- **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.

## How to Get Started

The model is fully compatible with the Hugging Face `transformers` library. You can use it exactly like the original model, except that image inputs are not supported.

```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen
from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


class StrippedVisionModule(nn.Module):
    """No-op placeholder that errors out if the vision path is ever called."""

    def forward(self, **kwargs):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    """Replace the vision submodules with no-op placeholders, in place."""
    passed_model = model

    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model

    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()

    try:
        torch.cuda.empty_cache()
    except Exception:
        pass

    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)

# Load an example audio clip
audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

# Build the chat prompt with the audio placeholder token
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

# Cap the number of generated tokens so long transcriptions are not truncated
generate_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]

print(f"{response=}")
```
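
The processor call above is made with `sampling_rate=16000`, so clips at other sample rates should be resampled (and, presumably, downmixed to mono) first. A minimal sketch, assuming `librosa` is used for resampling (the choice of library is not prescribed by this card):

```python
import numpy as np
import librosa

# soundfile's read() returns (frames, channels) for multi-channel files;
# downmix to mono and resample to the 16 kHz rate used in the processor call above.
if audio.ndim > 1:
    audio = audio.mean(axis=1)
if samplerate != 16000:
    audio = librosa.resample(
        np.asarray(audio, dtype=np.float32), orig_sr=samplerate, target_sr=16000
    )
    samplerate = 16000
```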

## Model Details

- Base Architecture: Phi-4 Multimodal
- Modifications:
  - Removed `embed_tokens_extend.image_embed`
  - Removed `audio_embed.down_proj_for_vision_speech`
  - Removed `audio_embed.up_proj_for_vision_speech`
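
A quick way to confirm these modifications on a loaded checkpoint is to scan the module tree for the affected names; after `strip_vision_inplace` they should either be absent or replaced by the no-op `StrippedVisionModule` shown above (exact module paths may vary between checkpoint revisions):

```python
# List any submodules whose names match the removed vision components.
vision_like = [
    (name, type(module).__name__)
    for name, module in model.named_modules()
    if "image_embed" in name or "vision_speech" in name
]
print(vision_like)
```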

## Comparisons

### Parameter Count

| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |
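
The totals above can be reproduced by counting the parameters of the loaded model (exact counts depend on the checkpoint revision):

```python
# Count all parameters of the loaded model and compare against the table above.
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```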

### Benchmark Results

Tested on NVIDIA RTX 5090, `torch.bfloat16`.

| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |
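
The exact benchmarking script behind these numbers is not included in this card, but peak memory and warm throughput can be approximated with standard PyTorch utilities; a rough sketch, reusing `model` and `inputs` from the usage example above:

```python
import time
import torch

def measure(model, inputs, max_new_tokens=128):
    """Return (peak VRAM in GB, generated tokens per second) for one warm run."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return peak_gb, new_tokens / elapsed
```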

## Citation

If you use this model, please cite the original Phi-4 Multimodal paper and acknowledge the modifications described above.

```bibtex
@article{abouelenin2025phi,
  title={Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs},
  author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
  journal={arXiv preprint arXiv:2503.01743},
  year={2025}
}
```