Phi-4-Audio

Phi-4-Audio is a lightweight adaptation of the Phi-4-multimodal-instruct model, optimized exclusively for audio-text interactions (e.g., Automatic Speech Recognition).

By surgically removing the vision processing components (the image encoder, vision projection layers, and associated processing logic), we have created a streamlined model that delivers lower memory usage while retaining the original model's powerful audio understanding capabilities.

Usage & Performance

This model is ideal for scenarios where audio processing is the sole modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well-suited for researchers aiming to fine-tune the model specifically for audio tasks without the overhead of unused vision parameters.

Key Improvements

Comparing Phi-4-Audio against the original Phi-4-multimodal-instruct on a single NVIDIA RTX 5090 GPU:

  • Reduced Footprint: Parameter count reduced by approximately 454 million.
  • Lower VRAM Usage: Peak inference memory usage reduced by ~10% (0.84 GB saved).
  • Same Audio Performance: Retains full audio-understanding capabilities while running lighter.

Uses

Intended Use

  • Automatic Speech Recognition (ASR): High-fidelity transcription of spoken audio.
  • Speech Translation: Direct speech-to-text translation.
  • Audio Summarization: Generating summaries from audio recordings.
  • Spoken Instruction Tuning: Fine-tuning on pure audio-text pairs.

Out of Scope

  • Image/Vision Tasks: This model cannot process images. Passing image inputs will fail or raise errors, since the vision encoders have been stripped (see the sketch below).
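
For illustration, this is roughly what the failure looks like once the vision stubs from the How to Get Started section are in place (a minimal sketch; it assumes the model and the strip_vision_inplace helper defined there):

# Assumes `model` was loaded and passed through strip_vision_inplace() as shown
# in "How to Get Started"; any call into the vision path hits the placeholder.
try:
    model.model.embed_tokens_extend.image_embed()
except ValueError as err:
    print(err)  # -> "Vision is not supported"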

How to Get Started

The model is fully compatible with the Hugging Face transformers library. Usage mirrors the original model, except that image inputs are not supported; the strip_vision_inplace helper below replaces any vision submodules instantiated at load time with lightweight stubs.

import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen
from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


class StrippedVisionModule(nn.Module):
    """Placeholder module that stands in for the removed vision components."""

    def __init__(self):
        super().__init__()

    def forward(self, **kwargs):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    """Replace any vision submodules instantiated at load time with stubs."""
    passed_model = model

    # Phi4MultimodalForCausalLM wraps the base model; unwrap it to reach the
    # multimodal embedding layer.
    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model

    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()

    # Release memory held by the replaced modules (no-op without CUDA).
    try:
        torch.cuda.empty_cache()
    except Exception:
        pass

    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
# Swap any vision submodules instantiated at load time for lightweight stubs.
strip_vision_inplace(model)


# Download a short test clip and decode it with soundfile.
audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

# Phi-4 chat format: <|audio|> marks where the clip is injected into the prompt.
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

# The model expects 16 kHz audio; resample first if your clip differs.
inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

# Cap generation length; raise max_new_tokens for longer clips.
generate_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]

print(f"{response=}")
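
The same pipeline covers the other intended tasks; only the instruction text changes. The wordings below are illustrative rather than canonical prompts from the original model card:

# Illustrative instruction variants; reuse the processor and generate calls
# above unchanged.
translation_prompt = (
    f"{user_prompt}<|audio|>Translate the audio clip into English text."
    f"{prompt_suffix}{assistant_prompt}"
)
summary_prompt = (
    f"{user_prompt}<|audio|>Summarize the content of the audio clip."
    f"{prompt_suffix}{assistant_prompt}"
)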

Model Details

  • Base Architecture: Phi-4 Multimodal
  • Modifications (see the sanity check after this list):
    • Removed embed_tokens_extend.image_embed
    • Removed audio_embed.down_proj_for_vision_speech
    • Removed audio_embed.up_proj_for_vision_speech
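
A quick way to confirm these substitutions on a loaded model (a sketch, assuming the model was loaded and stripped as in How to Get Started):

# After strip_vision_inplace(), each removed path should resolve to the
# StrippedVisionModule placeholder rather than real vision weights.
modules = dict(model.named_modules())
for path in (
    "model.embed_tokens_extend.image_embed",
    "model.embed_tokens_extend.audio_embed.down_proj_for_vision_speech",
    "model.embed_tokens_extend.audio_embed.up_proj_for_vision_speech",
):
    print(path, "->", type(modules[path]).__name__ if path in modules else "absent")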

Comparisons

Parameter Count

| Model                     | Total Parameters      | Reduction |
|---------------------------|-----------------------|-----------|
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | -         |
| Phi-4-Audio               | 4,289,848,960 (4.29B) | -454M     |
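
The audio-only count can be sanity-checked on a loaded model (a sketch, assuming the model was loaded and stripped as in How to Get Started; the exact total may vary slightly with how the shipped config instantiates modules):

# Total parameters after strip_vision_inplace(); this should land near the
# 4.29B figure in the table above.
total = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total:,}")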

Benchmark Results

Tested on NVIDIA RTX 5090, torch.bfloat16.

| Metric                 | Original Model  | Phi-4-Audio     | Delta    |
|------------------------|-----------------|-----------------|----------|
| Peak Memory (GB)       | 8.88 GB         | 8.04 GB         | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar  |
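
The figures above can be approximated with standard PyTorch utilities (a sketch; exact numbers vary with driver, transformers version, and generation settings):

import time

# Measure peak CUDA memory and rough decode throughput for one generation call,
# reusing `model`, `inputs`, and `device` from "How to Get Started".
torch.cuda.reset_peak_memory_stats(device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=64)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
print(f"~{new_tokens / elapsed:.1f} tokens/s, peak memory {peak_gb:.2f} GB")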

Citation

If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.

@article{abouelenin2025phi,
  title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
  author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
  journal={arXiv preprint arXiv:2503.01743},
  year={2025}
}