Llama-3.2-3B-Instruct-FlashHead

Optimized version of Llama-3.2-3B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, which reduces model size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs.

FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.

Quickstart

pip install flash-head
vllm serve embedl/Llama-3.2-3B-Instruct-FlashHead
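Once `vllm serve` is running, it exposes an OpenAI-compatible HTTP API. A minimal sketch of the chat-completions request body (the default base URL `http://localhost:8000/v1` is an assumption from vLLM's defaults, not stated in this card):

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions
# (assumes the default port used by `vllm serve`).
payload = {
    "model": "embedl/Llama-3.2-3B-Instruct-FlashHead",
    "messages": [{"role": "user", "content": "Write a haiku about coffee."}],
    "max_tokens": 128,
    "temperature": 0.0,
}
body = json.dumps(payload)
print(json.loads(body)["model"])
```

Any OpenAI-compatible client (for example the `openai` Python package pointed at the local base URL) can send this payload unchanged.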

Model Details

| Field | Value |
|---|---|
| Base Model | Llama-3.2-3B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License (Built with Llama). Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • vLLM Plugin Integration - compatible with vLLM (0.14.0+) via the flash-head plugin.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 54 | 1.0× |
| FlashHead (Embedl) | 58 | 1.07× |
| W4A16 baseline | 141 | 2.61× |
| FlashHead W4A16 (Embedl) | 177 | 3.28× |
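The speedup column can be reproduced from the tokens/sec figures; a quick check, using only the numbers from the table above:

```python
# Tokens/sec figures copied from the table (RTX 3500 Ada, batch size = 1).
rates = {
    "BF16 baseline": 54,
    "FlashHead (Embedl)": 58,
    "W4A16 baseline": 141,
    "FlashHead W4A16 (Embedl)": 177,
}
bf16 = rates["BF16 baseline"]
for name, tps in rates.items():
    print(f"{name}: {tps / bf16:.2f}x vs BF16")

# Speedup of quantized FlashHead over the quantized baseline:
print(f'{rates["FlashHead W4A16 (Embedl)"] / rates["W4A16 baseline"]:.2f}x')  # 1.26x
```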

FlashHead W4A16 improves end-to-end generation speed by 1.26× over the state-of-the-art W4A16 baseline while closely matching baseline accuracy.


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|
| Baseline | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| FlashHead | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |
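The parity claim can be made precise: across the five benchmarks, the largest deviation from the baseline is 0.01. A quick check over the scores above:

```python
# Benchmark scores copied from the accuracy table.
baseline = {"MMLU-Pro": 0.31, "IFEval": 0.57, "BBH": 0.57, "TruthfulQA": 0.57, "GSM8K": 0.77}
flashhead = {"MMLU-Pro": 0.31, "IFEval": 0.56, "BBH": 0.57, "TruthfulQA": 0.58, "GSM8K": 0.77}

# Largest absolute score difference across benchmarks.
max_delta = max(abs(flashhead[k] - baseline[k]) for k in baseline)
print(round(max_delta, 2))  # 0.01
```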

FlashHead closely matches baseline accuracy.


Installation

pip install flash-head

The flash-head vLLM plugin is required. It activates automatically at startup.


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).
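A back-of-the-envelope estimate shows why the full context can exhaust VRAM. The sketch below assumes Llama-3.2-3B's published architecture (28 layers, 8 KV heads via GQA, head dimension 128, BF16 cache); these figures come from the upstream model config, not from this card:

```python
# Assumed Llama-3.2-3B config values (from the upstream model, not this card).
layers, kv_heads, head_dim = 28, 8, 128
bytes_per_elem = 2            # BF16
max_model_len = 131072

# 2x for the K and V tensors per layer.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gib = per_token * max_model_len / 2**30
print(f"{per_token} bytes/token, ~{total_gib:.1f} GiB KV cache at full context")
```

Roughly 14 GiB for the KV cache alone, on top of the model weights, which is why lowering `max_model_len` is the usual fix on smaller GPUs.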

vLLM Inference

from vllm import LLM, SamplingParams

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)

Limitations

  • Requires vLLM >= 0.14.0
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Advanced mixed precision quantization
  • Hugging Face Transformers generation support
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

  • Enterprise & Commercial Inquiries: models@embedl.com
  • Technical Issues & Early Access: https://github.com/embedl/flash-head
  • More Information & Model Releases: https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: models@embedl.com
