# Llama-3.2-3B-Instruct-FlashHead
Optimized version of Llama-3.2-3B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

- FlashHead
- vLLM plugin via `flash-head`
FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.
## Quickstart

```bash
pip install flash-head
vllm serve embedl/Llama-3.2-3B-Instruct-FlashHead
```
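Once `vllm serve` is running, it exposes vLLM's OpenAI-compatible HTTP API. A minimal client sketch, assuming the default port (8000) and the standard `/v1/chat/completions` route:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completions payload for the served model."""
    return {
        "model": "embedl/Llama-3.2-3B-Instruct-FlashHead",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }

if __name__ == "__main__":
    # Requires a running `vllm serve` instance (default port 8000).
    payload = build_chat_request("Write a haiku about coffee.")
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
        print(body["choices"][0]["message"]["content"])
```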
## Model Details
| Field | Value |
|---|---|
| Base Model | Llama-3.2-3B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
## Optimizations

- FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
- vLLM Plugin Integration - compatible with vLLM (0.14.0+) via the `flash-head` plugin.
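FlashHead's internal design is not published here. Purely for intuition, the sketch below shows a generic low-rank factorization of a dense LM head (this is an illustrative stand-in, not Embedl's actual method): replacing the single hidden→vocab projection with a hidden→rank→vocab pair shrinks the head's parameter count by an order of magnitude. The hidden and vocabulary sizes are taken from the public Llama-3.2-3B config; the rank is an arbitrary illustrative choice.

```python
# Llama-3.2-3B LM-head dimensions (public config); RANK is an arbitrary
# illustrative choice, not Embedl's actual design parameter.
HIDDEN, VOCAB, RANK = 3072, 128256, 256

dense_params = HIDDEN * VOCAB                  # one hidden -> vocab matrix
lowrank_params = HIDDEN * RANK + RANK * VOCAB  # hidden -> rank -> vocab pair

print(f"dense head:    {dense_params:>11,} params")
print(f"low-rank head: {lowrank_params:>11,} params "
      f"({dense_params / lowrank_params:.1f}x smaller)")
```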
## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 54 | 1.0× |
| FlashHead (Embedl) | 58 | 1.07× |
| W4A16 baseline | 141 | 2.61× |
| FlashHead W4A16 (Embedl) | 177 | 3.28× |
FlashHead improves end-to-end generation speed by 1.26× over the quantized (W4A16) baseline, while maintaining full accuracy parity.
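The speedup column follows directly from the tokens/sec figures; a quick check of the arithmetic:

```python
# Tokens/sec figures from the table above (RTX 3500 Ada, batch size = 1).
rates = {
    "BF16 baseline": 54,
    "FlashHead": 58,
    "W4A16 baseline": 141,
    "FlashHead W4A16": 177,
}

bf16 = rates["BF16 baseline"]
for name, tps in rates.items():
    print(f"{name:16s} {tps:4d} tok/s  {tps / bf16:.2f}x vs BF16")

# The quoted 1.26x is FlashHead W4A16 relative to the quantized baseline:
ratio = rates["FlashHead W4A16"] / rates["W4A16 baseline"]
print(f"FlashHead W4A16 vs W4A16 baseline: {ratio:.2f}x")
```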
### Accuracy (Parity with Baseline)
| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|
| Baseline | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| FlashHead | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |
FlashHead closely matches baseline accuracy.
## Installation

```bash
pip install flash-head
```

The `flash-head` vLLM plugin is required; it activates automatically at startup.
## Usage Examples
> **Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
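To gauge whether the full context will fit, a back-of-envelope KV-cache estimate helps. The sketch below assumes the public Llama-3.2-3B config (28 layers, 8 KV heads via grouped-query attention, head dim 128) and a 16-bit KV cache; actual vLLM usage also includes weights and activation overhead on top of this.

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 28,    # Llama-3.2-3B config
                   num_kv_heads: int = 8,   # grouped-query attention
                   head_dim: int = 128,
                   bytes_per_elem: int = 2  # fp16/bf16 cache
                   ) -> int:
    """Rough KV-cache footprint: keys + values for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

print(f"max_model_len=131072: {kv_cache_bytes(131_072) / 2**30:.1f} GiB")  # 14.0 GiB
print(f"max_model_len=32768:  {kv_cache_bytes(32_768) / 2**30:.1f} GiB")   # 3.5 GiB
```

On a 12 GB card, for example, the estimate makes clear why the full 131k context cannot fit and a smaller `max_model_len` is needed.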
### vLLM Inference

```python
from vllm import LLM, SamplingParams

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
## Limitations
- Requires vLLM >= 0.14.0
- Currently optimized for NVIDIA RTX GPUs
## Roadmap
Planned improvements:
- Advanced mixed precision quantization
- Hugging Face Transformers generation support
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in Transformers and vLLM
- Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
- Broader model coverage (larger models, VLMs, VLAs)
## License
- Upstream: Meta Llama 3.2 License
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise & Commercial Inquiries: models@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/flash-head
- More Information & Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: models@embedl.com