Phi-4-mini-instruct: GGUF (iPhone-optimized)

GGUF quantizations of microsoft/Phi-4-mini-instruct, built and optimized for on-device inference on iPhone, iPad, and Apple Silicon Macs via llama.cpp or apps that wrap it (e.g. Haplo).

Built and quantized by jc-builds for the Haplo ecosystem. Original weights © Microsoft Corporation, redistributed under the upstream MIT License.

TL;DR

A 3.8B-parameter reasoning-focused model from Microsoft. It punches well above its weight on math, code, and structured-reasoning benchmarks, beating models 5–10× its size on MATH and GPQA (see the upstream model card for the benchmark tables). It offers 128k context with YaRN and is MIT-licensed, one of the most permissive licenses among comparable models. It is the best small-model pick if you want "thinking mode" without paying the latency tax of a 7B+ model.

Available quantizations

| File | Size | Bits/weight | Recommended use |
|---|---|---|---|
| Phi-4-mini-instruct-Q4_K_M.gguf | 2.3 GB | 4.8 | Default; best size/quality tradeoff for phone & laptop |
| Phi-4-mini-instruct-Q5_K_M.gguf | 2.6 GB | 5.7 | Slightly better reasoning, ~13% bigger; recommended for iPad / Mac |
| Phi-4-mini-instruct-Q8_0.gguf | 3.8 GB | 8.5 | Near-FP16 quality; only worth it on an Apple Silicon Mac |

Pick Q4_K_M for general use. Phi-4-mini's reasoning quality holds up well at Q4_K_M. Avoid Q3 quants for this model: reasoning quality degrades noticeably.
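As a rough cross-check of the Size and Bits/weight columns, file size can be approximated as parameters × bits per weight / 8. The sketch below assumes the 3.8B parameter count quoted in this card and decimal gigabytes. Real files mix tensor types and carry metadata, so the listed sizes will not match exactly.

```python
# Rough GGUF size estimate: parameters x bits-per-weight / 8 bytes.
# PARAMS is the "3.8B-parameter" figure from this card (an approximation);
# real files also carry metadata and mix tensor types, so expect deviation.
PARAMS = 3.8e9


def estimated_size_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Return the approximate file size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{name}: ~{estimated_size_gb(bpw):.1f} GB")
```

The same arithmetic gives a quick lower bound on the RAM needed just to hold the weights on a given device.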

Performance on Apple Silicon

Approximate decode throughput for single-batch greedy decoding at a 2048-token context, measured with llama-cli.

| Device | RAM | Q4_K_M tok/s | Notes |
|---|---|---|---|
| iPhone 15 Pro | 8 GB | ~18 tok/s | Smooth, but reasoning mode adds latency before the first token |
| iPhone 16 Pro | 8 GB | ~22 tok/s | Recommended phone target |
| iPad Pro M2 | 8 GB | ~38 tok/s | Snappy |
| MacBook Pro M3 | 16 GB | ~70 tok/s | Effectively instant |

These are reference numbers. Q5_K_M and Q8_0 are roughly 15% and 40% slower than Q4_K_M, respectively.
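To turn these figures into expected wait times, a small sketch follows. It assumes "15% slower" means a 15% reduction in tokens per second, which is one reading of the note above, and it ignores prompt-processing time. The function names are illustrative only.

```python
# Q4_K_M decode throughput from the table above (tokens per second).
Q4_TOK_S = {
    "iPhone 15 Pro": 18,
    "iPhone 16 Pro": 22,
    "iPad Pro M2": 38,
    "MacBook Pro M3": 70,
}

# Assumed interpretation: fractional throughput reduction relative to Q4_K_M.
SLOWDOWN = {"Q4_K_M": 0.0, "Q5_K_M": 0.15, "Q8_0": 0.40}


def estimated_tok_s(device: str, quant: str = "Q4_K_M") -> float:
    """Scale the Q4_K_M figure by the approximate slowdown for `quant`."""
    return Q4_TOK_S[device] * (1.0 - SLOWDOWN[quant])


def seconds_to_generate(n_tokens: int, device: str, quant: str = "Q4_K_M") -> float:
    """Decode time only; prompt processing is not included."""
    return n_tokens / estimated_tok_s(device, quant)


print(f"{seconds_to_generate(512, 'iPhone 15 Pro'):.0f} s for 512 tokens")
```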

How to use

1. Haplo (iPhone / iPad / Mac)

The model appears automatically in Haplo's model browser on Kuzco-1.1.0+ builds. The download URL for Q4_K_M is:

https://huggingface.co/jc-builds/Phi-4-mini-instruct-GGUF/resolve/main/Phi-4-mini-instruct-Q4_K_M.gguf
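The URLs for the other quantizations follow the same pattern. A minimal sketch, assuming the file naming shown in the table above and the `main` revision (the helper name is hypothetical):

```python
REPO_ID = "jc-builds/Phi-4-mini-instruct-GGUF"


def resolve_url(quant: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build the direct-download URL for one quantization, e.g. 'Q5_K_M'."""
    filename = f"Phi-4-mini-instruct-{quant}.gguf"
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


for quant in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(resolve_url(quant))
```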

2. llama.cpp (CLI)

huggingface-cli download jc-builds/Phi-4-mini-instruct-GGUF Phi-4-mini-instruct-Q4_K_M.gguf --local-dir .

./llama-cli \
  -m Phi-4-mini-instruct-Q4_K_M.gguf \
  -p "If x^2 + 3x - 10 = 0, find x." \
  -n 512 \
  --temp 0.0

For reasoning-heavy tasks set --temp 0.0 and let the model deterministically work through the problem.
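To check the model's answer to the sample prompt, the roots can be computed directly with the quadratic formula. This is only a verification aid and not part of the inference setup:

```python
import math


def real_roots(a: float, b: float, c: float) -> tuple:
    """Real roots of a*x^2 + b*x + c = 0 (empty tuple if there are none)."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return ()
    root = math.sqrt(disc)
    return ((-b + root) / (2 * a), (-b - root) / (2 * a))


# x^2 + 3x - 10 = 0 factors as (x - 2)(x + 5) = 0.
print(real_roots(1, 3, -10))  # (2.0, -5.0)
```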

3. Ollama

cat <<'EOF' > Modelfile
FROM ./Phi-4-mini-instruct-Q4_K_M.gguf
PARAMETER temperature 0.0
PARAMETER top_p 1.0
EOF
ollama create phi-4-mini -f Modelfile
ollama run phi-4-mini

Long context (128k via YaRN)

Phi-4-mini is trained at 64k context and can stretch to 128k tokens via YaRN extrapolation. Most clients pick up the scaling parameters automatically from the GGUF metadata. If yours doesn't, pass --rope-scaling yarn --rope-scale 2.0 to llama.cpp; the factor of 2.0 is the ratio of the 128k target to the 64k training length.
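The scale factor is simply the target context divided by the trained context. A minimal sketch, assuming "64k" and "128k" mean 65,536 and 131,072 tokens (the helper name is hypothetical):

```python
TRAINED_CTX = 64 * 1024   # assumed meaning of "64k" in this card
TARGET_CTX = 128 * 1024   # assumed meaning of "128k"


def yarn_args(target_ctx: int = TARGET_CTX, trained_ctx: int = TRAINED_CTX) -> list:
    """Build the llama.cpp RoPE-scaling flags for a given target context."""
    scale = target_ctx / trained_ctx
    return ["--rope-scaling", "yarn", "--rope-scale", f"{scale:.1f}"]


print(" ".join(yarn_args()))
```

The context window itself is set separately in llama.cpp with -c / --ctx-size, and a larger window needs correspondingly more memory for the KV cache.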

Sampling defaults

For reasoning tasks (math, code, structured output): temperature=0.0, top_p=1.0. For chat / general output: temperature=0.7, top_p=0.9.

The GGUF stores no defaults, so pass them explicitly per task.
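One way to keep the two presets in one place is a small lookup that emits llama-cli flags. This is a sketch; the preset names "reasoning" and "chat" are labels chosen here, not anything stored in the model file.

```python
# Values from the "Sampling defaults" section above.
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.0, "top_p": 1.0},
    "chat": {"temperature": 0.7, "top_p": 0.9},
}


def llama_cli_sampling_args(task: str) -> list:
    """Return llama-cli sampling flags for the given task preset."""
    preset = SAMPLING_PRESETS[task]
    return ["--temp", str(preset["temperature"]), "--top-p", str(preset["top_p"])]


print(" ".join(llama_cli_sampling_args("chat")))
```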

Chat template

Phi-4-mini uses the Phi-3 chat template. The template is stored in the GGUF metadata (tokenizer.chat_template), so llama.cpp's --chat-template flag isn't required.

<|system|>
{system}<|end|>
<|user|>
{user}<|end|>
<|assistant|>
{assistant}<|end|>
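If a client does not apply the embedded template, the prompt can be assembled by hand. The sketch below follows the layout shown above, including its line breaks. The exact whitespace in the embedded template may differ, so prefer the embedded template whenever the client supports it.

```python
from typing import Optional


def format_prompt(user: str, system: Optional[str] = None) -> str:
    """Format a single-turn prompt, ending where the assistant reply begins."""
    parts = []
    if system is not None:
        parts.append(f"<|system|>\n{system}<|end|>\n")
    parts.append(f"<|user|>\n{user}<|end|>\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)


print(format_prompt("If x^2 + 3x - 10 = 0, find x.", system="Answer concisely."))
```

Generation should stop when the model emits <|end|>.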

Quantization recipe

Built with llama.cpp at commit e43431b (May 7, 2026).

  1. Downloaded microsoft/Phi-4-mini-instruct safetensors checkpoint via huggingface-cli.
  2. Converted to GGUF FP16 via convert_hf_to_gguf.py --outtype f16 (Phi-4 reuses the Phi-3 architecture path in the converter).
  3. Quantized to each target type via llama-quantize:
    llama-quantize Phi-4-mini-F16.gguf Phi-4-mini-instruct-Q4_K_M.gguf Q4_K_M
    llama-quantize Phi-4-mini-F16.gguf Phi-4-mini-instruct-Q5_K_M.gguf Q5_K_M
    llama-quantize Phi-4-mini-F16.gguf Phi-4-mini-instruct-Q8_0.gguf   Q8_0
    

No imatrix calibration was used; each quant was produced directly from the FP16 conversion of the upstream weights.
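The three llama-quantize invocations in step 3 follow one pattern, so they can be generated rather than typed. A minimal sketch that only builds and prints the commands, reusing the file names from the recipe:

```python
import shlex

F16_INPUT = "Phi-4-mini-F16.gguf"
QUANT_TYPES = ["Q4_K_M", "Q5_K_M", "Q8_0"]


def quantize_command(quant_type: str, f16_input: str = F16_INPUT) -> list:
    """Argument list: llama-quantize <input> <output> <type>."""
    output = f"Phi-4-mini-instruct-{quant_type}.gguf"
    return ["llama-quantize", f16_input, output, quant_type]


for quant_type in QUANT_TYPES:
    print(shlex.join(quantize_command(quant_type)))
```

To actually run a command, pass the list to subprocess.run(cmd, check=True) with llama-quantize on the PATH.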

Original model card

See the upstream model card for full architecture, training, benchmarks, and Microsoft's responsible-AI guidance: microsoft/Phi-4-mini-instruct.

License

MIT License, inherited from the original model, and among the most permissive licenses used by comparable on-device models. Commercial use, modification, redistribution, and bundling in proprietary apps are all permitted, provided the copyright and license notice are retained. See LICENSE for full terms.

Phi-4-mini by Microsoft Corporation. Licensed under the MIT License.

Acknowledgements

  • Microsoft for the original Phi-4-mini weights and an unusually open-licensed release.
  • The llama.cpp team for the GGUF format and quantization tooling.
  • The Haplo ecosystem this drop is built for.