
Small LLM Agent Benchmark

Real-world browser agent benchmark for small language models on Apple Silicon (16GB)

BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.

TL;DR Results

| Rank | Model | Score | Speed | Memory | Notes |
|------|-------|-------|-------|--------|-------|
| πŸ† 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
| πŸ† 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
| ✨ 3 | LFM2-1.2B-Tool Q8_0 (slim) | 4.5/6 | 76 tok/s | 2.75 GB | Efficiency king |
| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! |
| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
| 13 | Qwopus-27B Q3_K_S | OOM | β€” | 14+ GB | Doesn't fit 16 GB |
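The OOM row follows from back-of-envelope arithmetic: GGUF weight size is roughly parameter count times bits-per-weight divided by 8, and on a 16 GB Mac only part of that RAM is usable for the model. The sketch below illustrates this; the bits-per-weight figures, the 3 GB runtime overhead, and the 12 GB usable-memory ceiling are all assumed round numbers, not measured values.

```python
def gguf_weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough GGUF weight footprint in GB: params * bits / 8 (metadata ignored)."""
    return params_b * bits_per_weight / 8

def fits(params_b: float, bits_per_weight: float,
         overhead_gb: float = 3.0, usable_gb: float = 12.0) -> bool:
    """overhead_gb (KV cache + compute buffers) and usable_gb (what a 16 GB
    machine realistically leaves for the model) are assumed values."""
    return gguf_weight_gb(params_b, bits_per_weight) + overhead_gb <= usable_gb

# Qwopus-27B at ~3.5 bits/weight (Q3_K_S): ~11.8 GB of weights alone
print(round(gguf_weight_gb(27, 3.5), 1))  # 11.8
print(fits(27, 3.5))                      # False -> matches the OOM row
print(fits(9, 5.8))                       # True  -> a 9B Q4-class model fits
```

The same arithmetic explains why every model that completed the benchmark sits at 9B parameters or below.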

What This Benchmarks

6 real-world browser agent tasks, not synthetic function-call formatting tests:

| # | Task | Difficulty | What it tests |
|---|------|------------|---------------|
| T1 | Wikipedia info extraction | Easy | Navigate β†’ extract β†’ report |
| T2 | DuckDuckGo search | Medium | Navigate β†’ type β†’ click β†’ read |
| T3 | Hacker News top story | Easy | Navigate β†’ read β†’ stop |
| T4 | Cat image detection (Falcon Perception) | Medium | Navigate β†’ vision_detect β†’ report |
| T5 | Form filling (httpbin POST) | Medium | Navigate β†’ input Γ— 3 β†’ click submit |
| T6 | reCAPTCHA challenge | Hard | Navigate β†’ click β†’ vision β†’ batch click |

Each test requires multi-step tool chaining β€” not single-turn function call formatting.
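The half-point scores in the results table (4.5/6, 3.5/6) suggest each of the six tasks is scored with partial credit. The actual rubric lives in `bench/run_benchmark.py`; the sketch below is only an assumed illustration of the scale, where each task contributes 1.0 (pass), 0.5 (partial), or 0.0 (fail).

```python
# Hypothetical scoring sketch -- the real rubric is in bench/run_benchmark.py.
TASK_IDS = ["T1", "T2", "T3", "T4", "T5", "T6"]

def total_score(task_results: dict) -> float:
    """Sum per-task credit; each task is worth at most 1 point of the 6."""
    assert set(task_results) == set(TASK_IDS)
    assert all(v in (0.0, 0.5, 1.0) for v in task_results.values())
    return sum(task_results.values())

# e.g. four passes, one partial, one fail -> 4.5/6
print(total_score({"T1": 1, "T2": 1, "T3": 1, "T4": 0.5, "T5": 1, "T6": 0}))
```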

10 Counter-Intuitive Findings

  1. BFCL β‰  Agent Capability β€” Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
  2. Higher Quant β‰  Better for MoE β€” Gemma4: Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
  3. Higher Quant = Better for Dense β€” Qwen: Q4 (3.5) < Q6 (5.0)
  4. Uncensored β‰  Better Agent β€” quality gains come from quantization level, not from the uncensoring itself
  5. Faster Backend β‰  Better Results β€” GGUF 24 tok/s beats MLX 35 tok/s (proxy issues)
  6. 197 tok/s Model Scores 0/6 β€” FunctionGemma is useless despite being fastest
  7. 4B MoE = 9B Dense β€” Gemma4 E4B matches Qwen3.5-9B on agent tasks
  8. 1.2B Specialized > 8B Base β€” LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
  9. The "Capability Cliff" Has Exceptions β€” LFM2-1.2B-Tool breaks the 4B param rule
  10. Small Models Are Context-Starved β€” Reducing tools 26β†’8 pushed LFM2 from 4.0β†’4.5

5-Axis Analysis

Axis 1: Model Family

  • Minimum ~4B active params for multi-step agent tasks (with one exception)
  • MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
  • Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training

Axis 2: Censoring

  • Uncensored models show no advantage for tool-calling agent tasks
  • Quality improvements are entirely from quantization level, not from the uncensoring itself

Axis 3: Quantization

  • MoE models: Q5 is the sweet spot (speed > precision)
  • Dense models: Q6 is the sweet spot (precision > speed)
  • Never go below Q4 or above Q8 for agent tasks

Axis 4: Backend

  • llama.cpp GGUF is the universal winner β€” native tool calling, no proxy
  • MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
  • Ollama has API format issues with Gemma4
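Backend compatibility often comes down to tool-call wire formats. LFM2, for instance, emits pythonic tool calls (hence the `lfm2_proxy.py` in this repo) rather than OpenAI-style JSON. The sketch below shows one way such a proxy could translate; it assumes simple keyword-only calls with literal arguments and is not the repo's actual implementation.

```python
import ast
import json

def pythonic_to_openai(call_text: str) -> dict:
    """Translate a pythonic tool call such as
        browser_use(action="navigate", url="https://example.com")
    into an OpenAI-style tool_call dict. Assumes keyword-only literal
    arguments; the real lfm2_proxy.py may handle more cases."""
    node = ast.parse(call_text.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("not a simple function call")
    args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return {
        "type": "function",
        "function": {"name": node.func.id, "arguments": json.dumps(args)},
    }

tc = pythonic_to_openai('browser_use(action="navigate", url="https://example.com")')
print(tc["function"]["name"])  # browser_use
```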

Axis 5: Vision

  • mmproj + Falcon Perception together score 5.0/6 (best)
  • Either alone scores 4.5/6
  • Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates

Hardware

  • Mac Mini M4 16GB (Apple Silicon)
  • macOS Darwin 24.3.0
  • llama.cpp b8640 (homebrew)
  • Falcon Perception v2 (MLX backend)
  • GUA_Blazor .NET 10 agent framework

Architecture

```
User Task β†’ GUA_Blazor (agent loop, 25 turns)
  β†’ LLM (llama.cpp, port 8081) β€” reasoning + tool calling
  β†’ Falcon Perception (MLX, port 8090) β€” vision detection
  β†’ Playwright Chromium β€” browser automation
```
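The loop above can be sketched as a capped turn loop: ask the LLM, execute any tool call it emits, feed the result back, and stop on a final text answer or the 25-turn cap. The real GUA_Blazor loop is .NET; here `llm` and `run_tool` are stand-ins for the llama.cpp endpoint and the Playwright/Falcon tool layer.

```python
# Minimal agent-loop sketch, assuming stubbed llm/run_tool callables.
MAX_TURNS = 25

def agent_loop(task: str, llm, run_tool) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        reply = llm(messages)                  # reasoning + tool calling
        if "tool_call" not in reply:           # model produced a final answer
            return reply["content"]
        result = run_tool(reply["tool_call"])  # browser / vision action
        messages.append({"role": "tool", "content": result})
    return "MAX_TURNS exceeded"

# Stubbed one-shot run: call a tool once, then answer.
def fake_llm(msgs):
    if not any(m["role"] == "tool" for m in msgs):
        return {"tool_call": {"name": "browser_use",
                              "arguments": {"action": "navigate"}}}
    return {"content": "done"}

print(agent_loop("Open HN", fake_llm, lambda call: "ok"))  # done
```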

Run It Yourself

Prerequisites

```bash
# macOS with Apple Silicon
brew install llama.cpp
pip install falcon-perception  # or clone github.com/tiiuae/falcon-perception
```

Quick Speed Test

```bash
# Download a model
huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
  Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models

# Start server
llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
  --port 8081 -ngl 99 -c 16384

# Test tool calling
curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "test",
  "stream": false,
  "messages": [{"role": "user", "content": "Navigate to google.com"}],
  "tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
}'
```
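llama-server answers in the OpenAI-compatible chat-completions shape, so checking whether the model actually emitted a tool call means reading `choices[0].message.tool_calls`. Since no live server is assumed here, the sketch below runs against a canned response in that shape; a real run's field values will differ.

```python
import json

# Canned response in the OpenAI-compatible shape llama-server returns.
raw = json.dumps({
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "browser_use",
                    "arguments": '{"action": "navigate", "url": "https://google.com"}',
                },
            }],
        }
    }]
})

def first_tool_call(body: str):
    """Return (name, args) of the first tool call, or None if the model answered in text."""
    msg = json.loads(body)["choices"][0]["message"]
    calls = msg.get("tool_calls") or []
    if not calls:
        return None
    fn = calls[0]["function"]
    return fn["name"], json.loads(fn["arguments"])  # arguments arrive as a JSON string

print(first_tool_call(raw))
```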

Run Benchmark

```bash
python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
```

Models Tested

| Model | HuggingFace | Backend | mmproj? |
|-------|-------------|---------|---------|
| Gemma4 E4B Uncensored | HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Uncensored | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Base | unsloth/Qwen3.5-9B-GGUF | llama.cpp | Yes |
| LFM2-1.2B-Tool | LiquidAI/LFM2-1.2B-Tool-GGUF | llama.cpp | No (text only) |
| Bonsai-8B | prism-ml/Bonsai-8B-gguf | PrismML fork | No |
| Gemma4 E4B Base (MLX) | mlx-community/gemma-4-e4b-it-4bit | mlx_vlm | Native |
| LFM2-8B-A1B | LiquidAI/LFM2-8B-A1B-GGUF | llama.cpp | No |
| FunctionGemma 270M | unsloth/functiongemma-270m-it-GGUF | llama.cpp | No |

Key Files

```
bench/
  run_benchmark.py       β€” Main benchmark runner
  tasks.json             β€” 6 test task definitions
  results/               β€” Raw results from all runs
reports/
  FINAL_Report.md        β€” Complete 5-axis analysis
  Multi_Axis_Analysis.md β€” Detailed breakdown per axis
  Model_Comparison.md    β€” Side-by-side tables
proxies/
  gemma4_proxy.py        β€” Gemma4 MLX β†’ LlmTornado proxy (7 fixes)
  lfm2_proxy.py          β€” LFM2 pythonic tool-call proxy
vision/
  falcon_vision_server.py β€” Falcon Perception 3-layer adaptive pipeline
```

Citation

If you use this benchmark, please cite:

```bibtex
@misc{small-llm-agent-bench-2026,
  title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
  author={Xavier},
  year={2026},
  url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
}
```

License

MIT
