Small LLM Agent Benchmark
Real-world browser agent benchmark for small language models on Apple Silicon (16GB)
BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.
TL;DR Results
| Rank | Model | Score | Speed | Memory | Notes |
|---|---|---|---|---|---|
| 🥇 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
| 🥈 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
| 🥉 3 | LFM2-1.2B-Tool Q8_0 (slim) | 4.5/6 | 76 tok/s | 2.75 GB | Efficiency king |
| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! |
| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
| 13 | Qwopus-27B Q3_K_S | OOM | — | 14+ GB | Doesn't fit 16GB |
What This Benchmarks
6 real-world browser agent tasks, not synthetic function-call formatting tests:
| # | Task | Difficulty | What it tests |
|---|---|---|---|
| T1 | Wikipedia info extraction | Easy | Navigate → extract → report |
| T2 | DuckDuckGo search | Medium | Navigate → type → click → read |
| T3 | Hacker News top story | Easy | Navigate → read → stop |
| T4 | Cat image detection (Falcon Perception) | Medium | Navigate → vision_detect → report |
| T5 | Form filling (httpbin POST) | Medium | Navigate → input × 3 → click submit |
| T6 | reCAPTCHA challenge | Hard | Navigate → click → vision → batch click |
Each test requires multi-step tool chaining — not single-turn function call formatting.
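To make "multi-step tool chaining" concrete, here is a sketch of what one task definition and its scoring could look like. This is illustrative only: the field names and the half-credit rubric are assumptions, not the actual schema in `bench/tasks.json`.

```python
# Illustrative task entry; field names and scoring rubric are assumptions,
# not the real bench/tasks.json schema.
task_t5 = {
    "id": "T5",
    "name": "Form filling (httpbin POST)",
    "difficulty": "medium",
    "start_url": "https://httpbin.org/forms/post",
    # Expected tool chain: navigate, three input actions, then click submit
    "expected_actions": ["navigate", "input", "input", "input", "click"],
    "max_turns": 25,
}

def score(actions, task):
    """Full point if the expected chain appears in order; half credit for a
    partial attempt (assumed rubric — this is how fractional scores like 4.5/6
    could arise)."""
    it = iter(actions)
    # In-order subsequence check: each expected action must appear after the previous one.
    if all(any(a == expected for a in it) for expected in task["expected_actions"]):
        return 1.0
    return 0.5 if actions else 0.0
```

A model that emits the full chain in order earns the point; one that navigates but stalls gets half credit under this assumed rubric.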
10 Counter-Intuitive Findings
- BFCL ≠ Agent Capability — Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
- Higher Quant ≠ Better for MoE — Gemma4: Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
- Higher Quant = Better for Dense — Qwen: Q4 (3.5) < Q6 (5.0)
- Uncensored ≠ Better Agent — quality gains come from quantization, not censoring
- Faster Backend ≠ Better Results — GGUF at 24 tok/s beats MLX at 35 tok/s (proxy issues)
- 197 tok/s Model Scores 0/6 — FunctionGemma is useless despite being fastest
- 4B MoE = 9B Dense — Gemma4 E4B matches Qwen3.5-9B on agent tasks
- 1.2B Specialized > 8B Base — LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
- The "Capability Cliff" Has Exceptions — LFM2-1.2B-Tool breaks the 4B param rule
- Small Models Are Context-Starved — reducing tools 26 → 8 pushed LFM2 from 4.0 → 4.5
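The context-starvation finding is easy to reason about with rough numbers. The sketch below uses illustrative estimates (schema size, system-prompt size) that are assumptions, not measurements from the benchmark — only the 4K window and the 26 → 8 tool reduction come from the text above.

```python
# Back-of-envelope: how much of a small context window tool schemas consume.
# TOKENS_PER_TOOL_SCHEMA and SYSTEM_PROMPT_TOKENS are assumed values for illustration.
CONTEXT_WINDOW = 4096          # e.g. LFM2.5-Nova's 4K window from the results table
TOKENS_PER_TOOL_SCHEMA = 120   # assumed average JSON-schema size per tool
SYSTEM_PROMPT_TOKENS = 500     # assumed agent instructions

def free_tokens(num_tools):
    """Tokens left for page content and conversation after the fixed prompt."""
    return CONTEXT_WINDOW - SYSTEM_PROMPT_TOKENS - num_tools * TOKENS_PER_TOOL_SCHEMA

print(free_tokens(26))  # 26 tools: very little room left for page text
print(free_tokens(8))   # 8 tools: several times more headroom
```

Under these assumed numbers, 26 tool schemas leave only a few hundred tokens for actual page content in a 4K window, which is consistent with trimming the toolset helping small models.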
5-Axis Analysis
Axis 1: Model Family
- Minimum ~4B active params for multi-step agent tasks (with one exception)
- MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
- Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training
Axis 2: Censoring
- Uncensored models show no advantage for tool-calling agent tasks
- Quality improvements are entirely from quantization level, not censoring
Axis 3: Quantization
- MoE models: Q5 is the sweet spot (speed > precision)
- Dense models: Q6 is the sweet spot (precision > speed)
- Never go below Q4 or above Q8 for agent tasks
Axis 4: Backend
- llama.cpp GGUF is the universal winner — native tool calling, no proxy
- MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
- Ollama has API format issues with Gemma4
Axis 5: Vision
- mmproj + Falcon Perception together score 5.0/6 (best)
- Either alone scores 4.5/6
- Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates
Hardware
- Mac Mini M4 16GB (Apple Silicon)
- macOS Darwin 24.3.0
- llama.cpp b8640 (homebrew)
- Falcon Perception v2 (MLX backend)
- GUA_Blazor .NET 10 agent framework
Architecture
User Task → GUA_Blazor (agent loop, 25 turns)
  ├─ LLM (llama.cpp, port 8081) — reasoning + tool calling
  ├─ Falcon Perception (MLX, port 8090) — vision detection
  └─ Playwright Chromium — browser automation
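The agent loop in the diagram can be sketched as follows. GUA_Blazor itself is a .NET framework, so this Python sketch only mirrors the control flow; the message shapes follow the OpenAI-compatible format llama.cpp serves, and both callbacks are stand-ins for the real HTTP and browser layers.

```python
MAX_TURNS = 25  # matches the 25-turn cap in the architecture above

def run_agent(task, call_llm, execute_tool):
    """Minimal sketch of the loop: ask the model, run any tool calls,
    feed the results back, repeat until the model stops calling tools."""
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        reply = call_llm(messages)               # e.g. POST /v1/chat/completions (port 8081)
        messages.append({"role": "assistant", **reply})
        if not reply.get("tool_calls"):          # no tool call: the model is finished
            return reply.get("content", "")
        for call in reply["tool_calls"]:
            result = execute_tool(call)          # Playwright action or vision detection
            messages.append({"role": "tool", "content": result})
    return "max turns exceeded"
```

This is why multi-step tasks punish weak tool-callers: one malformed or looping tool call anywhere in the chain burns turns until the cap, regardless of raw tok/s.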
Run It Yourself
Prerequisites
# macOS with Apple Silicon
brew install llama.cpp
pip install falcon-perception # or clone github.com/tiiuae/falcon-perception
Quick Speed Test
# Download a model
huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models
# Start server
llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
--port 8081 -ngl 99 -c 16384
# Test tool calling
curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "test",
"stream": false,
"messages": [{"role": "user", "content": "Navigate to google.com"}],
"tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
}'
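To confirm the server actually emitted a structured tool call rather than prose, you can inspect the response shape. The sketch below assumes the OpenAI-compatible schema that llama-server returns (`choices[0].message.tool_calls`, with `arguments` as a JSON string); the sample payload is hand-written for illustration, not captured server output.

```python
import json

def extract_tool_calls(response_text):
    """Pull (name, arguments) pairs out of a /v1/chat/completions response.
    Assumes the OpenAI-compatible schema: arguments arrive as a JSON string."""
    msg = json.loads(response_text)["choices"][0]["message"]
    return [
        (tc["function"]["name"], json.loads(tc["function"]["arguments"]))
        for tc in msg.get("tool_calls") or []
    ]

# Hand-written sample response for illustration only.
sample = '''{"choices": [{"message": {"role": "assistant",
  "tool_calls": [{"id": "c1", "type": "function",
    "function": {"name": "browser_use",
                 "arguments": "{\\"action\\": \\"navigate\\", \\"url\\": \\"google.com\\"}"}}]}}]}'''
print(extract_tool_calls(sample))
```

An empty list here means the model answered in prose instead of calling the tool — exactly the failure mode that separates BFCL-style formatting scores from real agent behavior.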
Run Benchmark
python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
Models Tested
| Model | HuggingFace | Backend | mmproj? |
|---|---|---|---|
| Gemma4 E4B Uncensored | HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Uncensored | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Base | unsloth/Qwen3.5-9B-GGUF | llama.cpp | Yes |
| LFM2-1.2B-Tool | LiquidAI/LFM2-1.2B-Tool-GGUF | llama.cpp | No (text only) |
| Bonsai-8B | prism-ml/Bonsai-8B-gguf | PrismML fork | No |
| Gemma4 E4B Base (MLX) | mlx-community/gemma-4-e4b-it-4bit | mlx_vlm | Native |
| LFM2-8B-A1B | LiquidAI/LFM2-8B-A1B-GGUF | llama.cpp | No |
| FunctionGemma 270M | unsloth/functiongemma-270m-it-GGUF | llama.cpp | No |
Key Files
bench/
  run_benchmark.py — Main benchmark runner
  tasks.json — 6 test task definitions
  results/ — Raw results from all runs
reports/
  FINAL_Report.md — Complete 5-axis analysis
  Multi_Axis_Analysis.md — Detailed breakdown per axis
  Model_Comparison.md — Side-by-side tables
proxies/
  gemma4_proxy.py — Gemma4 MLX → LlmTornado proxy (7 fixes)
  lfm2_proxy.py — LFM2 pythonic tool-call proxy
vision/
  falcon_vision_server.py — Falcon Perception 3-layer adaptive pipeline
Citation
If you use this benchmark, please cite:
@misc{small-llm-agent-bench-2026,
title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
author={Xavier},
year={2026},
url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
}
License
MIT