
Small LLM Agent Benchmark

Real-world browser agent benchmark for small language models on Apple Silicon (16GB)

BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.

TL;DR Results

| Rank | Model | Score | Speed | Memory | Notes |
|------|-------|-------|-------|--------|-------|
| πŸ† 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
| πŸ† 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
| ✨ 3 | LFM2-1.2B-Tool Q8_0 (slim) | 4.5/6 | 76 tok/s | 2.75 GB | Efficiency king |
| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! |
| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
| 13 | Qwopus-27B Q3_K_S | OOM | β€” | 14+ GB | Doesn't fit 16 GB |
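The OOM row follows from back-of-envelope arithmetic: GGUF weight size is roughly parameter count times bits-per-weight divided by 8, and on a 16 GB Mac only part of that RAM is usable for the model. The sketch below illustrates this; the bits-per-weight figures, the 3 GB runtime overhead, and the 12 GB usable-memory ceiling are all assumed round numbers, not measured values.

```python
def gguf_weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough GGUF weight footprint in GB: params * bits / 8 (metadata ignored)."""
    return params_b * bits_per_weight / 8

def fits(params_b: float, bits_per_weight: float,
         overhead_gb: float = 3.0, usable_gb: float = 12.0) -> bool:
    """overhead_gb (KV cache + compute buffers) and usable_gb (what a 16 GB
    machine realistically leaves for the model) are assumed values."""
    return gguf_weight_gb(params_b, bits_per_weight) + overhead_gb <= usable_gb

# Qwopus-27B at ~3.5 bits/weight (Q3_K_S): ~11.8 GB of weights alone
print(round(gguf_weight_gb(27, 3.5), 1))  # 11.8
print(fits(27, 3.5))                      # False -> matches the OOM row
print(fits(9, 5.8))                       # True  -> a 9B Q4-class model fits
```

The same arithmetic explains why every model that completed the benchmark sits at 9B parameters or below.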

What This Benchmarks

6 real-world browser agent tasks, not synthetic function-call formatting tests:

| # | Task | Difficulty | What it tests |
|---|------|------------|---------------|
| T1 | Wikipedia info extraction | Easy | Navigate β†’ extract β†’ report |
| T2 | DuckDuckGo search | Medium | Navigate β†’ type β†’ click β†’ read |
| T3 | Hacker News top story | Easy | Navigate β†’ read β†’ stop |
| T4 | Cat image detection (Falcon Perception) | Medium | Navigate β†’ vision_detect β†’ report |
| T5 | Form filling (httpbin POST) | Medium | Navigate β†’ input Γ— 3 β†’ click submit |
| T6 | reCAPTCHA challenge | Hard | Navigate β†’ click β†’ vision β†’ batch click |

Each test requires multi-step tool chaining β€” not single-turn function call formatting.
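The half-point scores in the results table (4.5/6, 3.5/6) suggest each of the six tasks is scored with partial credit. The actual rubric lives in `bench/run_benchmark.py`; the sketch below is only an assumed illustration of the scale, where each task contributes 1.0 (pass), 0.5 (partial), or 0.0 (fail).

```python
# Hypothetical scoring sketch -- the real rubric is in bench/run_benchmark.py.
TASK_IDS = ["T1", "T2", "T3", "T4", "T5", "T6"]

def total_score(task_results: dict) -> float:
    """Sum per-task credit; each task is worth at most 1 point of the 6."""
    assert set(task_results) == set(TASK_IDS)
    assert all(v in (0.0, 0.5, 1.0) for v in task_results.values())
    return sum(task_results.values())

# e.g. four passes, one partial, one fail -> 4.5/6
print(total_score({"T1": 1, "T2": 1, "T3": 1, "T4": 0.5, "T5": 1, "T6": 0}))
```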

10 Counter-Intuitive Findings

  1. BFCL β‰  Agent Capability β€” Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
  2. Higher Quant β‰  Better for MoE β€” Gemma4: Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
  3. Higher Quant = Better for Dense β€” Qwen: Q4 (3.5) < Q6 (5.0)
  4. Uncensored β‰  Better Agent β€” quality gains come from quantization level, not from the uncensoring itself
  5. Faster Backend β‰  Better Results β€” GGUF 24 tok/s beats MLX 35 tok/s (proxy issues)
  6. 197 tok/s Model Scores 0/6 β€” FunctionGemma is useless despite being fastest
  7. 4B MoE = 9B Dense β€” Gemma4 E4B matches Qwen3.5-9B on agent tasks
  8. 1.2B Specialized > 8B Base β€” LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
  9. The "Capability Cliff" Has Exceptions β€” LFM2-1.2B-Tool breaks the 4B param rule
  10. Small Models Are Context-Starved β€” Reducing tools 26β†’8 pushed LFM2 from 4.0β†’4.5

5-Axis Analysis

Axis 1: Model Family

  • Minimum ~4B active params for multi-step agent tasks (with one exception)
  • MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
  • Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training

Axis 2: Censoring

  • Uncensored models show no advantage for tool-calling agent tasks
  • Quality improvements are entirely from quantization level, not from the uncensoring itself

Axis 3: Quantization

  • MoE models: Q5 is the sweet spot (speed > precision)
  • Dense models: Q6 is the sweet spot (precision > speed)
  • Never go below Q4 or above Q8 for agent tasks

Axis 4: Backend

  • llama.cpp GGUF is the universal winner β€” native tool calling, no proxy
  • MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
  • Ollama has API format issues with Gemma4
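Backend compatibility often comes down to tool-call wire formats. LFM2, for instance, emits pythonic tool calls (hence the `lfm2_proxy.py` in this repo) rather than OpenAI-style JSON. The sketch below shows one way such a proxy could translate; it assumes simple keyword-only calls with literal arguments and is not the repo's actual implementation.

```python
import ast
import json

def pythonic_to_openai(call_text: str) -> dict:
    """Translate a pythonic tool call such as
        browser_use(action="navigate", url="https://example.com")
    into an OpenAI-style tool_call dict. Assumes keyword-only literal
    arguments; the real lfm2_proxy.py may handle more cases."""
    node = ast.parse(call_text.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("not a simple function call")
    args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return {
        "type": "function",
        "function": {"name": node.func.id, "arguments": json.dumps(args)},
    }

tc = pythonic_to_openai('browser_use(action="navigate", url="https://example.com")')
print(tc["function"]["name"])  # browser_use
```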

Axis 5: Vision

  • mmproj + Falcon Perception together score 5.0/6 (best)
  • Either alone scores 4.5/6
  • Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates

Hardware

  • Mac Mini M4 16GB (Apple Silicon)
  • macOS Darwin 24.3.0
  • llama.cpp b8640 (homebrew)
  • Falcon Perception v2 (MLX backend)
  • GUA_Blazor .NET 10 agent framework

Architecture

```
User Task β†’ GUA_Blazor (agent loop, 25 turns)
  β†’ LLM (llama.cpp, port 8081) β€” reasoning + tool calling
  β†’ Falcon Perception (MLX, port 8090) β€” vision detection
  β†’ Playwright Chromium β€” browser automation
```
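The loop above can be sketched as a capped turn loop: ask the LLM, execute any tool call it emits, feed the result back, and stop on a final text answer or the 25-turn cap. The real GUA_Blazor loop is .NET; here `llm` and `run_tool` are stand-ins for the llama.cpp endpoint and the Playwright/Falcon tool layer.

```python
# Minimal agent-loop sketch, assuming stubbed llm/run_tool callables.
MAX_TURNS = 25

def agent_loop(task: str, llm, run_tool) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        reply = llm(messages)                  # reasoning + tool calling
        if "tool_call" not in reply:           # model produced a final answer
            return reply["content"]
        result = run_tool(reply["tool_call"])  # browser / vision action
        messages.append({"role": "tool", "content": result})
    return "MAX_TURNS exceeded"

# Stubbed one-shot run: call a tool once, then answer.
def fake_llm(msgs):
    if not any(m["role"] == "tool" for m in msgs):
        return {"tool_call": {"name": "browser_use",
                              "arguments": {"action": "navigate"}}}
    return {"content": "done"}

print(agent_loop("Open HN", fake_llm, lambda call: "ok"))  # done
```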

Run It Yourself

Prerequisites

```bash
# macOS with Apple Silicon
brew install llama.cpp
pip install falcon-perception  # or clone github.com/tiiuae/falcon-perception
```

Quick Speed Test

```bash
# Download a model
huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
  Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models

# Start server
llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
  --port 8081 -ngl 99 -c 16384

# Test tool calling
curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "test",
  "stream": false,
  "messages": [{"role": "user", "content": "Navigate to google.com"}],
  "tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
}'
```
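llama-server answers in the OpenAI-compatible chat-completions shape, so checking whether the model actually emitted a tool call means reading `choices[0].message.tool_calls`. Since no live server is assumed here, the sketch below runs against a canned response in that shape; a real run's field values will differ.

```python
import json

# Canned response in the OpenAI-compatible shape llama-server returns.
raw = json.dumps({
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "browser_use",
                    "arguments": '{"action": "navigate", "url": "https://google.com"}',
                },
            }],
        }
    }]
})

def first_tool_call(body: str):
    """Return (name, args) of the first tool call, or None if the model answered in text."""
    msg = json.loads(body)["choices"][0]["message"]
    calls = msg.get("tool_calls") or []
    if not calls:
        return None
    fn = calls[0]["function"]
    return fn["name"], json.loads(fn["arguments"])  # arguments arrive as a JSON string

print(first_tool_call(raw))
```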

Run Benchmark

```bash
python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
```

Models Tested

| Model | HuggingFace | Backend | mmproj? |
|-------|-------------|---------|---------|
| Gemma4 E4B Uncensored | HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Uncensored | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Base | unsloth/Qwen3.5-9B-GGUF | llama.cpp | Yes |
| LFM2-1.2B-Tool | LiquidAI/LFM2-1.2B-Tool-GGUF | llama.cpp | No (text only) |
| Bonsai-8B | prism-ml/Bonsai-8B-gguf | PrismML fork | No |
| Gemma4 E4B Base (MLX) | mlx-community/gemma-4-e4b-it-4bit | mlx_vlm | Native |
| LFM2-8B-A1B | LiquidAI/LFM2-8B-A1B-GGUF | llama.cpp | No |
| FunctionGemma 270M | unsloth/functiongemma-270m-it-GGUF | llama.cpp | No |

Key Files

```
bench/
  run_benchmark.py       β€” Main benchmark runner
  tasks.json             β€” 6 test task definitions
  results/               β€” Raw results from all runs
reports/
  FINAL_Report.md        β€” Complete 5-axis analysis
  Multi_Axis_Analysis.md β€” Detailed breakdown per axis
  Model_Comparison.md    β€” Side-by-side tables
proxies/
  gemma4_proxy.py        β€” Gemma4 MLX β†’ LlmTornado proxy (7 fixes)
  lfm2_proxy.py          β€” LFM2 pythonic tool-call proxy
vision/
  falcon_vision_server.py β€” Falcon Perception 3-layer adaptive pipeline
```

Citation

If you use this benchmark, please cite:

```bibtex
@misc{small-llm-agent-bench-2026,
  title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
  author={Xavier},
  year={2026},
  url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
}
```

License

MIT
