# llama.cpp Expert Sniper: madvise prefetch for MoE inference
~65 lines of C++ that enable MoE models larger than RAM in llama.cpp.
Stock llama.cpp thrashes indefinitely; this build generates tokens.
## Results
| Hardware | RAM | Model | Stock llama.cpp | madvise build |
|---|---|---|---|---|
| M2 MacBook Air | 8 GB | Qwen3.5-35B-A3B IQ2_M (10.6 GB) | 0 tok/s (thrash) | 0.57 tok/s |
| M2 MacBook Air | 8 GB | Same model, no-op callback only | 0 tok/s (thrash) | 0.46 tok/s |
On GPU machines with abundant RAM (A100 251 GB, RTX 3090 31 GB), stock llama.cpp is faster: the OS page cache handles it. madvise helps specifically when system RAM < model size and the layers run on the CPU (`-ngl 0`).
## How it works
MoE models activate 8 of 128+ experts per token, and consecutive tokens share ~87% of their active experts. Stock llama.cpp uses mmap, but the OS has no idea which expert pages are hot: it evicts them blindly, causing a page-fault storm.
Our patch hooks llama.cpp's eval callback, intercepts every `ggml_mul_mat_id` operation, reads the router's top-k expert selection from `t->src[2]`, and calls `madvise(MADV_WILLNEED)` on each active expert's memory range. This tells the kernel which pages to prefetch before the compute needs them.
Zero allocation. Zero memcpy. Zero mutex. One syscall per expert slice.
## What we learned
1. madvise beats an LRU cache everywhere. We first built a 460-line LRU cache. It was 2.4x slower than 15 lines of madvise (0.24 vs 0.57 tok/s on the 8 GB MacBook). The cache stole 5 GB from the OS page cache to hold duplicate data. Don't fight the OS page cache; coach it.
2. Even a no-op callback prevents thrashing. Just hooking the eval callback and inspecting tensor pointers (without any madvise) produces 0.46 tok/s where stock produces zero. The callback inadvertently warms mmap pages through pointer inspection.
3. Device pointer bug. All prior GPU benchmarks were silently invalid: the callback dereferenced `t->src[2]->data` without checking `ggml_backend_buffer_is_host()`. On GPU layers this was a CUDA device pointer. Fixed.
4. With abundant RAM, do nothing. The OS page cache is remarkably good when it has enough room; any intervention is pure overhead when RAM exceeds model size.
## Build

```sh
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/llama-cpp

# macOS (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(sysctl -n hw.ncpu)" --target llama-server

# NVIDIA GPU (CUDA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)" --target llama-server
```
## Usage

```sh
# Enable madvise prefetch
./build/bin/llama-server \
  -m Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  -ngl 0 \
  --expert-cache-size 1 \
  --port 8201

# Test no-op mode (isolate callback overhead)
EXPERT_CACHE_NOOP=1 ./build/bin/llama-server \
  -m model.gguf \
  -ngl 0 \
  --expert-cache-size 1 \
  --port 8201
```
## Files changed vs stock llama.cpp
New files (~430 lines):

| File | Purpose |
|---|---|
| `src/llama-expert-cache-ctx.cpp` | Eval callback, madvise prefetch, tensor identification |
| `src/llama-expert-cache-ctx.h` | Context struct and declarations |
| `src/llama-expert-cache.cpp` | LRU cache (deprecated, retained for reference) |
| `src/llama-expert-cache.h` | Cache class definition |
Patched files (~30 lines across 5 files):

| File | Change |
|---|---|
| `src/CMakeLists.txt` | Added new source files to the build |
| `src/llama-context.h` | Added expert cache context member |
| `common/common.h` | Added `expert_cache_size` parameter |
| `common/common.cpp` | Cache init + eval callback registration |
| `common/arg.cpp` | `--expert-cache-size` CLI flag |
## The research journey
- 460-line LRU cache → 0.24 tok/s (stole RAM from the OS page cache)
- 15-line madvise → 0.57 tok/s (coached the OS page cache)
- no-op callback → 0.46 tok/s (accidental page warming)
The cache was the experiment. madvise was the answer.
## Full three-way benchmark (8 GB MacBook Air)
| Config | tok/s | Mechanism |
|---|---|---|
| Stock llama.cpp | 0 (thrash) | Blind OS LRU eviction, no domain knowledge |
| No-op callback | 0.46 | Accidental page warming from tensor inspection |
| madvise prefetch | 0.57 | Explicit kernel prefetch hints |
| LRU cache (5 GB) | 0.24 | Duplicate data in user-space heap |
## Gemma 4-26B-A4B: MoE Sparsity Benchmark
Google Gemma 4 has 128 experts with top-8 routing (4B active of 26B total). Tested at multiple quantization levels on Apple Silicon:
| Hardware | Quant | Model size | RAM | Speed | Notes |
|---|---|---|---|---|---|
| M2 MacBook Air | IQ2_M | 9.3 GB | 8 GB | 1.37 tok/s | Model exceeds RAM, MoE sparsity prevents thrash |
| M4 Mac Mini | IQ2_M | 9.3 GB | 16 GB | 36.5 tok/s | Fits in RAM, full GPU speed |
| M4 Mac Mini | Q4_K_M | 16.9 GB | 16 GB | 5.18 tok/s | Exceeds RAM, still runs smoothly |
| M4 Mac Mini | Q8_0 | 26.9 GB | 16 GB | 0 tok/s (thrash) | CPU_REPACK doubles memory to 51 GB, can't load |
All results: stock llama.cpp with mmap, no madvise. Canberra verified on all configs.
Finding: Gemma 4's low activation ratio (15.4%) lets the OS page cache handle memory pressure without explicit madvise. The madvise sniper is most valuable for denser MoE models (Qwen 35B) where the per-token working set overwhelms the page cache.
## Related
- MLX Expert Sniper (Apple Silicon, 5.4 tok/s on 35B): huggingface.co/waltgrace/mlx-expert-sniper
- Full research + code: github.com/walter-grace/mac-code/tree/main/research/expert-sniper