# llama.cpp Expert Sniper: madvise prefetch for MoE inference
~65 lines of C++ that enable MoE models larger than RAM in llama.cpp.
Stock llama.cpp thrashes indefinitely; this build generates tokens.
## Results
| Hardware | RAM | Model | Stock llama.cpp | madvise build |
|---|---|---|---|---|
| M2 MacBook Air | 8 GB | Qwen3.5-35B-A3B IQ2_M (10.6 GB) | 0 tok/s (thrash) | 0.57 tok/s |
| M2 MacBook Air | 8 GB | Same model, no-op callback only | 0 tok/s (thrash) | 0.46 tok/s |
On GPU machines with abundant RAM (A100 251 GB, RTX 3090 31 GB), stock llama.cpp is faster: the OS page cache handles it. madvise helps specifically when system RAM < model size and the layers run on the CPU (`-ngl 0`).
## How it works
MoE models activate 8 of 128+ experts per token, and consecutive tokens share ~87% of their active experts. Stock llama.cpp uses mmap, but the OS has no idea which expert pages are hot: it evicts them blindly, causing a page-fault storm.
Our patch hooks llama.cpp's eval callback, intercepts every `ggml_mul_mat_id` operation, reads the router's top-k expert selection from `t->src[2]`, and calls `madvise(MADV_WILLNEED)` on each active expert's memory range. This tells the kernel which pages to prefetch before the compute needs them.
Zero allocation. Zero memcpy. Zero mutex. One syscall per expert slice.
## What we learned
1. madvise beats an LRU cache everywhere. We first built a 460-line LRU cache. It was 2.4x slower than 15 lines of madvise (0.24 vs 0.57 tok/s on the 8 GB MacBook). The cache stole 5 GB from the OS page cache to hold duplicate data. Don't fight the OS page cache; coach it.
2. Even a no-op callback prevents thrashing. Just hooking the eval callback and inspecting tensor pointers (without any madvise) produces 0.46 tok/s where stock produces zero. The callback inadvertently warms mmap pages through pointer inspection.
3. Device pointer bug. All prior GPU benchmarks were silently invalid: the callback dereferenced `t->src[2]->data` without checking `ggml_backend_buffer_is_host()`. On GPU layers this was a CUDA device pointer. Fixed.
4. With abundant RAM, do nothing. The OS page cache is remarkably good when it has enough room; any intervention is pure overhead when RAM exceeds model size.
## Build

```sh
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/llama-cpp

# macOS (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(sysctl -n hw.ncpu)" --target llama-server

# NVIDIA GPU (CUDA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)" --target llama-server
```
## Usage

```sh
# Enable madvise prefetch
./build/bin/llama-server \
  -m Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  -ngl 0 \
  --expert-cache-size 1 \
  --port 8201

# Test no-op mode (isolate callback overhead)
EXPERT_CACHE_NOOP=1 ./build/bin/llama-server \
  -m model.gguf \
  -ngl 0 \
  --expert-cache-size 1 \
  --port 8201
```
## Files changed vs stock llama.cpp
New files (~430 lines):

| File | Purpose |
|---|---|
| `src/llama-expert-cache-ctx.cpp` | Eval callback, madvise prefetch, tensor identification |
| `src/llama-expert-cache-ctx.h` | Context struct and declarations |
| `src/llama-expert-cache.cpp` | LRU cache (deprecated, retained for reference) |
| `src/llama-expert-cache.h` | Cache class definition |
Patched files (~30 lines across 5 files):

| File | Change |
|---|---|
| `src/CMakeLists.txt` | Added new source files to the build |
| `src/llama-context.h` | Added expert cache context member |
| `common/common.h` | Added `expert_cache_size` parameter |
| `common/common.cpp` | Cache init + eval callback registration |
| `common/arg.cpp` | `--expert-cache-size` CLI flag |
## The research journey
- 460-line LRU cache → 0.24 tok/s (stole RAM from the OS page cache)
- 15-line madvise → 0.57 tok/s (coached the OS page cache)
- no-op callback → 0.46 tok/s (accidental page warming)
The cache was the experiment. madvise was the answer.
## Full three-way benchmark (8 GB MacBook Air)
| Config | tok/s | Mechanism |
|---|---|---|
| Stock llama.cpp | 0 (thrash) | Blind OS LRU eviction, no domain knowledge |
| No-op callback | 0.46 | Accidental page warming from tensor inspection |
| madvise prefetch | 0.57 | Explicit kernel prefetch hints |
| LRU cache (5 GB) | 0.24 | Duplicate data in user-space heap |
## Gemma 4-26B-A4B: MoE Sparsity Benchmark
Google Gemma 4 has 128 experts with top-8 routing (4B active of 26B total). Tested at multiple quantization levels on Apple Silicon:
| Hardware | Quant | Model size | RAM | Speed | Notes |
|---|---|---|---|---|---|
| M2 MacBook Air | IQ2_M | 9.3 GB | 8 GB | 1.37 tok/s | Model exceeds RAM, MoE sparsity prevents thrash |
| M4 Mac Mini | IQ2_M | 9.3 GB | 16 GB | 36.5 tok/s | Fits in RAM, full GPU speed |
| M4 Mac Mini | Q4_K_M | 16.9 GB | 16 GB | 5.18 tok/s | Exceeds RAM, still runs smoothly |
| M4 Mac Mini | Q8_0 | 26.9 GB | 16 GB | 0 tok/s (thrash) | CPU_REPACK doubles memory to 51 GB, can't load |
All results: stock llama.cpp with mmap, no madvise. Canberra verified on all configs.
Finding: Gemma 4's low activation ratio (15.4%) lets the OS page cache handle memory pressure without explicit madvise. The madvise sniper is most valuable for denser MoE models (Qwen 35B) where the per-token working set overwhelms the page cache.
## Related
- MLX Expert Sniper (Apple Silicon, 5.4 tok/s on 35B): huggingface.co/waltgrace/mlx-expert-sniper
- Full research + code: github.com/walter-grace/mac-code/tree/main/research/expert-sniper