llama.cpp Expert Sniper — madvise prefetch for MoE inference

~65 lines of C++ that enable MoE models larger than RAM on llama.cpp.

Stock llama.cpp thrashes indefinitely. This build generates tokens.

Results

| Hardware | RAM | Model | Stock llama.cpp | madvise build |
|---|---|---|---|---|
| M2 MacBook Air | 8 GB | Qwen3.5-35B-A3B IQ2_M (10.6 GB) | 0 tok/s (thrash) | 0.57 tok/s |
| M2 MacBook Air | 8 GB | Same model, no-op callback only | 0 tok/s (thrash) | 0.46 tok/s |

On GPU machines with abundant RAM (A100 251 GB, RTX 3090 31 GB), stock llama.cpp is faster — the OS page cache handles it. madvise helps specifically when system RAM < model size and the layers run on the CPU (-ngl 0).

How it works

MoE models activate 8 of 128+ experts per token. Consecutive tokens share ~87% of active experts. Stock llama.cpp uses mmap, but the OS has no idea which expert pages are hot — it evicts them randomly, causing a page fault storm.

Our patch hooks llama.cpp's eval callback, intercepts every ggml_mul_mat_id operation, reads the router's top-k expert selection from t->src[2], and calls madvise(MADV_WILLNEED) on each active expert's memory range. This tells the kernel which pages to prefetch before the compute needs them.

Zero allocation. Zero memcpy. Zero mutex. One syscall per expert slice.

What we learned

1. madvise beats the LRU cache everywhere we tested. We first built a 460-line LRU cache. It was 2.4x slower than 15 lines of madvise (0.24 vs 0.57 tok/s on the 8 GB MacBook). The cache stole 5 GB from the OS page cache to hold duplicate data. Don't fight the OS page cache — coach it.

2. Even a no-op callback prevents thrashing. Just hooking the eval callback and inspecting tensor pointers (without any madvise) produces 0.46 tok/s where stock produces zero. The callback inadvertently warms mmap pages through pointer inspection.

3. Device pointer bug. All prior GPU benchmarks were silently invalid — the callback dereferenced t->src[2]->data without checking ggml_backend_buffer_is_host(). On GPU layers this was a CUDA device pointer. Fixed.

4. On abundant RAM, do nothing. The OS page cache is remarkably good when it has enough room. Any intervention is pure overhead when RAM exceeds model size.

Build

```shell
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/llama-cpp

# macOS (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu) --target llama-server

# NVIDIA GPU (CUDA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target llama-server
```

Usage

```shell
# Enable madvise prefetch
./build/bin/llama-server \
  -m Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  -ngl 0 \
  --expert-cache-size 1 \
  --port 8201

# Test no-op mode (isolate callback overhead)
EXPERT_CACHE_NOOP=1 ./build/bin/llama-server \
  -m model.gguf \
  -ngl 0 \
  --expert-cache-size 1 \
  --port 8201
```

Files changed vs stock llama.cpp

New files (~430 lines):

| File | Purpose |
|---|---|
| src/llama-expert-cache-ctx.cpp | Eval callback, madvise prefetch, tensor identification |
| src/llama-expert-cache-ctx.h | Context struct and declarations |
| src/llama-expert-cache.cpp | LRU cache (deprecated, retained for reference) |
| src/llama-expert-cache.h | Cache class definition |

Patched files (~30 lines across 5 files):

| File | Change |
|---|---|
| src/CMakeLists.txt | Added new source files to build |
| src/llama-context.h | Added expert cache context member |
| common/common.h | Added expert_cache_size parameter |
| common/common.cpp | Cache init + eval callback registration |
| common/arg.cpp | --expert-cache-size CLI flag |

The research journey

```
460-line LRU cache → 0.24 tok/s (stole RAM from OS page cache)
15-line madvise    → 0.57 tok/s (coached the OS page cache)
no-op callback     → 0.46 tok/s (accidental page warming)
```

The cache was the experiment. madvise was the answer.

Full three-way benchmark (8 GB MacBook Air)

| Config | tok/s | Mechanism |
|---|---|---|
| Stock llama.cpp | 0 (thrash) | OS blind LRU, no domain knowledge |
| No-op callback | 0.46 | Accidental page warming from tensor inspection |
| madvise prefetch | 0.57 | Explicit kernel prefetch hints |
| LRU cache (5 GB) | 0.24 | Duplicate data in user-space heap |

Gemma 4-26B-A4B — MoE Sparsity Benchmark

Google Gemma 4 has 128 experts with top-8 routing (4B active of 26B total). Tested at multiple quantization levels on Apple Silicon:

| Hardware | Quant | Model size | RAM | Speed | Notes |
|---|---|---|---|---|---|
| M2 MacBook Air | IQ2_M | 9.3 GB | 8 GB | 1.37 tok/s | Model exceeds RAM; MoE sparsity prevents thrash |
| M4 Mac Mini | IQ2_M | 9.3 GB | 16 GB | 36.5 tok/s | Fits in RAM, full GPU speed |
| M4 Mac Mini | Q4_K_M | 16.9 GB | 16 GB | 5.18 tok/s | Exceeds RAM, still runs smoothly |
| M4 Mac Mini | Q8_0 | 26.9 GB | 16 GB | 0 tok/s (thrash) | CPU_REPACK doubles memory to 51 GB; can't load |

All results: stock llama.cpp with mmap, no madvise. Canberra verified on all configs.

Finding: Gemma 4's low activation ratio (15.4%) lets the OS page cache handle memory pressure without explicit madvise. The madvise sniper is most valuable for denser MoE models (Qwen 35B) where the per-token working set overwhelms the page cache.
