8x RTX 3090, EPYC 7532, 512GB RAM: Benchmarking MiniMax-M2.5-IQ4_NL for High-Speed Coding (35-75 t/s) with opencode

#10
by martossien - opened

Hello everyone,

Following my tests with ubergarm's GLM-4.7-355B-IQ5_K, I have now extensively tested MiniMax-M2.5-IQ4_NL. My primary use case is development with OpenCode (agentic coding workflow), requiring both large context (100k+) and high generation speed.

While MiniMax-M2.5 might not beat the top-tier coding models in pure reasoning benchmarks, its speed/quality ratio on this hardware is phenomenal for iterative development.

Generation Speed: 35 to 75 tokens/s
VRAM Usage: ~175 GB / 192 GB (leaving room for 2 simultaneous OpenCode sessions)
Context: Tested stable at 262k tokens context length

GPU: 8x NVIDIA RTX 3090 (24GB each, total 192GB VRAM)
Note: 2 cards are linked via NVLink, others via PCIe
CPU: AMD EPYC 7532 (32 cores / 64 threads)
RAM: 512 GB DDR4 2933 MHz ECC
Software: ik_llama.cpp + NVIDIA Drivers 580.126.09

~/ik_llama.cpp/build/bin/llama-server \
  --model /path/to/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  --alias MiniMax-M2.5-IQ4_NL \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 262144 \
  --no-mmap \
  --threads 32 \
  --threads-batch 64 \
  --batch-size 2048 \
  --ubatch-size 4096 \
  --parallel 2 \
  --flash-attn 1 \
  --n-gpu-layers 999 \
  --split-mode graph \
  --tensor-split 0.9,1,1,1,1,1,1,1 \
  --numa distribute \
  --run-time-repack \
  -gr -ger \
  --merge-qkv \
  --cache-type-k q5_1 \
  --cache-type-v q5_1 \
  --k-cache-hadamard \
  --jinja \
  --chat-template-kwargs '{"enable_thinking": false}'

--chat-template-kwargs '{"enable_thinking": false}': Mandatory for OpenCode. Without this, you will get an "Assistant response prefill is incompatible with enable_thinking" error because OpenCode forces the start of the assistant's reply (prefill).
--parallel 2: Allows two OpenCode developers to work simultaneously (splitting the 262k context budget).
--tensor-split 0.9,...: Reduces load on the first GPU (Headless server but driving 2 screens), preventing OOM on display tasks.
--cache-type-k/v q5_1: I'm not sure about this one, but it works (I used q4_0 before).
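As a rough illustration of the --parallel note above: assuming the server divides --ctx-size evenly between slots (which is what "splitting the 262k context budget" implies), each session gets about half of the window. The helper name per_slot_ctx is made up for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tokens available per slot, assuming --ctx-size
# is divided evenly across the --parallel slots.
per_slot_ctx() {
  local ctx_size=$1 parallel=$2
  echo $(( ctx_size / parallel ))
}

per_slot_ctx 262144 2   # prints 131072
```

So with two sessions, each one would still have roughly 131k tokens of context under that assumption.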

Huge thanks to ubergarm for providing these high-quality IQ quants! This makes running such massive MoE models locally not just possible, but incredibly fast.

MiniMax-M2.5-IQ4_NL

@martossien

Super, glad you're enjoying the quants on your huge rig!

A few tips given you are using full GPU offload:

  1. When using full GPU offload, always set threads to -t 1 to minimize overhead of synchronizing unused threads. Might give 1-3% more benefit anecdotally. So remove --threads 32 --threads-batch 64 unless you are explicitly leaving some layers on CPU/RAM.

  2. Your --batch-size 2048 --ubatch-size 4096 is strange. The defaults are -ub 512 -b 2048. The micro-batch -ub should never be larger than the logical batch -b, so you can just use -ub 4096 -b 4096 most of the time for increased PP performance, at the cost of some latency on very small prompts and a little more VRAM for the compute buffer that holds the larger batch.

  3. Remove --numa distribute, as that only matters if you are running any tensors on CPU/RAM and only if you also used numactl before llama-server. It's not being used here, but it's nice to keep your commands clean for less confusion.

  4. Remove --run-time-repack as that is only used for CPU/RAM tensors once again. It is smart enough to not repack tensors on GPU fortunately so you're not hurting yourself. In general I don't use -rtr anymore as even for MoEs using the normal non-repacked quant types give big PP gains for large batch sizes.

  5. --cache-type-k q5_1 --cache-type-v q5_1 --k-cache-hadamard: wow, that is pretty aggressive kv-cache quantization; pretty amazing it works with such long context. If you can spare the extra VRAM, I'd suggest trying -khad -ctk q6_0 -ctv q8_0, as keeping the v-cache a bit larger is likely good, and q6_0 with khad is still quite good quality.
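To make tip 2 concrete, here is a small pre-flight check one could run before launching. It is only a sketch: check_batch_sizes is a hypothetical helper, not part of ik_llama.cpp, and it encodes nothing more than the rule of thumb that -ub should not exceed -b.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: the micro-batch (-ub) should not be
# larger than the logical batch (-b).
check_batch_sizes() {
  local batch=$1 ubatch=$2
  if (( ubatch > batch )); then
    echo "warning: -ub $ubatch is larger than -b $batch"
    return 1
  fi
  echo "ok: -ub $ubatch fits within -b $batch"
}

check_batch_sizes 2048 4096 || true   # the original command: warns
check_batch_sizes 4096 4096           # the suggested setting: ok
```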

Have fun, and hopefully I'll have some GLM-5 quants up eventually! Also super curious about Qwen3(.5)Next too!

With the following command:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf --alias MiniMax-M2.5-IQ4_NL --host 0.0.0.0 --port 8080 --ctx-size 198000 --no-mmap --threads 16 --batch-size 4096 --ubatch-size 4096 --parallel 1 --flash-attn 1 --n-gpu-layers 999 --split-mode graph --tensor-split 0.9,1,1,1,1,1,1,1 -gr -ger --merge-qkv --cache-type-k q6_0 --cache-type-v q8_0 --k-cache-hadamard --jinja --chat-template-kwargs '{"enable_thinking": false}'

I tested --threads 1, but prompt processing is better with more threads (16 works well; going higher degrades tokens/s, though I'm not sure about this).
--threads 1 --threads-batch 32 seems better.
Result: Token: 58.0 t/s (35 to 75) | Prompt: 891.5 t/s (75 to 900)
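For a sense of scale, the reported rates can be turned into rough wall-clock times. This is back-of-the-envelope arithmetic only: seconds_for is a made-up helper, the 100k-token prompt and 1000-token reply are illustrative sizes rather than measurements, and real throughput varies with context depth (hence the wide ranges above).

```shell
#!/usr/bin/env bash
# Hypothetical helper: seconds needed to process N tokens at a given rate.
seconds_for() {
  LC_ALL=C awk -v tokens="$1" -v rate="$2" \
    'BEGIN { printf "%.1f\n", tokens / rate }'
}

seconds_for 100000 891.5   # 100k-token prompt at the reported PP rate: 112.2
seconds_for 1000 58.0      # 1000 generated tokens at the reported rate: 17.2
```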

Thanks again

@martossien

Thanks again for sharing the details of your unique rig! Huh so strange about adding CPU threads when the model should be 100% GPU offloaded unless there is still some part on CPU/RAM somehow?

Obviously use whatever gives better results for you, I don't have the hardware to check all the combinations myself!

Cheers!

@martossien

Huh so strange about adding CPU threads when the model should be 100% GPU offloaded unless there is still some part on CPU/RAM somehow?

Hei John,

Got back online my similar beast to @martossien :
GPU: 8x NVIDIA RTX 3090 (24GB each, total 192GB VRAM)
CPU: AMD EPYC 7443P (24 cores / 48 threads)
RAM: 256 GB DDR4 3200 MHz ECC

Not here to brag about it but, much to my surprise too, I cannot offload MiniMax-M2.5-IQ4_NL, not even with CTX at 196608... :( It hits OOM when trying to allocate the context, I guess.
I am launching it with:
/llms/ik_llama.cpp/build/bin/llama-server \
  --model ~/models/gguf/ubergarm/MiniMax-M2.5-IQ4_NL/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  --alias "ubergarm/MiniMax-M2.5-IQ4_NL" \
  -c 196608 \
  -ctk q8_0 -ctv q8_0 \
  -mla 3 -fa 1 -amb 512 --no-mmap \
  -ngl 99 \
  -ub 4096 \
  -b 4096 \
  --threads 1 \
  --host 0.0.0.0 \
  --port 5005 \
  --jinja \
  --temp 1.0 \
  --top_p 0.95 \
  --top_k 40 \
  --api-key VLLM_API_KEY_2026 \
  --seed 3407 \
  --chat-template-kwargs '{"reasoning_effort": "high"}' \
  -ger -sm graph --cache-ram 32768

--cache-ram 32768 --> via https://github.com/ggml-org/llama.cpp/pull/16391 since my OpenCode has to chew on huge CTX for its session-2-session "memory" transfers.
P.S. With exactly the same parameters IQ4_XS works like a charm.

Any hints would be much appreciated!

@dehnhaide

Hei good to see you!

My understanding is you can run the smaller IQ4_XS no problem, but the IQ4_NL is bigger so you OOM now?

  • IQ4_XS 114.842 GiB
  • IQ4_NL 121.386 GiB

The difference is not huge, so maybe you just need to save a little VRAM so you can fully offload it onto the GPUs?

A few ways to save VRAM here:

  1. Lower batch sizes down to -ub 2048 -b 2048 which takes a little less VRAM but likely slower PP.
  2. -khad -ctk q6_0 -ctv q8_0 will save a little more space on KV-Cache if you are okay quantizing further
  3. Worst case, you offload some layers onto CPU/RAM, but that will probably slow it down a lot.
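The size gap can be put into numbers with some quick arithmetic. This sketch assumes 8 cards with a nominal 24 GiB each and ignores driver, display and fragmentation overhead, so the real headroom is smaller; vram_headroom_gib is a hypothetical helper. Whatever is left after the weights has to hold the KV cache and the compute buffers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: nominal VRAM left after loading the weights.
# Arguments: GPU count, GiB per GPU (assumed 24), model size in GiB.
vram_headroom_gib() {
  LC_ALL=C awk -v n="$1" -v per="$2" -v model="$3" \
    'BEGIN { printf "%.3f\n", n * per - model }'
}

vram_headroom_gib 8 24 114.842   # IQ4_XS: 77.158
vram_headroom_gib 8 24 121.386   # IQ4_NL: 70.614
```

Under these assumptions IQ4_NL leaves about 6.5 GiB less in total, or roughly 0.8 GiB per card.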

Otherwise, a few thoughts:

  1. --cache-ram 32768 you could probably increase this further if you're not using the RAM for anything else
  2. -mla 3 -amb 512 this only applies to models with MLA like Kimi-K2.5/DeepSeek/GLM-5.. doesn't hurt here, but no effect

Keep us posted!

@dehnhaide

Hei good to see you!

My understanding is you can run the smaller IQ4_XS no problem, but the IQ4_NL is bigger so you OOM now?

  • IQ4_XS 114.842 GiB
    All OK!
  • IQ4_NL 121.386 GiB
    OOM
  1. Lower batch sizes down to -ub 2048 -b 2048 which takes a little less VRAM but likely slower PP.
    WOW, this did the trick! Who would have thought... so subtle and counter-intuitive for my tired brain...
  2. --cache-ram 32768 you could probably increase this further if you're not using the RAM for anything else
    Good hint too! ;)

As always, much appreciated!

A quick follow-up with better parameters:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf --alias MiniMax-M2.5-IQ4_NL --host 0.0.0.0 --port 8080 --ctx-size 231424 --no-mmap --threads 32 --threads-batch 64 --batch-size 4096 --ubatch-size 4096 --parallel 1 --flash-attn 1 --n-gpu-layers 999 --split-mode graph --tensor-split 1,1,1,1,0.9,1,1,0.88 -gr -ger --merge-qkv --cache-type-k q6_0 --cache-type-v q8_0 --k-cache-hadamard --jinja --chat-template-kwargs '{"enable_thinking": false}'


With a context of 231424, about 0.6 GB of VRAM stays free on each GPU (more stable).
Device 4 drives my screens, hence 0.9; device 7 ends up using more VRAM at the end, hence 0.88.
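To see what the --tensor-split ratios mean per card, they can be normalized into percentages. This is a sketch under the assumption that the values are treated as relative proportions (as llama.cpp documents for --tensor-split); how exactly --split-mode graph distributes tensors may differ, and tensor_split_shares is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a --tensor-split list into per-device
# percentages, assuming the values are relative proportions.
tensor_split_shares() {
  LC_ALL=C awk -v ts="$1" 'BEGIN {
    n = split(ts, w, ",")
    total = 0
    for (i = 1; i <= n; i++) total += w[i]
    for (i = 1; i <= n; i++)
      printf "device %d: %.1f%%\n", i - 1, 100 * w[i] / total
  }'
}

tensor_split_shares 1,1,1,1,0.9,1,1,0.88
```

With these values, devices 4 and 7 come out at about 11.6% and 11.3% of the model, against 12.9% for the other six cards.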

@martossien

Wow you've packed those GPU VRAM full! Nice job dialing in a complex rig!

Does ik support -muge with this MiniMax model? (i'm not sure, haven't checked).

Another tip: I realize now that k-cache quantization is likely more sensitive than v-cache, so you might get slightly better long-context performance with -ctk q8_0 -ctv q6_0. I'm pretty sure -khad only applies to the k-cache though, so it's not perfect.
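A quick way to compare the memory cost of the cache combinations mentioned in this thread is to add up the bits per cached value. The figures below are assumptions based on the usual ggml block layouts (32 values plus fp16 scale/offset per block); in particular, the 6.5 bits for q6_0 is an assumption about ik_llama.cpp's format and should be double-checked. The sketch also assumes K and V store the same number of values per token. bits_per_value and kv_bits are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Assumed effective bits per value for each cache type.
bits_per_value() {
  case "$1" in
    f16)  echo 16 ;;
    q8_0) echo 8.5 ;;
    q6_0) echo 6.5 ;;   # assumption, see the note above
    q5_1) echo 6 ;;
    q4_0) echo 4.5 ;;
    *)    echo "unknown cache type: $1" >&2; return 1 ;;
  esac
}

# Bits needed to store one K value plus one V value.
kv_bits() {
  local k v
  k=$(bits_per_value "$1") || return 1
  v=$(bits_per_value "$2") || return 1
  LC_ALL=C awk -v k="$k" -v v="$v" 'BEGIN { printf "%.1f\n", k + v }'
}

kv_bits q5_1 q5_1   # 12.0
kv_bits q6_0 q8_0   # 15.0
kv_bits q8_0 q6_0   # 15.0
kv_bits q8_0 q8_0   # 17.0
```

Under these assumptions, moving the extra precision from V to K (-ctk q8_0 -ctv q6_0 instead of -ctk q6_0 -ctv q8_0) costs the same amount of VRAM.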

Otherwise have you tuned all your GPUs with LACT yet? You can adjust the temperature vs fan profile, apply a slight undervolt and overclock as well. But tuning might make the system unstable and require reboots so can be tedious if you get too aggressive. In my experience it allows the GPU to run at the full fixed clock speed and no longer oscillates wildly due to thermal/current limits.

You've come a long way in a short time! Cheers!

A quick follow-up with better parameters:
~/ik_llama.cpp/build/bin/llama-server ...

Have you tried mratsim's FP8 version in vLLM? I am asking because my setup is 99% similar to yours (newer gen3 AMD + 256GB RAM), and the speed I got with vLLM, specifically for parallel requests, is still unmatched in the *llama world. And trust me, I hold IK's work and John's quants in very high regard. But in vLLM / SGLang, speed is a different animal, and you're running FP8 (even if "translated" to SM86 kernels).

@dehnhaide

Fair, when you have this much VRAM it is always good to check vLLM and that ecosystem as well, especially for batched/concurrent parallel multi-user server setups!

I wish we had an easier way to compare apples-apples the PPL/KLD values though haha..

I wish we had an easier way to compare apples-apples the PPL/KLD values though haha..

Hahaha! Your fine hint at my implying FP8... Dead right you are! If only you knew how many times I have thought about that... Even more so because, even if mratsim's is a SOTA quant, the model still poses issues (plurals issues), and having no other means to properly quantify PPL/KLD makes the paranoia go through the roof...

Thanks a lot for the tips.

For the cache settings, I’m going to test your suggestion with -ctk q8_0 -ctv q6_0, because I also suspect k-cache is a bit more sensitive on longer contexts.

About LACT: I did experiment with that direction a bit, but my setup is made of mixed GPU brands/models, so tuning everything cleanly is more complicated. Also, in most of my workloads the GPUs are not fully saturated all the time, so the gain is often smaller than the amount of testing / rebooting / stability work required. I am actually thinking about a small side project to make that kind of tuning easier, but time is limited, so we’ll see.

Regarding vLLM / SGLang: yes, we did test them too. For multi-user serving, batching and parallel requests, I fully agree that they are very strong. But on our side, for local quantized deployments and for squeezing the best balance out of GPU power / VRAM / CPU / RAM on RTX 3090-class hardware, ik_llama.cpp has been easier to tune and generally more efficient for the kind of setups we are running.

So I would say:
vLLM / SGLang: very strong for concurrent serving and modern serving stacks
ik_llama.cpp: better fit for our local quantized rigs when the goal is maximum hardware efficiency and fine-grained control

And yes, I would also love an easier apples-to-apples way to compare quality between all these formats and quantizations. That part is still frustrating.
For me, the final benchmark is actually my users: a small group of developers who have direct access to my machine and give me feedback on the models and serving stack.

Right now, even if they do not love waiting their turn with --parallel 1, they still prefer:

  1. quality
  2. speed

So in practice, on my side, GLM-4.7 is still the favorite when quality matters most, and MiniMax-M2.5 comes next for simpler work.
But for my actual users, so far, the models we served through vLLM or SGLang were generally less preferred than the ones we tuned and served with ik_llama.cpp.

Of course, this is only feedback from my specific setup and users, not a universal conclusion.
