diffuse-cpp: C++ inference engine for Dream on CPU (GGUF format, Q4_K_M quantization)

#4
by Carmenest - opened

Hi Dream team! We have built CPU inference support for Dream-v0-Instruct-7B using diffuse-cpp, a C++ inference engine for diffusion language models built on GGML.

Pre-quantized GGUF models

Available at diffuse-cpp/Dream-v0-Instruct-7B-GGUF:

| File | Type | Size |
|---|---|---|
| dream-7b-f16.gguf | F16 | 15.2 GB |
| dream-7b-q8_0.gguf | Q8_0 | 8.6 GB |
| dream-7b-q4km.gguf | Q4_K_M | 5.3 GB |

Performance (Q4_K_M, entropy_exit + inter-step cache, 12 threads)

| Prompt | tok/s | Steps | vs llama.cpp |
|---|---|---|---|
| Capital of France? | 21.6 | 2 | 2.5x |
| 15 x 23? | 21.6 | 2 | 2.5x |
| Translate to French | 14.3 | 6 | 1.7x |
| Python is_prime() | 8.2 | 7 | 1.0x |
| **Average** | 11.6 | | 1.4x |

Dream excels at math and code prompts: it correctly solves 15 × 23 = 345 in just 2 denoising steps at 21.6 tok/s.

Key features

  • entropy_exit: adaptive scheduler that exits early when the model is confident (2-7 steps for easy prompts vs 16 for hard ones)
  • Inter-step KV cache: reuses K,V tensors between denoising steps (1.6x average speedup)
  • Full GQA support: 28 query / 4 KV heads handled natively
  • QKV biases: preserved at F32 in all quantizations
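To illustrate the entropy_exit idea above, here is a minimal C++ sketch of confidence-based early exit: average the Shannon entropy of the model's per-token distributions after each denoising step and stop as soon as it drops below a threshold. The function names and the threshold are illustrative assumptions, not diffuse-cpp's actual API.

```cpp
#include <cmath>
#include <vector>

// Mean Shannon entropy (in nats) over per-token probability distributions.
// Low mean entropy means the model is confident about every masked position.
static double mean_entropy(const std::vector<std::vector<double>> &probs) {
    double total = 0.0;
    for (const auto &dist : probs) {
        double h = 0.0;
        for (double p : dist)
            if (p > 0.0) h -= p * std::log(p);
        total += h;
    }
    return probs.empty() ? 0.0 : total / probs.size();
}

// Illustrative scheduler loop: run up to max_steps denoising steps,
// but exit early once confidence crosses the threshold. Easy prompts
// terminate in a few steps; hard ones run the full schedule.
static int run_denoising(int max_steps, double exit_threshold) {
    int step = 0;
    for (; step < max_steps; ++step) {
        // ... one real denoising step would go here, producing `probs` ...
        std::vector<std::vector<double>> probs = {
            {0.97, 0.01, 0.01, 0.01}};  // placeholder: a confident distribution
        if (mean_entropy(probs) < exit_threshold) { ++step; break; }
    }
    return step;  // number of steps actually taken
}
```

With a confident distribution like the placeholder above, the loop exits after a single step, matching the 2-step behaviour reported for the easy prompts in the table.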

Comparison with LLaDA-8B

We also support LLaDA-8B. The two models are complementary:

  • Dream excels at math and code (21.6 tok/s)
  • LLaDA excels at translation (27.7 tok/s)

Links

Thank you for creating Dream — the GQA architecture and autoregressive logit shift are elegant design choices that translate well to CPU inference!
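For readers curious about the GQA layout mentioned above, the query-to-KV head mapping can be sketched as follows (an illustrative sketch, not diffuse-cpp's actual code): with 28 query heads and 4 KV heads, each group of 28 / 4 = 7 consecutive query heads shares one K/V head.

```cpp
// Grouped-query attention head mapping for Dream-7B:
// 28 query heads share 4 KV heads, so each KV head serves
// a group of 28 / 4 = 7 consecutive query heads.
constexpr int n_head_q  = 28;
constexpr int n_head_kv = 4;

// Which KV head a given query head attends with.
constexpr int kv_head_for(int q_head) {
    return q_head / (n_head_q / n_head_kv);
}
```

So query heads 0..6 attend against KV head 0, heads 7..13 against KV head 1, and so on; the KV cache only needs to store 4 heads' worth of K/V tensors per layer.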

Dream Org

That's so cool, thanks for your efforts on building this!

jiacheng-ye, thank you so much!

I'll be uploading more improvements soon!
