diffuse-cpp: C++ inference engine for Dream on CPU (GGUF format, Q4_K_M quantization)
#4
by Carmenest - opened
Hi Dream team! We have built CPU inference support for Dream-v0-Instruct-7B using diffuse-cpp, a C++ inference engine for diffusion language models built on GGML.
Pre-quantized GGUF models
Available at diffuse-cpp/Dream-v0-Instruct-7B-GGUF:
| File | Type | Size |
|---|---|---|
| dream-7b-f16.gguf | F16 | 15.2 GB |
| dream-7b-q8_0.gguf | Q8_0 | 8.6 GB |
| dream-7b-q4km.gguf | Q4_K_M | 5.3 GB |
Performance (Q4_K_M, entropy_exit + inter-step cache, 12 threads)
| Prompt | tok/s | Steps | vs llama.cpp |
|---|---|---|---|
| Capital of France? | 21.6 | 2 | 2.5x |
| 15 x 23? | 21.6 | 2 | 2.5x |
| Translate to French | 14.3 | 6 | 1.7x |
| Python is_prime() | 8.2 | 7 | 1.0x |
| Average | 11.6 | – | 1.4x |
Dream excels at math and code prompts, correctly solving 15 × 23 = 345 in just 2 denoising steps at 21.6 tok/s.
Key features
- entropy_exit: adaptive scheduler that exits early when the model is confident (2-7 steps for easy prompts vs 16 for hard ones)
- Inter-step KV cache: reuses K,V tensors between denoising steps (1.6x average speedup)
- Full GQA support: 28 query / 4 KV heads handled natively
- QKV biases: preserved at F32 in all quantizations
Comparison with LLaDA-8B
We also support LLaDA-8B. The two models are complementary:
- Dream excels at math and code (21.6 tok/s)
- LLaDA excels at translation (27.7 tok/s)
Links
- Engine: github.com/iafiscal1212/diffuse-cpp
- Paper: doi.org/10.5281/zenodo.19119813
- GGUF models: huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF
Thank you for creating Dream — the GQA architecture and autoregressive logit shift are elegant design choices that translate well to CPU inference!
That's so cool, thanks for your efforts on building this!
jiacheng-ye, thank you so much!
More improvements coming soon!