# GLM-4.6-REAP-218B-A32B W4A16 (AutoRound Quantization)
This is a 4-bit quantized version of cerebras/GLM-4.6-REAP-218B-A32B using Intel AutoRound.
## Model Details
| Property | Value |
|---|---|
| Base Model | cerebras/GLM-4.6-REAP-218B-A32B |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Method | Intel AutoRound |
| Format | auto_round (compatible with vLLM, SGLang) |
| Architecture | GLM-4 Mixture of Experts |
| Total Parameters | 218B |
| Active Parameters | 32B (A32B) |
| Original Size | ~436 GB (BF16) |
| Quantized Size | ~116 GB |
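
The size figures follow directly from the parameter count. A back-of-the-envelope check (a rough sketch; the gap between 4-bit weights alone and the shipped size is quantization scales, zero points, and layers kept in higher precision):

```python
# Rough size arithmetic behind the table above.
params = 218e9                  # total parameters
bf16_gb = params * 2 / 1e9      # 2 bytes/param -> ~436 GB
w4_gb = params * 0.5 / 1e9      # 4 bits/param  -> ~109 GB for weights alone
print(f"BF16: ~{bf16_gb:.0f} GB, 4-bit weights: ~{w4_gb:.0f} GB")
# Scales/zero points and unquantized layers account for the
# remaining few GB up to the ~116 GB checkpoint on disk.
```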
## Performance Benchmarks
Tested on 8x NVIDIA RTX 3090 (24GB each) with vLLM:
### Speed Test (~20k context)
| Metric | Value |
|---|---|
| Prompt Tokens | ~21,178 |
| Completion Tokens | 393 |
| Time to First Token (TTFT) | 23.82s |
| Total Generation Time | 36.45s |
| Prefill Speed | ~889 tok/s |
| Generation Speed | ~31 tok/s |
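
The two speed figures are derived from the raw measurements in the table; a quick sanity check:

```python
# Derive the reported speeds from the raw measurements above.
prompt_tokens = 21_178
completion_tokens = 393
ttft_s = 23.82                  # time to first token
total_s = 36.45                 # end-to-end generation time

prefill = prompt_tokens / ttft_s                   # ~889 tok/s
decode = completion_tokens / (total_s - ttft_s)    # ~31 tok/s
print(f"prefill ~{prefill:.0f} tok/s, decode ~{decode:.0f} tok/s")
```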
### Coherence Test
The model correctly recalled all embedded facts from a long context:
- Character name: Aurelia
- Product code: ZX-42-ALPHA
- Transaction amount: 7,530,000 credits
- Scientist name: Dr. Linh Tran
- Date: 2025-12-15
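
A recall probe along these lines can be scripted against the running server (see the Usage section below for the server command). The sketch below is illustrative, not the exact harness used for the result above; the endpoint, model name, and filler text are assumptions:

```python
# Minimal long-context recall probe (hypothetical harness).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

fact = "The product code is ZX-42-ALPHA."
filler = "The archive room smelled of dust and old paper. " * 2500  # ~20k+ tokens of noise
prompt = filler[: len(filler) // 2] + fact + filler[len(filler) // 2 :]

resp = client.chat.completions.create(
    model="0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound",  # whatever name vLLM registered
    messages=[{"role": "user", "content": prompt + "\n\nWhat is the product code?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)  # should contain "ZX-42-ALPHA"
```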
## Usage

### vLLM (Recommended)
```bash
vllm serve 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 4 --pipeline-parallel-size 2 \
  --quantization auto-round \
  --kv-cache-dtype fp8 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.88 \
  --cpu-offload-gb 4 \
  --block-size 32 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 8192 \
  --swap-space 32 \
  --enable-expert-parallel \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --disable-custom-all-reduce \
  --disable-log-requests \
  --trust-remote-code
```
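
Once the server is up, it exposes an OpenAI-compatible API. A minimal smoke test (a sketch; the model name must match whatever path or repo id you passed to `vllm serve`):

```python
# Minimal request against the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound",
    messages=[{"role": "user", "content": "Summarize MoE inference in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)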
### SGLang

```bash
python3 -m sglang.launch_server \
  --model-path 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
  --tp-size 8 \
  --trust-remote-code
```
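
SGLang also serves an OpenAI-compatible API (port 30000 by default). A quick liveness check, assuming default settings:

```python
# Confirm the SGLang server is up and serving the model (default port 30000).
import requests

r = requests.get("http://localhost:30000/v1/models", timeout=10)
print(r.json())  # should list the GLM-4.6-REAP model id
```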
## Hardware Requirements
| Configuration | VRAM Required | Notes |
|---|---|---|
| 8x 24GB GPUs | ~192GB total | TP=4, PP=2, recommended |
| 4x 48GB GPUs | ~192GB total | TP=4, no PP needed |
| 8x 48GB GPUs | ~384GB total | Full speed, larger batches |
Minimum: 8x 24GB GPUs (RTX 3090/4090) or equivalent ~192GB total VRAM.
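
To see why ~192GB is the floor, split the quantized weights across the eight 24GB cards and check what is left for KV cache under the `--gpu-memory-utilization 0.88` setting above. A rough sketch that ignores activation buffers and uneven expert placement:

```python
# Rough per-GPU budget for the 8x 24GB (TP=4, PP=2) configuration.
weights_gb = 116           # quantized checkpoint size
num_gpus = 8
vram_per_gpu = 24
utilization = 0.88         # matches --gpu-memory-utilization above

weights_per_gpu = weights_gb / num_gpus       # ~14.5 GB of weights per GPU
budget = vram_per_gpu * utilization           # ~21.1 GB usable per GPU
kv_headroom = budget - weights_per_gpu        # ~6.6 GB for KV cache and buffers
print(f"{weights_per_gpu:.1f} GB weights/GPU, ~{kv_headroom:.1f} GB KV-cache headroom")
```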
## Quantization Details

### Method
Quantized using Intel AutoRound with the following configuration:
- Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration samples: 64
- Sequence length: 512
- Batch size: 1
### Quantization Script

```python
#!/usr/bin/env python3
"""
GLM-4.6-REAP-218B W4A16 Quantization using Intel AutoRound
Produces SGLang/vLLM-compatible 4-bit quantized model.
"""
import logging
from datetime import datetime
from pathlib import Path

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

MODEL_ID = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B"  # or "cerebras/GLM-4.6-REAP-218B-A32B"
OUTPUT_DIR = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound"


def main():
    logger.info("=" * 80)
    logger.info("GLM-4.6-REAP-218B W4A16 Quantization (Intel AutoRound)")
    logger.info("=" * 80)
    start = datetime.now()

    # Import lazily so the log header appears before the heavy import.
    from auto_round import AutoRound

    logger.info(f"Model: {MODEL_ID}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("Scheme: W4A16 (4-bit weights, 16-bit activations)")

    Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

    logger.info("Initializing AutoRound (CPU mode)...")
    autoround = AutoRound(
        MODEL_ID,
        scheme="W4A16",
        device="cpu",
        device_map="cpu",
        trust_remote_code=True,
        batch_size=1,
        seqlen=512,
        nsamples=64,
    )

    logger.info("Starting quantization...")
    autoround.quantize_and_save(OUTPUT_DIR, format="auto_round")

    elapsed = datetime.now() - start
    logger.info("=" * 80)
    logger.info(f"Done in {elapsed}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("=" * 80)


if __name__ == "__main__":
    main()
```
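
After the script finishes, it is worth confirming the checkpoint actually carries AutoRound metadata before pointing an inference server at it. A small sanity check (a sketch; exact field names can vary across auto-round versions):

```python
# Sanity-check the saved checkpoint's quantization metadata.
import json
from pathlib import Path

OUTPUT_DIR = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound"

cfg = json.loads((Path(OUTPUT_DIR) / "config.json").read_text())
qcfg = cfg.get("quantization_config", {})
print("quant_method:", qcfg.get("quant_method"))  # expect an auto-round marker
print("bits:", qcfg.get("bits"))                  # expect 4
```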
## About the Base Model

GLM-4.6-REAP-218B-A32B is Cerebras's expert-pruned variant of the GLM-4.6 Mixture of Experts (MoE) model, produced with REAP (Router-weighted Expert Activation Pruning), with:
- 218 billion total parameters
- 32 billion active parameters per forward pass
- Strong performance on reasoning and long-context tasks
- Native support for 128k+ context windows
For more details, see the base model card.
## Limitations
- Quantization may slightly reduce quality compared to BF16
- Requires significant VRAM (~192GB minimum across GPUs)
- Best results with tensor parallelism across 4-8 GPUs
## License
This quantized model inherits the license from the base model. See cerebras/GLM-4.6-REAP-218B-A32B for licensing details.
## Acknowledgments
- Cerebras for the base GLM-4.6-REAP model
- Intel for the AutoRound quantization toolkit
- vLLM and SGLang teams for inference support
## Citation
If you use this model, please cite the original:
```bibtex
@misc{glm46reap,
  title={GLM-4.6-REAP-218B-A32B},
  author={Cerebras},
  year={2025},
  url={https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B}
}
```