Multi-token prediction not working #6
opened by abhinavkulkarni
Hi,
I'm running this model using vLLM as follows:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    command: >
      --model cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit
      --max-model-len 250000
      --reasoning-parser deepseek_r1
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --gpu-memory-utilization 0.85
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - /shared/huggingface:/root/.cache/huggingface
    # environment:
    #   - VLLM_USE_V1=0
    ports:
      - "8000:8000"
    ipc: host
    restart: unless-stopped
```
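For reference, traffic like that in the logs below can be generated with a minimal client sketch against vLLM's OpenAI-compatible API (assuming the server above is reachable on localhost:8000; the prompt and parameters here are illustrative, not from the original post):

```python
# Minimal client sketch: point the OpenAI SDK at the vLLM server started
# by the compose file above. The prompt and max_tokens are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit",
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```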
I see logs as follows:
```
vllm | (APIServer pid=1) INFO 11-06 04:20:48 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.8 tokens/s, Running: 14 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 0.0%
vllm | (APIServer pid=1) INFO 11-06 04:20:48 [metrics.py:96] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 41.60 tokens/s, Accepted: 0 tokens, Drafted: 416 tokens, Per-position acceptance rate: 0.000, 0.000, Avg Draft acceptance rate: 0.0%
```
The average draft acceptance rate is always 0%. From the logs it is clear that the speculative decoder is drafting tokens (416 drafted so far), but the main model accepts none of them: a mean acceptance length of 1.00 means each step emits only the target model's own token and zero draft tokens. Why is this the case?
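To illustrate why 0% is suspicious, here is a toy sketch of greedy speculative-decoding verification (illustrative only, not vLLM's actual implementation): the target model accepts a drafted token only if it matches the token the target itself would produce, so a 0% rate means the MTP drafts never agree with the target model.

```python
# Toy greedy verification (illustrative, not vLLM internals): drafts are
# accepted left-to-right until the first token that disagrees with what
# the target model would have generated on its own.
def verify_drafts(target_next_tokens, drafted_tokens):
    accepted = []
    for target_tok, draft_tok in zip(target_next_tokens, drafted_tokens):
        if draft_tok != target_tok:
            break  # first mismatch rejects this draft and all later ones
        accepted.append(draft_tok)
    return accepted

# Target would produce [42, 7]; drafts propose [42, 9] -> one accepted.
print(verify_drafts([42, 7], [42, 9]))  # [42]
```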
Thanks!
Thank you for trying my model. I released an update today with better accuracy and MTP support. Please re-download the weights :)
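If the old snapshot is cached (the compose file above mounts /shared/huggingface), one way to force a clean re-download is via huggingface_hub; a sketch, assuming the package is installed on the host:

```python
# Sketch: fetch the updated weights, ignoring the stale cached snapshot.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit",
    force_download=True,  # re-fetch files even if they exist in the cache
)
```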
abhinavkulkarni changed discussion status to closed