Multi-token prediction not working

#6
by abhinavkulkarni - opened

Hi,

I'm running this model using vLLM as follows:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    command: >
      --model cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit
      --max-model-len 250000
      --reasoning-parser deepseek_r1
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --gpu-memory-utilization 0.85
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - /shared/huggingface:/root/.cache/huggingface
    # environment:
    #   - VLLM_USE_V1=0
    ports:
      - "8000:8000"
    ipc: host
    restart: unless-stopped
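
The container starts fine and serves standard OpenAI-compatible requests, for example (the model name simply mirrors the --model flag above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'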

I see logs as follows:

vllm | (APIServer pid=1) INFO 11-06 04:20:48 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.8 tokens/s, Running: 14 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 0.0%
vllm | (APIServer pid=1) INFO 11-06 04:20:48 [metrics.py:96] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 41.60 tokens/s, Accepted: 0 tokens, Drafted: 416 tokens, Per-position acceptance rate: 0.000, 0.000, Avg Draft acceptance rate: 0.0%

The average draft acceptance rate is always 0%. From the logs it is clear that the draft model is proposing tokens (416 drafted), but the target model is not accepting any of them (0 accepted). Why is this the case?

Thanks!
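
Edit: cross-checking the Prometheus endpoint shows the same counters. (Exact metric names vary across vLLM versions, so grepping for the spec_decode prefix is the safest way to find them.)

# Dump the speculative-decoding counters exposed at /metrics
curl -s http://localhost:8000/metrics | grep spec_decode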

Thank you for trying my model. I have released an update today with better accuracy and MTP support. Please redownload the weights :)
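
One way to pick up the new snapshot with the compose setup above (this assumes the standard hub cache layout, models--<org>--<repo>, under the mounted /shared/huggingface volume):

# Remove the stale cached snapshot so the next start pulls the updated weights
rm -rf /shared/huggingface/hub/models--cpatonn--Qwen3-Next-80B-A3B-Thinking-AWQ-4bit

# Restart the container; vLLM re-downloads the model on startup
docker compose restart vllm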

abhinavkulkarni changed discussion status to closed
