Multi-token prediction not working

#6
by abhinavkulkarni - opened

Hi,

I'm running this model using vLLM as follows:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    command: >
      --model cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit
      --max-model-len 250000
      --reasoning-parser deepseek_r1
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --gpu-memory-utilization 0.85
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - /shared/huggingface:/root/.cache/huggingface
    # environment:
    #   - VLLM_USE_V1=0
    ports:
      - "8000:8000"
    ipc: host
    restart: unless-stopped
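
The container starts fine and serves standard OpenAI-compatible requests, for example (the model name simply mirrors the --model flag above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'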

I see logs as follows:

vllm | (APIServer pid=1) INFO 11-06 04:20:48 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.8 tokens/s, Running: 14 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 0.0%
vllm | (APIServer pid=1) INFO 11-06 04:20:48 [metrics.py:96] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 41.60 tokens/s, Accepted: 0 tokens, Drafted: 416 tokens, Per-position acceptance rate: 0.000, 0.000, Avg Draft acceptance rate: 0.0%

The average draft acceptance rate is always 0%. From the logs it is clear that the draft model is proposing tokens (416 drafted), but the target model is not accepting any of them (0 accepted). Why is this the case?

Thanks!
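
Edit: cross-checking the Prometheus endpoint shows the same counters. (Exact metric names vary across vLLM versions, so grepping for the spec_decode prefix is the safest way to find them.)

# Dump the speculative-decoding counters exposed at /metrics
curl -s http://localhost:8000/metrics | grep spec_decode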

Thank you for trying my model. I have released an update today with better accuracy and MTP support. Please redownload the weights :)
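
One way to pick up the new snapshot with the compose setup above (this assumes the standard hub cache layout, models--<org>--<repo>, under the mounted /shared/huggingface volume):

# Remove the stale cached snapshot so the next start pulls the updated weights
rm -rf /shared/huggingface/hub/models--cpatonn--Qwen3-Next-80B-A3B-Thinking-AWQ-4bit

# Restart the container; vLLM re-downloads the model on startup
docker compose restart vllm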

abhinavkulkarni changed discussion status to closed
