recommended vllm tool call parser?

#17
by lightenup

Currently the vllm launch instructions are:

vllm serve Qwen/Qwen3-Coder-Next --port 8000 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder

According to https://github.com/vllm-project/vllm/pull/25028, qwen3_xml is the more advanced tool call parser for Qwen3 models and is also relevant for the Coder models (see the comments by contributor Zhikaiiii). Should the instructions in the model card be updated, or is there another reason why the older, non-streaming qwen3_coder tool call parser is recommended? I also see that the older qwen3_coder tool parser is included in the model repository.

(I've already tried out qwen3_xml and encountered no issues so far.)
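If the model card were updated, I assume only the parser flag would change, so the launch command would look something like this (untested beyond my own setup, keeping the same port and tensor-parallel settings as the current instructions):

vllm serve Qwen/Qwen3-Coder-Next --port 8000 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_xml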

I encountered several issues when using qwen3_coder

I actually switched to llama.cpp (see https://huggingface.co/Qwen/Qwen3-Coder-Next/discussions/15 ) because vLLM did not appear to work correctly. I have seen strange token confusion (see: https://huggingface.co/unsloth/Qwen3-Coder-Next-FP8-Dynamic/discussions/2#6985a88282852b1ddaa4fb77 ) and had issues with streaming tool calls under the Responses API (required for newer Codex versions).

(I haven't seen any token confusion with the latest llama.cpp yet; the only issue is that sometimes/often the model declares it will make a tool call but then generates an EOS token. The tool call is only produced after asking the model to continue.)

The symptom: with vLLM using the default qwen3_coder parser, the server produces an infinite stream of "!!!!!!!!!!!!!!" accompanied by next_token_id = 0.
Short inputs work fine, but this issue occurs with long inputs that include a tool call.
My solution was to switch --tool-call-parser to qwen3_xml, and the problem disappeared.
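For anyone comparing the two parsers, here is a minimal streaming tool-call request against the local OpenAI-compatible endpoint. This is only a sketch: it assumes the server from the model-card command is running on port 8000 and uses a made-up get_weather tool, and it will not reproduce the long-input failure, but it shows whether streamed tool-call deltas come back cleanly under a given --tool-call-parser setting:

# stream a request that should trigger a (hypothetical) get_weather tool call
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-Next",
    "stream": true,
    "messages": [{"role": "user", "content": "What is the weather in Paris right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

With a working parser, the streamed chunks should contain delta.tool_calls entries for get_weather rather than raw tool-call markup or garbage tokens in the content field.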
