Qwen3.5-35B-A3B-NVFP4

This is a quantized version of Qwen/Qwen3.5-35B-A3B using the NVFP4 quantization scheme.
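
For readers unfamiliar with the format, here is a rough sketch of the idea behind NVFP4: weights are stored as 4-bit E2M1 floats in blocks of 16, each block carrying its own scale. The sketch is simplified and is not the LLM Compressor implementation. It keeps the block scale in full precision, whereas real NVFP4 stores it as FP8 together with a per-tensor scale, and the function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive values


def quantize_block(block):
    """Quantize then dequantize one block (simulated, not bit-packed)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Assumption: scale kept as a Python float; real NVFP4 stores it in FP8.
    scale = amax / E2M1_GRID[-1]
    out = []
    for x in block:
        # Nearest grid point; ties go to the smaller magnitude here,
        # which may differ from hardware rounding.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def fake_quantize(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block(values[start:start + BLOCK_SIZE]))
    return out
```

With only eight magnitudes per block, a single outlier stretches the scale and coarsens every other value in that block, which is why the calibration described below matters.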

Running this model currently requires a nightly build of vLLM.
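
A minimal offline-inference sketch, assuming a nightly vLLM build is installed and a suitable GPU is available. The repository ID is the one this model card is published under. The `generate` helper and the sampling values are illustrative, not recommended settings.

```python
MODEL_ID = "Sehyo/Qwen3.5-35B-A3B-NVFP4"


def generate(prompt, max_tokens=128):
    """Load the model with vLLM and return one completion for `prompt`."""
    # Imported lazily so this file can be imported without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL_ID)
    # Illustrative sampling values (assumption, not from the model card).
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    print(generate("Explain mixture-of-experts models in one sentence."))
```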

Changelog

2026-03-02: Re-quantized with MTP (multi-token prediction) weights preserved, enabling speculative decoding with vLLM.
2026-02-25: Initial upload.

Calibration

Samples: 512 (256 from each dataset)
Datasets:
- HuggingFaceH4/ultrachat_200k (train_sft split)
- nvidia/Nemotron-Post-Training-Dataset-v2 (chat split)
Max sequence length: 4096
All experts calibrated: moe_calibrate_all_experts=True
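
The sample mix above can be sketched as follows. This is an illustration only: the card does not say how the 256 samples per dataset were chosen, so the random sampling, the seed, the final shuffle, and both helper names are assumptions. Plain Python lists stand in for the two datasets.

```python
import random

SAMPLES_PER_DATASET = 256
MAX_SEQUENCE_LENGTH = 4096


def build_calibration_set(sources, per_source=SAMPLES_PER_DATASET, seed=0):
    """Draw `per_source` rows from each source and return them shuffled.

    `sources` maps a dataset name to a list of rows.
    """
    rng = random.Random(seed)  # assumption: seeded sampling for repeatability
    mixed = []
    for name, rows in sources.items():
        if len(rows) < per_source:
            raise ValueError(f"{name} has fewer than {per_source} rows")
        mixed.extend((name, row) for row in rng.sample(rows, per_source))
    rng.shuffle(mixed)
    return mixed


def truncate(token_ids, max_len=MAX_SEQUENCE_LENGTH):
    """Cut a tokenized sample down to the maximum sequence length."""
    return token_ids[:max_len]
```

Given two sources with at least 256 rows each, `build_calibration_set` returns 512 `(dataset_name, row)` pairs, matching the sample count listed above.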

Creation

This model was created with vLLM's LLM Compressor, with Qwen3.5 MoE support added via PR #2383. The PR adds a custom CalibrationQwen3MoeSparseMoeBlock that routes calibration data to all experts during quantization, so that every expert receives calibration data for accurate NVFP4 quantization, not only the experts the router happens to select.
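
The idea of calibrating all experts can be illustrated with a toy example. This is a conceptual sketch, not the code from the PR: tokens and expert outputs are plain floats, router scores are assumed to be precomputed, and every name is hypothetical. The point it shows is that each expert sees the input during calibration while the layer output is still formed only from the top-k experts.

```python
def top_k_route(router_scores, k):
    """Return {expert_index: normalized weight} for the k highest scores."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    top = order[:k]
    total = sum(router_scores[i] for i in top)
    return {i: router_scores[i] / total for i in top}


def moe_forward(token, router_scores, experts, k, calibrate_all_experts=False):
    """Run one token through a toy MoE layer.

    Returns the layer output and the list of experts that processed the
    token (the ones that would collect calibration statistics).
    """
    weights = top_k_route(router_scores, k)
    seen = []
    output = 0.0
    for idx, expert in enumerate(experts):
        if idx in weights or calibrate_all_experts:
            y = expert(token)  # the expert observes this activation
            seen.append(idx)
            if idx in weights:
                # Only routed experts contribute to the output.
                output += weights[idx] * y
    return output, seen
```

With four experts and k=2, normal routing lets two experts see a given token, while `calibrate_all_experts=True` lets all four see it and leaves the output unchanged.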

Model tree for Sehyo/Qwen3.5-35B-A3B-NVFP4

Base model: Qwen/Qwen3.5-35B-A3B-Base
Finetuned: Qwen/Qwen3.5-35B-A3B
Quantized: this model