Nice!
I've successfully run this model on a single RTX Pro 6000 Blackwell card (96 GB) with a 274,000-token context. I suspect I could push the context size even higher using the vLLM option --kv-cache-dtype fp8, but I haven't tried this yet. In any case, this model is working well for me. Thanks!
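For reference, the equivalent setup through vLLM's Python API looks roughly like this. The model path is just a placeholder for this repo's quant, the memory-utilization value is illustrative rather than my exact setting, and the commented-out line is the fp8 KV-cache option I haven't tried yet:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/this-nvfp4-quant",  # placeholder: the NVFP4 quant from this repo
    max_model_len=274_000,             # the context length that fit on one 96 GB card
    gpu_memory_utilization=0.95,       # illustrative value, not necessarily what I used
    # kv_cache_dtype="fp8",            # the untried option that should free up more KV-cache room
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```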
I saw you've asked for requests for other models to quantize using NVFP4. So... my request is an NVFP4 quant of Qwen/Qwen3-235B-A22B-Thinking-2507.
I'm glad this quant was helpful.
I can run a quant of that Qwen3 model, but it appears someone else has already made one: https://huggingface.co/NVFP4/Qwen3-235B-A22B-Thinking-2507-FP4
I should have mentioned that I've tried that quant from NVFP4, but so far I've been unable to load it. I have two of the aforementioned Blackwell GPUs, so I think I have adequate resources. I've been able to run RedHatAI/Qwen3-235B-A22B-Instruct-2507-NVFP4, which I'd guess has similar resource requirements.
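In case it helps with debugging, this is roughly how I've been loading the RedHatAI quant across the two cards (a sketch from memory; the context length and memory-utilization values here are placeholders, not my exact settings):

```python
from vllm import LLM

llm = LLM(
    model="RedHatAI/Qwen3-235B-A22B-Instruct-2507-NVFP4",
    tensor_parallel_size=2,       # split across the two 96 GB Blackwell cards
    max_model_len=32_768,         # placeholder context length
    gpu_memory_utilization=0.90,  # placeholder
)
```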
Ah, well in that case I'll add it to the list, run it, and see if it behaves any differently from that other quant. From the tags it looks like they're using ModelOpt, which I think is targeted at TensorRT-LLM. All of my quants are made with llm-compressor.
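For what it's worth, the recipes behind my quants look roughly like this (a simplified sketch rather than the exact script, assuming llm-compressor's NVFP4 preset scheme; the calibration dataset and sample counts are illustrative, and a model this size realistically needs CPU offload or more GPUs just to calibrate):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-235B-A22B-Thinking-2507"

# Loading a 235B MoE in bf16 needs far more than two GPUs' worth of memory;
# in practice this step relies on CPU offload or a larger node.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers, keeping the LM head (and, for MoE models,
# usually the router/gate layers too) in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
)

# Short calibration pass to set the activation scales.
oneshot(
    model=model,
    dataset="open_platypus",        # illustrative calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```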
Thanks!