| --- |
| tags: |
| - fp4 |
| - vllm |
| language: |
| - en |
| - de |
| - fr |
| - it |
| - pt |
| - hi |
| - es |
| - th |
| pipeline_tag: text-generation |
| license: apache-2.0 |
| base_model: Qwen/Qwen3-VL-235B-A22B-Instruct |
| --- |
| |
| # Qwen3-VL-235B-A22B-Instruct-NVFP4 |
|
|
| ## Model Overview |
- **Model Architecture:** Qwen3VLMoeForConditionalGeneration
- **Input:** Text / Image
| - **Output:** Text |
| - **Model Optimizations:** |
| - **Weight quantization:** FP4 |
| - **Activation quantization:** FP4 |
| - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. |
| - **Release Date:** 10/29/2025 |
| - **Version:** 1.0 |
| - **Model Developers:** RedHatAI |
|
|
| This model is a quantized version of [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). |
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
|
|
| ### Model Optimizations |
|
|
This model was obtained by quantizing the weights and activations of [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct) to the FP4 data type, ready for inference with vLLM>=0.9.1.
| This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. |
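
As a rough back-of-the-envelope check (an approximation that ignores quantization scales, the unquantized modules, and runtime KV-cache memory), the savings can be estimated as follows:

```python
# Approximate weight-only memory footprint, assuming the nominal 235B parameter count.
num_params = 235e9
bf16_gb = num_params * 2 / 1e9     # 16-bit weights: 2 bytes per parameter  -> ~470 GB
nvfp4_gb = num_params * 0.5 / 1e9  # 4-bit weights: 0.5 bytes per parameter -> ~118 GB
print(f"BF16: ~{bf16_gb:.0f} GB, NVFP4: ~{nvfp4_gb:.0f} GB")
```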
|
|
Only the weights and activations of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
|
|
| ## Deployment |
|
|
| ### Use with vLLM |
|
|
| This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below. |
|
|
| ```python |
| from vllm import LLM, SamplingParams |
| from transformers import AutoTokenizer |
| |
| model_id = "RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4" |
| number_gpus = 1 |
| |
| sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256) |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| |
| messages = [ |
| {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"}, |
| {"role": "user", "content": "Who are you?"}, |
| ] |
| |
| prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) |
| |
| llm = LLM(model=model_id, tensor_parallel_size=number_gpus) |
| |
| outputs = llm.generate(prompts, sampling_params) |
| |
| generated_text = outputs[0].outputs[0].text |
| print(generated_text) |
| ``` |
|
|
vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
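
For example, assuming a server has been started with `vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 --tensor-parallel-size 4` (adjust the parallelism and port to your hardware), it can be queried with any OpenAI-compatible client:

```python
from openai import OpenAI

# Point the client at the local vLLM server (default port 8000); no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```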
|
|
| ## Creation |
|
|
This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py) with calibration samples from the `neuralmagic/calibration` dataset, as presented in the code snippet below.
|
|
| <details> |
| |
| ```python |
| import torch |
| from datasets import load_dataset |
| from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration |
| |
| from llmcompressor import oneshot |
| from llmcompressor.modeling import replace_modules_for_calibration |
| from llmcompressor.modifiers.quantization import QuantizationModifier |
| from llmcompressor.utils import dispatch_for_generation |
| |
| # NOTE: Requires a minimum of transformers 4.57.0 |
| |
| MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct" |
| |
| |
| # Load model. |
| model = Qwen3VLMoeForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto") |
| processor = AutoProcessor.from_pretrained(MODEL_ID) |
| model = replace_modules_for_calibration(model) |
| |
| DATASET_ID = "neuralmagic/calibration" |
| NUM_CALIBRATION_SAMPLES = 20 |
| MAX_SEQUENCE_LENGTH = 8192 |
| |
| ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]") |
| |
| |
| def preprocess_function(example): |
    messages = []
    for message in example["messages"]:
        messages.append(
| { |
| "role": message["role"], |
| "content": [{"type": "text", "text": message["content"]}], |
| } |
| ) |
| |
| return processor.apply_chat_template( |
        messages,
| return_tensors="pt", |
| padding=False, |
| truncation=True, |
| max_length=MAX_SEQUENCE_LENGTH, |
| tokenize=True, |
| add_special_tokens=False, |
| return_dict=True, |
| add_generation_prompt=False, |
| ) |
| |
| |
| ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names) |
| |
| |
| def data_collator(batch): |
| assert len(batch) == 1 |
| return { |
| key: ( |
| torch.tensor(value) |
| if key != "pixel_values" |
| else torch.tensor(value, dtype=torch.bfloat16).squeeze(0) |
| ) |
| for key, value in batch[0].items() |
| } |
| |
| |
| # Configure the quantization algorithm and scheme. |
| # In this case, we: |
| # * quantize the weights to fp4 with group-wise quantization |
# * quantize the activations to fp4 with dynamic group-wise quantization
| recipe = QuantizationModifier( |
| targets="Linear", |
| scheme="NVFP4", |
| ignore=[ |
| "re:.*lm_head", |
| "re:visual.*", |
| "re:model.visual.*", |
| "re:.*mlp.gate$", |
| ], |
| ) |
| |
| # Apply quantization. |
| oneshot( |
| model=model, |
| recipe=recipe, |
| max_seq_length=MAX_SEQUENCE_LENGTH, |
| num_calibration_samples=NUM_CALIBRATION_SAMPLES, |
| dataset=ds, |
| data_collator=data_collator, |
| ) |
| |
| print("========== SAMPLE GENERATION ==============") |
| dispatch_for_generation(model) |
| input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda") |
| output = model.generate(input_ids, max_new_tokens=20) |
| print(processor.decode(output[0])) |
| print("==========================================") |
| |
| |
| # Save to disk in compressed-tensors format. |
| SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4" |
| model.save_pretrained(SAVE_DIR) |
| processor.save_pretrained(SAVE_DIR) |
| |
| ``` |
| </details> |
|
|
| ## Evaluation |
|
|
This model was evaluated on the well-known OpenLLM v1 benchmark using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness), and on the MMMU and ChartQA vision benchmarks using lmms-eval; the exact commands are listed in the Reproduction section below.
| |
| ### Accuracy |
| <table> |
| <thead> |
| <tr> |
| <th>Category</th> |
| <th>Metric</th> |
| <th>Qwen/Qwen3-VL-235B-A22B-Instruct</th> |
| <th>RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 (this model)</th> |
| <th>Recovery</th> |
| </tr> |
| </thead> |
| <tbody> |
| <!-- OpenLLM --> |
| <tr> |
| <td rowspan="7"><b>OpenLLM</b></td> |
| <td>arc_challenge</td> |
| <td>72.95</td> |
| <td>71.59</td> |
| <td>98.13</td> |
| </tr> |
| <tr> |
| <td>gsm8k</td> |
| <td>90.37</td> |
| <td>88.25</td> |
| <td>97.65</td> |
| </tr> |
| <tr> |
| <td>hellaswag</td> |
| <td>87.94</td> |
| <td>86.80</td> |
| <td>98.70</td> |
| </tr> |
| <tr> |
| <td>mmlu</td> |
| <td>87.12</td> |
| <td>86.22</td> |
| <td>98.97</td> |
| </tr> |
| <tr> |
| <td>truthfulqa_mc2</td> |
| <td>63.31</td> |
| <td>62.37</td> |
| <td>98.52</td> |
| </tr> |
| <tr> |
| <td>winogrande</td> |
| <td>81.93</td> |
| <td>80.43</td> |
| <td>98.17</td> |
| </tr> |
| <tr> |
| <td><b>Average</b></td> |
| <td><b>80.60</b></td> |
| <td><b>79.28</b></td> |
| <td><b>98.35</b></td> |
| </tr> |
| <!-- Vision --> |
| <tr> |
| <td rowspan="7"><b>Vision</b></td> |
| <td>mmmu_val</td> |
| <td>63.56</td> |
| <td>62.11</td> |
| <td>97.71</td> |
| </tr> |
| <tr> |
| <td>chartqa</td> |
| <td>90.52</td> |
| <td>89.00</td> |
| <td>98.32</td> |
| </tr> |
| <tr> |
| <td><b>Average</b></td> |
| <td><b>77.04</b></td> |
| <td><b>75.56</b></td> |
| <td><b>98.08</b></td> |
| </tr> |
| </tbody> |
| </table> |
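
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline score; a minimal example of the calculation, using the mmlu row above:

```python
baseline, quantized = 87.12, 86.22    # mmlu scores from the table above
recovery = 100 * quantized / baseline
print(f"{recovery:.2f}")              # 98.97
```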
| |
|
|
|
|
| ### Reproduction |
|
|
| The results were obtained using the following commands: |
|
|
| <details> |
|
|
| #### OpenLLM |
| ``` |
| lm_eval \ |
| --model vllm \ |
| --model_args pretrained="RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True\ |
| --apply_chat_template \ |
| --fewshot_as_multiturn \ |
| --tasks openllm \ |
| --batch_size auto |
| ``` |
|
|
| #### Vision |
| ``` |
| python3 -m lmms_eval \ |
| --model vllm \ |
| --model_args model=RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4,tensor_parallel_size=4,max_model_len=20000 \ |
| --tasks chartqa,mmmu_val \ |
| --batch_size 1 |
| ``` |
|
|
| </details> |