# Qwen3-VL-4B-Instruct – Pre-compiled for AWS Inferentia2
Pre-compiled artifacts for running Qwen/Qwen3-VL-4B-Instruct on AWS Inferentia2 (inf2) instances.
## Configuration
| Setting | Value |
|---|---|
| Instance type | inf2.xlarge / inf2.8xlarge (2 NeuronCores) |
| Tensor parallel | 2 |
| Batch size | 1 |
| Max sequence length | 4096 |
| Data type | BF16 |
| ISA Kernels | All OFF |
| Buckets (context) | 512, 1024, 4096 |
| Buckets (token gen) | 512, 1024, 4096 |
| Vision buckets | 512, 1024, 4096 |
| SDK | Neuron SDK 2.28 (DLAMI 20260227) |
| NxD Inference | 0.8.x |
| vLLM | 0.13.x |
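With bucketing enabled, each request is padded up to the smallest compiled bucket that can hold it, so only a fixed set of shapes needs to be compiled. A minimal sketch of that selection logic (the helper name is ours for illustration, not an NxD Inference API):

```python
# Pick the smallest compiled bucket that can hold `seq_len`.
# Bucket values mirror the table above; the function itself is illustrative.
BUCKETS = [512, 1024, 4096]

def pick_bucket(seq_len: int, buckets=BUCKETS) -> int:
    for b in sorted(buckets):
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds max bucket {max(buckets)}")

print(pick_bucket(300))   # -> 512
print(pick_bucket(2000))  # -> 4096
```

A 300-token prompt therefore runs through the 512 graph; anything over 4096 tokens cannot be served with these artifacts.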
## Usage

### Prerequisites
- AWS Inferentia2 instance (inf2.xlarge or larger)
- Deep Learning AMI Neuron (Ubuntu 24.04) 20260227
- Pre-installed venv: `/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/`
### Quick Start

```bash
# Activate environment
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# Download original model weights (required for weight loading)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='Qwen/Qwen3-VL-4B-Instruct', local_dir='Qwen3-VL-4B-Instruct')
"

# Download pre-compiled artifacts
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='jburtoft/Qwen3-VL-4B-Instruct-neuron-inf2-tp2', local_dir='neuron-artifacts')
"

# Set environment
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-artifacts/bs1_tp2
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
```
### Python API (vLLM)

```python
import os

os.environ["NEURON_COMPILED_ARTIFACTS"] = "neuron-artifacts/bs1_tp2"
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-VL-4B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    max_num_seqs=1,
    max_model_len=4096,
    swap_space=0,
    additional_config=dict(override_neuron_config=dict(
        text_neuron_config={
            "batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1,
            "seq_len": 4096, "max_context_length": 4096,
            "torch_dtype": "bfloat16", "tp_degree": 2, "world_size": 2,
            "enable_bucketing": True,
            "context_encoding_buckets": [512, 1024, 4096],
            "token_generation_buckets": [512, 1024, 4096],
            "fused_qkv": True,
            "qkv_kernel_enabled": False, "mlp_kernel_enabled": False,
            "attn_kernel_enabled": False,
            "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
            "rpl_reduce_dtype": "bfloat16", "attention_dtype": "bfloat16",
            "cast_type": "as-declared",
        },
        vision_neuron_config={
            "batch_size": 1, "seq_len": 4096, "max_context_length": 4096,
            "enable_bucketing": True, "buckets": [512, 1024, 4096],
            "world_size": 2, "tp_degree": 2,
            "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
            "cast_type": "as-declared",
            "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
            "fused_qkv": True,
            "attn_kernel_enabled": False, "mlp_kernel_enabled": False,
        },
    )),
    limit_mm_per_prompt={"image": 1},
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

# Run inference
sampling = SamplingParams(top_k=1, max_tokens=256, temperature=0.0)
outputs = llm.generate([{"prompt": "Hello, what can you do?"}], sampling)
print(outputs[0].outputs[0].text)
```
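The example above is text-only. For an image request, vLLM expects the image under a `multi_modal_data` key alongside the prompt (one image per prompt, matching `limit_mm_per_prompt`). The sketch below only builds the request dict: it uses a placeholder object instead of a real `PIL.Image`, and in practice the prompt should be rendered through the model's chat template so the image token lands where Qwen3-VL expects it.

```python
# Illustrative only: structure of a vLLM multimodal request.
# `image_placeholder` stands in for a PIL.Image loaded in real use.
image_placeholder = object()

request = {
    # Render via the tokenizer's chat template in real use.
    "prompt": "Describe this image.",
    "multi_modal_data": {"image": image_placeholder},
}

# Passed the same way as the text-only call: llm.generate([request], sampling)
print(sorted(request))
```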
## Important Notes

- **Original model weights required:** This repo contains only compiled NEFFs and (if available) pre-sharded weight checkpoints. You still need the original `Qwen/Qwen3-VL-4B-Instruct` model weights on disk.
- **`tie_word_embeddings` fix:** The original model has `tie_word_embeddings=true`. You must either add `lm_head.weight` to the safetensors file or apply the monkey-patch (see the benchmark script for details).
- **inf2.xlarge (16 GB RAM):** System RAM is tight. Use `swap_space=0` to keep vLLM from allocating swap memory. Pre-sharded checkpoints (if included) help reduce peak memory during weight loading.
- **Artifacts are SDK-version and hardware specific:** These artifacts only work on inf2 instances with Neuron SDK 2.28 (DLAMI 20260227).
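The `tie_word_embeddings` fix amounts to materializing an explicit `lm_head.weight` entry that duplicates the input-embedding matrix. A conceptual sketch with plain Python objects (the real fix operates on the safetensors checkpoint; the key names follow the usual Hugging Face Qwen layout and may differ in your checkpoint):

```python
# Conceptual sketch: when tie_word_embeddings=true, lm_head.weight is absent
# from the checkpoint and must be copied from the embedding matrix.
def untie_lm_head(state_dict: dict) -> dict:
    fixed = dict(state_dict)  # shallow copy; leaves the original untouched
    if "lm_head.weight" not in fixed:
        fixed["lm_head.weight"] = fixed["model.embed_tokens.weight"]
    return fixed

sd = {"model.embed_tokens.weight": [[0.1, 0.2], [0.3, 0.4]]}
fixed = untie_lm_head(sd)
print("lm_head.weight" in fixed)  # -> True
```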