Qwen3-VL-4B-Instruct — Pre-compiled for AWS Inferentia2

Pre-compiled artifacts for running Qwen/Qwen3-VL-4B-Instruct on AWS Inferentia2 (inf2) instances.

Configuration

| Setting | Value |
|---|---|
| Instance type | inf2.xlarge / inf2.8xlarge (2 NeuronCores) |
| Tensor parallel | 2 |
| Batch size | 1 |
| Max sequence length | 4096 |
| Data type | BF16 |
| ISA kernels | All OFF |
| Buckets (context) | 512, 1024, 4096 |
| Buckets (token gen) | 512, 1024, 4096 |
| Vision buckets | 512, 1024, 4096 |
| SDK | Neuron SDK 2.28 (DLAMI 20260227) |
| NxD Inference | 0.8.x |
| vLLM | 0.13.x |
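With bucketing enabled, each request is padded up to the smallest configured bucket that fits it, so only a few compiled graph shapes are needed. A minimal sketch of that selection logic (illustrative only, not the NxD Inference implementation):

```python
def pick_bucket(seq_len, buckets=(512, 1024, 4096)):
    """Return the smallest configured bucket that can hold seq_len."""
    for bucket in sorted(buckets):
        if seq_len <= bucket:
            return bucket
    raise ValueError(f"sequence length {seq_len} exceeds largest bucket {max(buckets)}")

# A 700-token prompt is padded to the 1024 bucket.
print(pick_bucket(700))  # → 1024
```

This is why requests longer than 4096 tokens are rejected outright: there is no larger compiled shape to fall back on.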

Usage

Prerequisites

  • AWS Inferentia2 instance (inf2.xlarge or larger)
  • Deep Learning AMI Neuron (Ubuntu 24.04) 20260227
  • Pre-installed venv: /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/

Quick Start

# Activate environment
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# Download original model weights (required for weight loading)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='Qwen/Qwen3-VL-4B-Instruct', local_dir='Qwen3-VL-4B-Instruct')
"

# Download pre-compiled artifacts
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='jburtoft/Qwen3-VL-4B-Instruct-neuron-inf2-tp2', local_dir='neuron-artifacts')
"

# Set environment
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-artifacts/bs1_tp2
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference

Python API (vLLM)

import os
os.environ["NEURON_COMPILED_ARTIFACTS"] = "neuron-artifacts/bs1_tp2"
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-VL-4B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    max_num_seqs=1,
    max_model_len=4096,
    swap_space=0,
    additional_config=dict(override_neuron_config=dict(
        text_neuron_config={
            "batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1,
            "seq_len": 4096, "max_context_length": 4096,
            "torch_dtype": "bfloat16", "tp_degree": 2, "world_size": 2,
            "enable_bucketing": True,
            "context_encoding_buckets": [512, 1024, 4096],
            "token_generation_buckets": [512, 1024, 4096],
            "fused_qkv": True,
            "qkv_kernel_enabled": False, "mlp_kernel_enabled": False,
            "attn_kernel_enabled": False,
            "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
            "rpl_reduce_dtype": "bfloat16", "attention_dtype": "bfloat16",
            "cast_type": "as-declared",
        },
        vision_neuron_config={
            "batch_size": 1, "seq_len": 4096, "max_context_length": 4096,
            "enable_bucketing": True, "buckets": [512, 1024, 4096],
            "world_size": 2, "tp_degree": 2,
            "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
            "cast_type": "as-declared",
            "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
            "fused_qkv": True,
            "attn_kernel_enabled": False, "mlp_kernel_enabled": False,
        },
    )),
    limit_mm_per_prompt={"image": 1},
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

# Run inference
sampling = SamplingParams(top_k=1, max_tokens=256, temperature=0.0)
outputs = llm.generate([{"prompt": "Hello, what can you do?"}], sampling)
print(outputs[0].outputs[0].text)
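Since this is a vision-language model, image requests go through vLLM's multi-modal prompt format, where a multi_modal_data field accompanies the prompt. A hedged sketch of building such a request; the special tokens below follow the Qwen-family convention but are an assumption here, so in practice derive the prompt from the tokenizer's chat template:

```python
# Sketch only: build a single-image request for llm.generate.
# Assumes vLLM's multi-modal request format ({"prompt": ..., "multi_modal_data": ...}).
# The <|im_start|>/<|vision_start|> tokens are Qwen-style placeholders,
# not verified against this exact model's chat template.
def build_image_request(question, image):
    prompt = (
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>"
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    return {"prompt": prompt, "multi_modal_data": {"image": image}}

# request = build_image_request("Describe this image.", pil_image)
# outputs = llm.generate([request], sampling)
```

Note that the configuration above sets limit_mm_per_prompt={"image": 1}, so each request may carry at most one image.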

Important Notes

  1. Original model weights required: This repo contains only compiled NEFFs and (if available) pre-sharded weight checkpoints. You still need the original Qwen/Qwen3-VL-4B-Instruct model weights on disk.

  2. tie_word_embeddings fix: The original model has tie_word_embeddings=true. You must either add lm_head.weight to the safetensors file or apply the monkey-patch (see the benchmark script for details).

  3. inf2.xlarge (16 GB RAM): System RAM is tight. Use swap_space=0 to avoid vLLM allocating swap memory. Pre-sharded checkpoints (if included) help reduce peak memory during weight loading.

  4. Artifacts are SDK-version- and hardware-specific: These compiled artifacts only work on inf2 instances running Neuron SDK 2.28 (DLAMI 20260227); other SDK versions or instance families require recompilation.
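The tie_word_embeddings fix in note 2 amounts to materializing lm_head.weight from the tied input-embedding weight before loading. A minimal state-dict-level sketch; the key names are the usual Hugging Face names for Qwen-family checkpoints and are an assumption here, so verify them against the actual checkpoint:

```python
def untie_lm_head(state_dict):
    """If the checkpoint lacks lm_head.weight, copy it from the tied
    input-embedding weight so downstream loaders find both keys.
    Key names are assumed Qwen-family HF conventions."""
    if "lm_head.weight" not in state_dict:
        state_dict["lm_head.weight"] = state_dict["model.embed_tokens.weight"]
    return state_dict
```

If you apply this to real tensors and re-save with safetensors, clone the copied tensor first: safetensors refuses to serialize tensors that share storage.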
