vqwen3-4b

A ready-to-use vision-language model built by swapping Vicuna for Qwen3-4B in the LLaVA-1.5 recipe. Everything is pre-wired: drop in a LlavaForConditionalGeneration loader, pass an image + a prompt, get text out. No rigging.

  • Vision tower: openai/clip-vit-large-patch14-336 (frozen)
  • Language model: Qwen/Qwen3-4B with LoRA merged back into the weights
  • Projector: 2-layer MLP (Linear 1024→2560 → GELU → Linear 2560→2560), trained in stage 1 and again in stage 2; sketched after this list
  • Image tokens: 576 (CLS stripped) per 336×336 image, spliced into the chat sequence
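
For concreteness, here is a minimal sketch of that projector in PyTorch, assuming the dimensions above (1024-d CLIP features into Qwen3-4B's 2560-d embedding space); the class name is illustrative, not the released module name:

import torch.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP: CLIP ViT-L/14 features (1024-d) -> Qwen3-4B embeddings (2560-d)."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2560):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        # Total parameters ~9.2M, matching the stage-1 trainable count below.

    def forward(self, image_features):
        # image_features: (batch, 576, 1024) -- 24x24 patches of a 336x336
        # image at patch size 14, with the CLS token already stripped.
        return self.mlp(image_features)  # (batch, 576, 2560)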

Quick start

import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "alpharomercoma/vqwen3-4b"
model = LlavaForConditionalGeneration.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("my_image.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]

# Render the Qwen3 chat template with an <image> placeholder, then tensorize
# text + image together so the placeholder lines up with pixel_values.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)  # match the bf16 weights

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt portion.
reply = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)

Training recipe

A two-stage reproduction of the LLaVA-1.5 recipe; both stages ran on a single H200 (141 GB).

Stage 1 — feature alignment (projector only)

  • Data: liuhaotian/LLaVA-Pretrain (558,128 BLIP caption pairs)
  • Trainable: MLP projector only (~9.2 M params); CLIP and Qwen3-4B stay frozen (see the sketch after this list)
  • Batch 256 (32 × grad_accum 8), LR 1e-3 cosine, warmup 0.03, bf16, 1 epoch
  • Loss masked over the image placeholder tokens; only the caption tokens + EOS contribute
  • 6.1 h, final train loss 2.87
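
A minimal sketch of that stage-1 freeze in transformers terms, where model is the LlavaForConditionalGeneration from the quick start (optimizer and data loop omitted):

# Freeze everything, then unfreeze only the multimodal projector.
for param in model.parameters():
    param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M")  # ~9.2M expected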

Stage 2 — visual instruction tuning (projector + LoRA)

  • Data: liuhaotian/LLaVA-Instruct-150K (LLaVA-1.5 mix665k: COCO, GQA, OCR-VQA, TextVQA, VisualGenome + ShareGPT text-only; 665,286 records after filtering 12 dead image refs)
  • Trainable: projector (continues at LR 2e-5) + LoRA on Qwen3 (LR 2e-4)
  • LoRA: r=128, α=256, dropout 0.05, targets [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]; see the config sketch after this list
  • Batch 128 (16 × grad_accum 8), cosine + warmup 0.03, bf16, 1 epoch
  • Liger-Kernel (FusedLinearCrossEntropy + Triton RoPE/RMSNorm/SwiGLU) — math-identical to the HF baseline
  • LengthGroupedSampler over mix665k for padding efficiency
  • Gradient checkpointing ON, SDPA attention
  • Conversation format: Qwen3 chat template, loss masked on non-assistant turns
  • 17.1 h, final train loss 0.858
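
Expressed with the peft library (a tooling assumption; the card states only the hyperparameters), the stage-2 adapter configuration would look roughly like:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    # Regex keeps the adapters on the language model only: CLIP's attention
    # also uses q_proj/k_proj/v_proj names and must stay untouched.
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",
    # Keep the projector fully trainable alongside the adapters.
    modules_to_save=["multi_modal_projector"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The split LRs (2e-5 projector, 2e-4 LoRA) need two optimizer param groups; omitted here.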

The stage-2 LoRA has been merged back into Qwen3's weights in this release, so loading is a single .from_pretrained() call.
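
If you retrain and want to reproduce that merge yourself, the usual peft route is (adapter path hypothetical):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
lm = PeftModel.from_pretrained(base, "path/to/stage2-lora")  # hypothetical adapter path
lm = lm.merge_and_unload()  # plain Qwen3 weights with the LoRA delta folded in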

Architecture details

The model is the transformers-standard LlavaForConditionalGeneration with:

  • vision_config: CLIP ViT-L/14-336 (fixed)
  • text_config: Qwen3-4B (with LoRA merged)
  • image_seq_length: 576
  • vision_feature_layer: −2 (penultimate hidden state)
  • vision_feature_select_strategy: "default" (strips CLS)
  • image_token_index: 151669 (the added <image> special token)
  • projector_hidden_act: "gelu"

Because these choices match the LLaVA class upstream, no custom code or trust_remote_code=True is required.
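
As a sketch, those fields map onto a stock LlavaConfig roughly like this (assembling it from the base checkpoints is illustrative; the released config.json already carries these values):

from transformers import AutoConfig, LlavaConfig

vision_cfg = AutoConfig.from_pretrained("openai/clip-vit-large-patch14-336").vision_config
text_cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B")

config = LlavaConfig(
    vision_config=vision_cfg,
    text_config=text_cfg,
    image_token_index=151669,                  # the added <image> special token
    image_seq_length=576,                      # (336 / 14)^2 patches per image
    vision_feature_layer=-2,                   # penultimate CLIP hidden state
    vision_feature_select_strategy="default",  # strips the CLS token
    projector_hidden_act="gelu",
)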

Limitations

  • Trained on LLaVA-Instruct-150K — inherits its distribution: English-heavy, mostly natural-image QA, OCR-light. Don't expect SOTA on GUI / document / chart tasks.
  • Single image per conversation (one set of 576 image tokens is spliced in).
  • Not aligned for safety/refusals beyond Qwen3-4B's base behavior + whatever mix665k contributes.
  • Hallucinations are possible, especially for fine-grained object counts and spatial relations.

License

Apache 2.0 for the projector weights and LoRA-merged Qwen3 delta. Base models retain their original licenses: OpenAI CLIP (MIT), Qwen3-4B (Apache 2.0).

Citation / acknowledgements

  • LLaVA and LLaVA-1.5 for the recipe
  • Qwen/Qwen3-4B as the language backbone
  • openai/clip-vit-large-patch14-336 as the vision tower
  • Full training + inference source: github:vqwen (companion repo)