vqwen3-4b

A ready-to-use vision-language model built by swapping Vicuna for Qwen3-4B in the LLaVA-1.5 recipe. Everything is pre-wired: drop in a LlavaForConditionalGeneration loader, pass an image + a prompt, get text out. No rigging.

  • Vision tower: openai/clip-vit-large-patch14-336 (frozen)
  • Language model: Qwen/Qwen3-4B with LoRA merged back into the weights
  • Projector: 2-layer MLP (Linear 1024→2560 → GELU → Linear 2560→2560), trained in stage 1 and again in stage 2; sketched after this list
  • Image tokens: 576 (CLS stripped) per 336×336 image, spliced into the chat sequence
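
For concreteness, here is a minimal sketch of that projector in PyTorch, assuming the dimensions above (1024-d CLIP features into Qwen3-4B's 2560-d embedding space); the class name is illustrative, not the released module name:

import torch.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP: CLIP ViT-L/14 features (1024-d) -> Qwen3-4B embeddings (2560-d)."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2560):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        # Total parameters ~9.2M, matching the stage-1 trainable count below.

    def forward(self, image_features):
        # image_features: (batch, 576, 1024) -- 24x24 patches of a 336x336
        # image at patch size 14, with the CLS token already stripped.
        return self.mlp(image_features)  # (batch, 576, 2560)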

Quick start

import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "alpharomercoma/vqwen3-4b"
model = LlavaForConditionalGeneration.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("my_image.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]

# Render the Qwen3 chat template with an <image> placeholder, then tensorize
# text + image together so the placeholder lines up with pixel_values.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)  # match the bf16 weights

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt portion.
reply = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)

Training recipe

A two-stage reproduction of the LLaVA-1.5 recipe; both stages ran on a single H200 (141 GB).

Stage 1 — feature alignment (projector only)

  • Data: liuhaotian/LLaVA-Pretrain (558,128 BLIP caption pairs)
  • Trainable: MLP projector only (~9.2 M params); CLIP and Qwen3-4B stay frozen (see the sketch after this list)
  • Batch 256 (32 × grad_accum 8), LR 1e-3 cosine, warmup 0.03, bf16, 1 epoch
  • Loss masked over the image placeholder tokens; only the caption tokens + EOS contribute
  • 6.1 h, final train loss 2.87
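
A minimal sketch of that stage-1 freeze in transformers terms, where model is the LlavaForConditionalGeneration from the quick start (optimizer and data loop omitted):

# Freeze everything, then unfreeze only the multimodal projector.
for param in model.parameters():
    param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M")  # ~9.2M expected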

Stage 2 — visual instruction tuning (projector + LoRA)

  • Data: liuhaotian/LLaVA-Instruct-150K (LLaVA-1.5 mix665k: COCO, GQA, OCR-VQA, TextVQA, VisualGenome + ShareGPT text-only; 665,286 records after filtering 12 dead image refs)
  • Trainable: projector (continues at LR 2e-5) + LoRA on Qwen3 (LR 2e-4)
  • LoRA: r=128, α=256, dropout 0.05, targets [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]; see the config sketch after this list
  • Batch 128 (16 × grad_accum 8), cosine + warmup 0.03, bf16, 1 epoch
  • Liger-Kernel (FusedLinearCrossEntropy + Triton RoPE/RMSNorm/SwiGLU) — math-identical to the HF baseline
  • LengthGroupedSampler over mix665k for padding efficiency
  • Gradient checkpointing ON, SDPA attention
  • Conversation format: Qwen3 chat template, loss masked on non-assistant turns
  • 17.1 h, final train loss 0.858
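
Expressed with the peft library (a tooling assumption; the card states only the hyperparameters), the stage-2 adapter configuration would look roughly like:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    # Regex keeps the adapters on the language model only: CLIP's attention
    # also uses q_proj/k_proj/v_proj names and must stay untouched.
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",
    # Keep the projector fully trainable alongside the adapters.
    modules_to_save=["multi_modal_projector"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The split LRs (2e-5 projector, 2e-4 LoRA) need two optimizer param groups; omitted here.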

The stage-2 LoRA has been merged back into Qwen3's weights in this release, so loading is a single .from_pretrained() call.
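
If you retrain and want to reproduce that merge yourself, the usual peft route is (adapter path hypothetical):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
lm = PeftModel.from_pretrained(base, "path/to/stage2-lora")  # hypothetical adapter path
lm = lm.merge_and_unload()  # plain Qwen3 weights with the LoRA delta folded in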

Architecture details

The model is the transformers-standard LlavaForConditionalGeneration with:

  • vision_config: CLIP ViT-L/14-336 (fixed)
  • text_config: Qwen3-4B (with LoRA merged)
  • image_seq_length: 576
  • vision_feature_layer: −2 (penultimate hidden state)
  • vision_feature_select_strategy: "default" (strips CLS)
  • image_token_index: 151669 (the added <image> special token)
  • projector_hidden_act: "gelu"

Because these choices match the LLaVA class upstream, no custom code or trust_remote_code=True is required.
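
As a sketch, those fields map onto a stock LlavaConfig roughly like this (assembling it from the base checkpoints is illustrative; the released config.json already carries these values):

from transformers import AutoConfig, LlavaConfig

vision_cfg = AutoConfig.from_pretrained("openai/clip-vit-large-patch14-336").vision_config
text_cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B")

config = LlavaConfig(
    vision_config=vision_cfg,
    text_config=text_cfg,
    image_token_index=151669,                  # the added <image> special token
    image_seq_length=576,                      # (336 / 14)^2 patches per image
    vision_feature_layer=-2,                   # penultimate CLIP hidden state
    vision_feature_select_strategy="default",  # strips the CLS token
    projector_hidden_act="gelu",
)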

Limitations

  • Trained on LLaVA-Instruct-150K — inherits its distribution: English-heavy, mostly natural-image QA, OCR-light. Don't expect SOTA on GUI / document / chart tasks.
  • Single image per conversation (one set of 576 image tokens is spliced in).
  • Not aligned for safety/refusals beyond Qwen3-4B's base behavior + whatever mix665k contributes.
  • Hallucinations are possible, especially for fine-grained object counts and spatial relations.

License

Apache 2.0 for the projector weights and LoRA-merged Qwen3 delta. Base models retain their original licenses: OpenAI CLIP (MIT), Qwen3-4B (Apache 2.0).

Citation / acknowledgements

  • LLaVA and LLaVA-1.5 for the recipe
  • Qwen/Qwen3-4B as the language backbone
  • openai/clip-vit-large-patch14-336 as the vision tower
  • Full training + inference source: github:vqwen (companion repo)