---
license: cc-by-nc-4.0
language:
- pl
- en
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- CYFRAGOVPL/pllum-12b-nc-instruct
- google/siglip2-so400m-patch14-384
---

# LLaVA-PLLuM-12b-nc-instruct

This model is the first Polish-focused Vision-Language Model (VLM), created by extending the open-source LLaVA architecture with the PLLuM language model. 
Our pipeline integrates high-quality multimodal instruction tuning with PLLuM’s strong Polish linguistic abilities, resulting in a VLM that demonstrates significantly
improved understanding of Polish language, culture, and context-specific visual reasoning.

# Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Technical Specifications](#technical-specifications)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Citation](#citation)

# Model Details

## Model Description

- **Developed by:** NASK PIB
- **Funded by:** NASK PIB
- **Shared by:** NASK PIB
- **Model type:** Multimodal (Image-Text-to-Text) / Vision Language Model
- **Language(s) (NLP):** Polish, English
- **License:** LLaVA-PLLuM-12b-nc-instruct is published under the [PLLuM-1.0](https://huggingface.co/NASK-PIB/LLaVA-PLLuM-12b-nc-instruct/blob/main/LICENSE) license.

## Model Sources

- **Blogpost:** [Blogpost](https://huggingface.co/spaces/NASK-PIB/LLaVA-PLLuM)

# Uses

## Direct Use

The model is intended for research and development purposes, specifically focusing on multimodal tasks requiring the Polish language and cultural context. It can be used directly for:

* **Visual Question Answering (VQA) in Polish:** Users can provide an image and ask questions about it in Polish (e.g., "Co znajduje się na zdjęciu?").
* **Image Captioning:** Generating detailed descriptions of images in grammatically correct Polish.
* **Optical Character Recognition (OCR):** Extracting and interpreting text visible within images, including Polish documents.
* **Object Counting:** Performing simple enumeration of objects within a visual scene.
* **Multimodal Research:** Serving as a baseline or starting point for researchers developing non-English or bilingual Vision-Language Models (VLMs).

## Downstream Use

This model can be fine-tuned or integrated into larger applications to support specific use cases, such as:
* **Accessibility Tools:** Creating applications that describe surroundings or digital content to visually impaired Polish speakers.
* **E-commerce:** Generating automated product descriptions based on images for Polish marketplaces.
* **Educational Assistants:** Developing tutoring systems that can explain visual content (diagrams, historical photos) to students in Polish.
* **Specialized Fine-tuning:** The model can be further fine-tuned on domain-specific datasets (e.g., Polish medical imaging reports or legal document analysis) to improve performance in niche sectors.
  
## Out-of-Scope Use

* **Generation of Harmful Content:** Utilizing the model to generate hate speech, explicit content, or to facilitate harassment and disinformation.
* **High-Stakes Factual Retrieval:** Like all Large Language Models, this model can "hallucinate" or produce factually incorrect information. It should not be relied upon as a sole source of truth without human verification.
* **English-Primary Tasks:** While the model retains some English capabilities, it is optimized for Polish. Users seeking state-of-the-art performance strictly for English tasks should prefer models trained primarily on English data.

# Bias, Risks, and Limitations

- **Potential Hallucinations**: Like other LLMs, PLLuM may occasionally produce factually incorrect or fabricated content.
- **Sensitivity & Bias**: The current version has not undergone multimodal safety alignment. As a result, users may encounter biased behavior or toxic generations, particularly when the model is prompted with visual inputs.
- **Context Length**: Tasks with very long contexts may exceed the model's 8,192-token context size or the memory available on the target hardware.

## Recommendations

**Users (both direct and downstream) should be aware of the risks, biases, and limitations of the model. We recommend the following:**

* **Treat as a Research Proof-of-Concept:** This model represents a preliminary step toward robust Polish multimodal AI. It is not a finished commercial product. Users should exercise caution when applying it to real-world scenarios and should not deploy it in production environments without extensive domain-specific testing and guardrails.
* **Human Verification Required:** Like all Large Multimodal Models (LMMs), this model is prone to "hallucinations"—confidently stating incorrect facts or describing objects that are not present in the image. Always keep a human in the loop to verify outputs, especially for factual queries or quantitative tasks (e.g., counting objects).
* **Awareness of Translation Artifacts:** A significant portion of the instruction-tuning dataset (e.g., ALLaVA, LLaVA-Instruct) was automatically translated from English to Polish. While we employed filtering metrics (COMET), some linguistic unnaturalness or translation artifacts may persist in the model's responses.

# Training Details

## Training Data

The model was trained in two stages using a combination of translated open-source datasets and synthetic data, totaling approximately **2 million samples** with an 85% Polish / 15% English split.

* **Stage 1: Pre-training (Feature Alignment)**
* **Stage 2: Instruction Tuning (Visual Instruction Tuning)**

## Training Procedure

### Preprocessing

To create high-quality Polish multimodal data from English sources, a rigorous translation and filtering pipeline was employed:
1.  **Translation:** Source English datasets were translated using the **Tower+ 72B** model.
2.  **Filtering:** The **COMET** reference-free metric was used to filter out poor-quality translations.
3.  **Manual Review:** A portion of the data underwent manual expert filtering to ensure linguistic quality.
4.  **Dynamic Tiling:** Following LLaVA-NeXT, images are processed with dynamic tiling to support higher input resolutions.
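
The translate-then-filter steps above can be sketched as follows. This is a minimal illustration, not the pipeline used for this model: `score_fn` stands in for a reference-free quality estimator such as COMET, and the `TranslationPair` type, the `0.8` threshold, and the toy scorer are assumptions introduced here for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TranslationPair:
    source_en: str       # original English text
    translation_pl: str  # machine-translated Polish text


def filter_translations(
    pairs: Iterable[TranslationPair],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the model card
) -> List[TranslationPair]:
    """Keep pairs whose reference-free quality score reaches the threshold."""
    return [p for p in pairs if score_fn(p.source_en, p.translation_pl) >= threshold]


def toy_score(source: str, translation: str) -> float:
    """Placeholder scorer: rejects empty or untranslated output.

    A real pipeline would call a quality-estimation model here instead.
    """
    if not translation.strip() or translation == source:
        return 0.0
    return 1.0


pairs = [
    TranslationPair("What is in the picture?", "Co znajduje się na zdjęciu?"),
    TranslationPair("Describe the image.", "Describe the image."),  # left untranslated
    TranslationPair("How many dogs are there?", ""),                # empty output
]
kept = filter_translations(pairs, toy_score)
print(len(kept))
```

Passing the scorer as a callable keeps the filtering logic independent of any particular quality-estimation library, so the placeholder can be swapped for a real model without changing `filter_translations`.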
   

### Speeds, Sizes, Times

* **Training Stages:** 2 Stages.
* **Epochs:** 1 Epoch for both stages.
* **Batch Size:** 256 (Stage 1), 128 (Stage 2).
* **Context Size:** 8,192 tokens.
* **Trainable Parameters:**
    * *Stage 1:* 30M (Projector only).
    * *Stage 2:* 12B (LLM via LoRA) + 400M (Vision Encoder) + 30M (Projector).
* **Learning Rates (Stage 2):** 2×10⁻⁶ (Vision), 2×10⁻⁵ (Projector & LLM).
* **LoRA Config:** Rank 128, Alpha 256, Dropout 0.05.
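
For reference, under the standard LoRA formulation the settings above imply a scaling factor of alpha / rank = 2. The sketch below works through that arithmetic and the adapter size for a single weight matrix. The 5120 × 5120 layer shape is a hypothetical example chosen for illustration, not a confirmed dimension of this model.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / rank."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 128, 256   # values from the model card
D_IN = D_OUT = 5120      # hypothetical layer shape, for illustration only

scaling = lora_scaling(ALPHA, RANK)
adapter_params = lora_param_count(D_IN, D_OUT, RANK)
full_params = D_IN * D_OUT
fraction = adapter_params / full_params

print(f"scaling = {scaling}")
print(f"adapter parameters = {adapter_params:,}")
print(f"fraction of the full matrix = {fraction:.2%}")
```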

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

* **Quantitative:** **MMBench v1.1 (Development Split)**. The dataset was translated to Polish using Tower+ 72B and subsequently **manually corrected by experts** to remove translation artifacts (referred to as MMBench-PL).
* **Qualitative (Model-as-a-Judge):** **XM3600** (Polish subset), a dataset requiring accurate and culturally relevant image descriptions.

### Factors

* **Language:** Performance comparison between Polish (target) and English (source) capabilities.
* **Task Type:** Object recognition, OCR, commonsense reasoning, fine-grained perception, and cultural context recognition.

### Metrics

* **Accuracy:** Used for MMBench multiple-choice questions.
* **Win-rate (LLM-as-a-Judge):** Pairwise comparison using LLaVA-OneVision-72B to judge caption quality between PLLuM and baseline models (PaliGemma, Qwen2.5, Pixtral).
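
The two metrics above reduce to simple counting once the model answers and judge verdicts are collected. The sketch below is a minimal illustration; the verdict labels (`"A"`, `"B"`, `"tie"`) and the choice to count a tie as half a win are assumptions made for this example, since the card does not specify how ties were handled.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of multiple-choice predictions that match the gold answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def win_rate(verdicts: Sequence[str], model: str = "A") -> float:
    """Share of pairwise comparisons won by `model`.

    Each verdict is "A", "B", or "tie"; a tie counts as half a win (assumption).
    """
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    wins = sum(v == model for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)


# Toy data, not real evaluation results.
acc = accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])
wr = win_rate(["A", "A", "B", "tie"])
print(acc, wr)
```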

## Results

### Summary

The model demonstrates a significant advancement in Polish multimodal capabilities:
* **MMBench-PL:** Achieved **79.35%**, marking a **+9.55% improvement** over LLaVA-1.6-Vicuna-13B, while maintaining comparable English performance.
* **Captioning Quality:** Outperforms PaliGemma-3B (65.28% win-rate), slightly outperforms LLaVA-1.6-Mistral-7B and LLaVA-1.6-Vicuna-13B, and is competitive with, though slightly behind, Qwen2.5-VL-7B and Pixtral-12B.
* **Qualitative Analysis:** The model shows superior handling of Polish grammar/morphology and correctly identifies Polish cultural elements (e.g., specific landmarks like the Palace of Culture and Science, regional food like Toruń gingerbread) where generic models often fail.

## Societal Impact Assessment

* **Cultural Inclusion:** This model helps bridge the gap in multimodal AI for the Polish language, allowing for technology that reflects local linguistic and cultural nuances rather than defaulting to US-centric norms.
* **Lack of Safety Alignment:** **Important:** As a research proof-of-concept, this model **has not undergone specific safety alignment** (e.g., RLHF) for the vision-language domain. Consequently, it may be more prone to generating biased, toxic, or inappropriate responses compared to fully commercialized models, especially when prompted with controversial visual content.
* **Reliability:** Users should be aware of the potential for hallucinations, particularly in OCR or counting tasks, and should not use the model for high-stakes decision-making.


# Technical Specifications

## Model Architecture and Objective

* **Architecture:** Based on the **LLaVA-NeXT** framework.
* **Language Model:** **PLLuM-12B-nc-instruct** (Polish-native, instruction-tuned).
* **Vision Encoder:** **SigLIP2 So400m/14, 384px** (Chosen for strong multilingual alignment).
* **Connector:** Two-layer MLP projector.
* **Objective:** The model uses a standard autoregressive language modeling objective, conditioned on visual inputs processed through the encoder and projector.
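
LLaVA-NeXT handles high-resolution images by splitting them into encoder-sized tiles arranged on a grid that is chosen per image. The sketch below shows one common form of that selection rule: prefer the candidate canvas that preserves the most image pixels, and break ties by the least padding. The candidate list (multiples of the 384 px encoder input, given as width × height) is an assumption for illustration; the grid actually used by this model should be read from its configuration.

```python
from typing import List, Optional, Tuple

Size = Tuple[int, int]  # (width, height)


def select_best_resolution(original: Size, candidates: List[Size]) -> Optional[Size]:
    """Pick the canvas that keeps the most pixels, then wastes the least area."""
    orig_w, orig_h = original
    best: Optional[Size] = None
    best_effective = 0
    least_wasted = float("inf")
    for w, h in candidates:
        # Scale the image to fit inside the canvas while keeping its aspect ratio.
        scale = min(w / orig_w, h / orig_h)
        scaled_w, scaled_h = int(orig_w * scale), int(orig_h * scale)
        # Upscaling adds no information, so cap at the original pixel count.
        effective = min(scaled_w * scaled_h, orig_w * orig_h)
        wasted = w * h - effective
        if effective > best_effective or (
            effective == best_effective and wasted < least_wasted
        ):
            best, best_effective, least_wasted = (w, h), effective, wasted
    return best


TILE = 384  # encoder input size from the model card
# Hypothetical candidate canvases, not read from the model's configuration.
CANDIDATES = [
    (TILE, 2 * TILE), (2 * TILE, TILE), (2 * TILE, 2 * TILE),
    (3 * TILE, TILE), (TILE, 3 * TILE),
]

wide = select_best_resolution((1536, 768), CANDIDATES)
tall = select_best_resolution((768, 1536), CANDIDATES)
square = select_best_resolution((1536, 1536), CANDIDATES)
print(wide, tall, square)
```

With these candidates, a wide image maps to a 2 × 1 tile canvas, a tall image to 1 × 2, and a square image to 2 × 2.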

## Compute Infrastructure

### Hardware

We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018129.


# Model Card Contact

For questions or contributions, please reach out via: nlp@nask.pl


# How to Get Started with the Model

### Inference Example using Transformers

Use the code below to run the model. We recommend using transformers >= 4.56.2.

```python
import torch
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, 
    dtype=torch.float16, 
    device_map="auto",
)

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail"
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    images=image, 
    text=prompt, 
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)

input_len = inputs.input_ids.shape[1]
generated_ids = output[0][input_len:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```

### Inference with vLLM

You can also run the model with vLLM, as shown in the example below. We recommend using vllm >= 0.10.0.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from PIL import Image
import requests


model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1},
)

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail"
    },
]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)
output = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params
)

print(output[0].outputs[0].text)
```


# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{statkiewicz2026annotation,
  title     = {Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework},
  author    = {Statkiewicz, Grzegorz and
               Dobrzeniecka, Alicja and
               Seweryn, Karolina and
               Krasnod{\k e}bska, Aleksandra and
               Piosek, Karolina and
               Bogusz, Katarzyna and
               Cygert, Sebastian and
               Kusa, Wojciech},
  booktitle = {Proceedings of the Student Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)},
  year      = {2026},
  publisher = {Association for Computational Linguistics}
}
```