---
license: cc-by-nc-4.0
language:
- pl
- en
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- CYFRAGOVPL/pllum-12b-nc-instruct
- google/siglip2-so400m-patch14-384
---

# LLaVA-PLLuM-12b-nc-instruct

This model is the first Polish-focused Vision-Language Model (VLM), created by extending the open-source LLaVA architecture with the PLLuM language model. Our pipeline integrates high-quality multimodal instruction tuning with PLLuM’s strong Polish linguistic abilities, resulting in a VLM that demonstrates significantly improved understanding of Polish language, culture, and context-specific visual reasoning.

# Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Technical Specifications](#technical-specifications)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Citation](#citation)

# Model Details

## Model Description

- **Developed by:** NASK PIB
- **Funded by:** NASK PIB
- **Shared by:** NASK PIB
- **Model type:** Multimodal (Image-Text-to-Text) / Vision Language Model
- **Language(s) (NLP):** Polish, English
- **License:** LLaVA-PLLuM-12b-nc-instruct is published under the [PLLuM-1.0](https://huggingface.co/NASK-PIB/LLaVA-PLLuM-12b-nc-instruct/blob/main/LICENSE) license.

## Model Sources

- **Blogpost:** [Blogpost](https://huggingface.co/spaces/NASK-PIB/LLaVA-PLLuM)

# Uses

## Direct Use

The model is intended for research and development purposes, specifically focusing on multimodal tasks requiring the Polish language and cultural context. It can be used directly for:

* **Visual Question Answering (VQA) in Polish:** Users can provide an image and ask questions about it in Polish (e.g., "Co znajduje się na zdjęciu?", i.e. "What is in the picture?").
* **Image Captioning:** Generating detailed descriptions of images in grammatically correct Polish.
* **Optical Character Recognition (OCR):** Extracting and interpreting text visible within images, including Polish documents.
* **Object Counting:** Performing simple enumeration of objects within a visual scene.
* **Multimodal Research:** Serving as a baseline or starting point for researchers developing non-English or bilingual Vision-Language Models (VLMs).

## Downstream Use

This model can be fine-tuned or integrated into larger applications to support specific use cases, such as:

* **Accessibility Tools:** Creating applications that describe surroundings or digital content to visually impaired Polish speakers.
* **E-commerce:** Generating automated product descriptions based on images for Polish marketplaces.
* **Educational Assistants:** Developing tutoring systems that can explain visual content (diagrams, historical photos) to students in Polish.
* **Specialized Fine-tuning:** The model can be further fine-tuned on domain-specific datasets (e.g., Polish medical imaging reports or legal document analysis) to improve performance in niche sectors.

## Out-of-Scope Use

* **Generation of Harmful Content:** Utilizing the model to generate hate speech, explicit content, or to facilitate harassment and disinformation.
* **High-Stakes Factual Retrieval:** Like all Large Language Models, this model can "hallucinate" or produce factually incorrect information. It should not be relied upon as a sole source of truth without human verification.
* **English-Primary Tasks:** While the model retains some English capabilities, it is optimized for Polish. Users seeking state-of-the-art performance strictly for English tasks should prefer models trained primarily on English data.

# Bias, Risks, and Limitations

- **Potential Hallucinations**: Like other LLMs, PLLuM may occasionally produce factually incorrect or fabricated content.
- **Sensitivity & Bias**: The current version has not undergone multimodal safety alignment.
As a result, users may encounter biased behavior or toxic generations, particularly when the model is prompted with visual inputs.
- **Context Length**: Tasks with very long contexts may be challenging for the model, depending on memory constraints.

## Recommendations

**Users (both direct and downstream) should be aware of the risks, biases, and limitations of the model. We recommend the following:**

* **Treat as a Research Proof-of-Concept:** This model represents a preliminary step toward robust Polish multimodal AI. It is not a finished commercial product. Users should exercise caution when applying it to real-world scenarios and should not deploy it in production environments without extensive domain-specific testing and guardrails.
* **Human Verification Required:** Like all Large Multimodal Models (LMMs), this model is prone to "hallucinations", that is, confidently stating incorrect facts or describing objects that are not present in the image. Always keep a human in the loop to verify outputs, especially for factual queries or quantitative tasks (e.g., counting objects).
* **Awareness of Translation Artifacts:** A significant portion of the instruction-tuning dataset (e.g., ALLaVA, LLaVA-Instruct) was automatically translated from English to Polish. While we employed filtering metrics (COMET), some linguistic unnaturalness or translation artifacts may persist in the model's responses.

# Training Details

## Training Data

The model was trained in two stages using a combination of translated open-source datasets and synthetic data, totaling approximately **2 million samples** with an 85% Polish / 15% English split.

**Stage 1: Pre-training (Feature Alignment)**

**Stage 2: Instruction Tuning (Visual Instruction Tuning)**

## Training Procedure

### Preprocessing

To create high-quality Polish multimodal data from English sources, a rigorous translation and filtering pipeline was employed:

1. **Translation:** Source English datasets were translated using the **Tower+ 72B** model.
2. **Filtering:** The **COMET** reference-free metric was used to filter out poor-quality translations.
3. **Manual Review:** A portion of the data underwent manual expert filtering to ensure linguistic quality.
4. **Dynamic Tiling:** Following LLaVA-NeXT, images are processed with dynamic tiling to support higher input resolutions.

### Speeds, Sizes, Times

* **Training Stages:** 2 Stages.
* **Epochs:** 1 Epoch for both stages.
* **Batch Size:** 256 (Stage 1), 128 (Stage 2).
* **Context Size:** 8,192 tokens.
* **Trainable Parameters:**
    * *Stage 1:* 30M (Projector only).
    * *Stage 2:* 12B (LLM via LoRA) + 400M (Vision Encoder) + 30M (Projector).
* **Learning Rates (Stage 2):** 2×10⁻⁶ (Vision), 2×10⁻⁵ (Projector & LLM).
* **LoRA Config:** Rank 128, Alpha 256, Dropout 0.05.

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

* **Quantitative:** **MMBench v1.1 (Development Split)**. The dataset was translated to Polish using Tower+ 72B and subsequently **manually corrected by experts** to remove translation artifacts (referred to as MMBench-PL).
* **Qualitative (Model-as-a-Judge):** **XM3600** (Polish subset), a dataset requiring accurate and culturally relevant image descriptions.

### Factors

* **Language:** Performance comparison between Polish (target) and English (source) capabilities.
* **Task Type:** Object recognition, OCR, commonsense reasoning, fine-grained perception, and cultural context recognition.

### Metrics

* **Accuracy:** Used for MMBench multiple-choice questions.
* **Win-rate (LLM-as-a-Judge):** Pairwise comparison using LLaVA-OneVision-72B to judge caption quality between PLLuM and baseline models (PaliGemma, Qwen2.5, Pixtral).

## Results

### Summary

The model demonstrates a significant advancement in Polish multimodal capabilities:

* **MMBench-PL:** Achieved **79.35%**, marking a **+9.55% improvement** over LLaVA-1.6-Vicuna-13B, while maintaining comparable English performance.
* **Captioning Quality:** Outperforms PaliGemma-3B (65.28% win-rate), slightly outperforms LLaVA-1.6-Mistral-7B and LLaVA-1.6-Vicuna-13B, and shows competitive, though slightly lower, results compared to Qwen2.5-VL-7B and Pixtral-12B.
* **Qualitative Analysis:** The model shows superior handling of Polish grammar/morphology and correctly identifies Polish cultural elements (e.g., specific landmarks like the Palace of Culture and Science, regional food like Toruń gingerbread) where generic models often fail.

## Societal Impact Assessment

* **Cultural Inclusion:** This model helps bridge the gap in multimodal AI for the Polish language, allowing for technology that reflects local linguistic and cultural nuances rather than defaulting to US-centric norms.
* **Lack of Safety Alignment:** **Important:** As a research proof-of-concept, this model **has not undergone specific safety alignment** (e.g., RLHF) for the vision-language domain. Consequently, it may be more prone to generating biased, toxic, or inappropriate responses compared to fully commercialized models, especially when prompted with controversial visual content.
* **Reliability:** Users should be aware of the potential for hallucinations, particularly in OCR or counting tasks, and should not use the model for high-stakes decision-making.

# Technical Specifications

## Model Architecture and Objective

* **Architecture:** Based on the **LLaVA-NeXT** framework.
* **Language Model:** **PLLuM-12B-nc-instruct** (Polish-native, instruction-tuned).
* **Vision Encoder:** **SigLIP2 So400m/14, 384px** (chosen for strong multilingual alignment).
* **Connector:** Two-layer MLP projector.
* **Objective:** The model uses a standard autoregressive language modeling objective, conditioned on visual inputs processed through the encoder and projector.
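
As a rough sanity check on the connector listed above, the parameter count of a two-layer MLP projector can be estimated from the two hidden sizes it bridges. The sketch below is illustrative only: the function name is hypothetical, and the hidden sizes (1152 for the vision encoder, 5120 for the language model) and the presence of bias terms are assumptions, not values read from the released configuration.

```python
def mlp_projector_param_count(vision_dim: int, llm_dim: int, bias: bool = True) -> int:
    """Count parameters of a two-layer MLP projector.

    Layout assumed here: Linear(vision_dim -> llm_dim), a parameter-free
    activation, then Linear(llm_dim -> llm_dim).
    """
    first = vision_dim * llm_dim + (llm_dim if bias else 0)
    second = llm_dim * llm_dim + (llm_dim if bias else 0)
    return first + second


# Assumed hidden sizes (hypothetical, not taken from the model's config files).
VISION_DIM = 1152  # assumed SigLIP2 So400m hidden size
LLM_DIM = 5120     # assumed PLLuM-12B hidden size

total = mlp_projector_param_count(VISION_DIM, LLM_DIM)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes out at about 32M, the same order as the roughly 30M trainable projector parameters reported for Stage 1.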
## Compute Infrastructure

### Hardware

We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018129.

# Model Card Contact

For questions or contributions, please reach out via: nlp@nask.pl

# How to Get Started with the Model

### Inference Example using Transformers

Use the code below to run the model. We recommend using transformers >= 4.56.2.

```python
import torch
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"

# Load the processor (tokenizer + image preprocessing) and the model weights.
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto",
)

# Download an example image.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

conversation = [
    {
        "role": "user",
        # The <image> placeholder marks where the image features are inserted.
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail"
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
input_len = inputs.input_ids.shape[1]
generated_ids = output[0][input_len:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```

### Inference with vLLM

You can also run the model with vLLM, as in the example below. We recommend using vllm >= 0.10.0.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from PIL import Image
import requests

model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"

# The tokenizer is only used to render the chat template into a prompt string.
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1},
)

# Download an example image.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

conversation = [
    {
        "role": "user",
        # The <image> placeholder marks where the image features are inserted.
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail"
    },
]

prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
sampling_params = SamplingParams(temperature=0.2, max_tokens=256)

output = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params
)
print(output[0].outputs[0].text)
```

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{statkiewicz2026annotation,
  title = {Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework},
  author = {Statkiewicz, Grzegorz and Dobrzeniecka, Alicja and Seweryn, Karolina and Krasnod{\k e}bska, Aleksandra and Piosek, Karolina and Bogusz, Katarzyna and Cygert, Sebastian and Kusa, Wojciech},
  booktitle = {Proceedings of the Student Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)},
  year = {2026},
  publisher = {Association for Computational Linguistics}
}
```