NASK-PIB
/

LLaVA-PLLuM

WojciechKusa committed on Mar 2

Commit

34cbc22

verified ·

1 Parent(s): 564134f

Upload index.html

Files changed (1)

index.html +146 -93

@@ -56,24 +56,35 @@
               <div class="publication-links">
                 <!-- PDF Link. -->
                 <span class="link-block">
-                  <a class="external-link button is-normal is-rounded is-dark" disabled>
                     <span class="icon">
                       <i class="ai ai-arxiv"></i>
                     </span>
-                    <span>arXiv (soon)</span>
                   </a>
                 </span>
-                <!-- Code Link. -->
                 <span class="link-block">
-                  <a href="https://huggingface.co/NASK-PIB/LLaVA-PLLuM-12B-nc-instruct"
                     target="_blank"
                     class="external-link button is-normal is-rounded is-dark">
                     <span class="icon">
-                      <i data-lucide="download"></i>
                     </span>
-                    <span>Model</span>
                   </a>
                 </span>
               </div>
             </div>
           </div>
@@ -106,6 +117,9 @@
           <!-- Introduction. -->
           <div class="columns is-centered" id="introduction">
             <div class="column is-full-width">
               <h2 class="title is-3">Introduction</h2>
               <div class="content has-text-justified">
                 <p>
@@ -133,7 +147,8 @@
                   </li>
                 </ul>
                 <p>
-                  We trained our models using automatic translation combined with manual filtering, resulting in
                   approximately 550 thousand samples for pretraining and 2 million samples for instruction fine-tuning.
                   The models
                   accurately describe images, incorporate Polish cultural context, and handle basic visual tasks such as
@@ -170,8 +185,9 @@
                   have been observed to improve fine-grained perception and OCR performance.
                 </p>
                 <p>
-                  As the language backbone, we use <strong>PLLuM-12B-nc-instruct-250715</strong>
-                  <a href="#ref-1">[1]</a>, a Polish-native, instruction-tuned LLM. For the vision tower, we replace
                   the CLIP-like encoder commonly used in LLaVA variants with
                   <strong>SigLIP2 So400m/14, 384px</strong> <a href="#ref-4">[4]</a>, selected for its strong
                   multilingual image-text alignment.
@@ -316,11 +332,13 @@
           <div class="columns is-centered" id="evaluation">
             <div class="column is-full-width">
               <h2 class="title is-3">Evaluation & Results</h2>
               We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1)
               quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality
               in Polish.
               <h3 class="title is-4">MMBench v1.1</h3>
               <div class="content has-text-justified">
                 <p>
@@ -358,50 +376,65 @@
                     <tbody>
                       <tr>
                         <td>LLaVA-1.6-Mistral-7B</td>
-                        <td>66.41%</td>
-                        <td>72.37%</td>
                       </tr>
                       <tr>
                         <td>LLaVA-1.6-Vicuna-13B</td>
-                        <td>68.29%</td>
-                        <td>74.14%</td>
                       </tr>
                       <tr class="is-selected">
                         <td><strong>LLaVA-PLLuM-12b-nc (Ours)</strong></td>
-                        <td><strong>73.89%</strong> <span class="tag is-success">+5.6%</span></td>
-                        <td><strong>73.89%</strong></td>
                       </tr>
                       <tr class="has-background-light">
                         <td colspan="3" class="has-text-centered">
                           <em>Additional Open-Source Models (different architectures)</em>
                         </td>
                       </tr>
                       <tr>
-                        <td>PaliGemma2-10B</td>
-                        <td>77.63%</td>
-                        <td>79.59%</td>
                       </tr>
                       <tr>
-                        <td>Pixtral-12B</td>
-                        <td>79.04%</td>
-                        <td>81.52%</td>
                       </tr>
                       <tr>
-                        <td>Qwen2.5-VL-7B</td>
-                        <td>74.38%</td>
-                        <td>79.02%</td>
                       </tr>
                     </tbody>
                   </table>
                 </div>
                 <p>
-                  <strong>Key Finding:</strong> Our model achieves <strong>+5.6% improvement</strong> on Polish
-                  benchmark compared to LLaVA-1.6-Vicuna-13B while maintaining comparable English performance,
-                  demonstrating significantly improved recognition of Polish context.
-                </p>
               </div>
-              <h3 class="title is-4">Model-as-a-Judge Evaluation</h3>
               <div class="content has-text-justified">
                 <p>
                   To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
@@ -409,58 +442,63 @@
                   Polish portion of the XM3600 dataset [<a href="#ref-18">18</a>].
                   The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
                   descriptions of images, making it a suitable testbed for generative multimodal performance.
-                </p>
-                <p>
-                  We benchmarked our model against three competitive open-source vision-language models of different
                   architectures: Qwen2.5-VL-7B-Instruct, Pixtral-12B, and PaliGemma-3B, complementing the MMBench
-                  evaluation.
-                </p>
-                <p>
-                  Because no Polish human-annotated standard for caption quality currently exists, we adopted an
-                  LLM-as-a-judge evaluation strategy using LLaVA-OneVision-72B, the strongest open-source VLM at the
-                  time of evaluation and capable of jointly processing the image and candidate captions.
-                  We used a pairwise comparison setup in which the judge is presented with an image and two captions and
-                  determines which description is better.
-                  Since prompt wording and input order can influence the outcome, we employed two prompt
-                  formulations—one presenting caption A before B and one reversing the order—and tested each with both
-                  model assignments (our model as A and as B).
-                  The resulting four judgments for each comparison were then averaged to obtain a stable final score.
-                </p>
-                <p>
-                  Together, these steps provide a controlled and replicable protocol for assessing Polish-language
-                  caption quality in the absence of human-annotated ground truth, while capturing the generative
-                  multimodal capabilities of the evaluated models.
                 </p>
                 <div class="table-container">
                   <table class="table is-bordered is-striped is-hoverable is-fullwidth">
                     <thead>
                       <tr>
-                        <th>Comparison</th>
-                        <th>Vision-Language Model Judge Winrate</th>
                       </tr>
                     </thead>
                     <tbody>
                       <tr>
-                        <td>LLaVA-PLLuM-12b-nc vs PaliGemma-3B</td>
-                        <td><strong>95.2%</strong> vs 4.8%</td>
                       </tr>
                       <tr>
-                        <td>LLaVA-PLLuM-12b-nc vs Qwen2.5-VL-7B</td>
-                        <td><strong>62.7%</strong> vs 37.3%</td>
                       </tr>
                       <tr>
-                        <td>LLaVA-PLLuM-12b-nc vs Pixtral-12B</td>
-                        <td><strong>59.3%</strong> vs 40.7%</td>
                       </tr>
                     </tbody>
                   </table>
                 </div>
-                <p>
-                  <strong>Key Finding:</strong> Across all comparisons, LLaVA-PLLuM is consistently preferred by the judge,
-                  indicating higher caption quality in Polish. Our qualitative analysis showed that LLaVA-PLLuM produces more
-                  grammatically correct sentences, maintains proper Polish morphology, and avoids inventing non-existent
-                  Polish words—a common failure mode observed in baseline models.
-                </p>
               </div>
             </div>
           </div>
@@ -497,7 +535,7 @@
                   machine-translated datasets, without human correction or manual
                   annotation. Starting from the open-source LLaVA model family and equipping it with the PLLuM language
                   model, we managed to improve the VLM's ability to understand the Polish language as well as aspects of
-                  Polish cultural context. We show gains of 5.6 percentage points over LLaVA-based baselines on a
                   manually corrected Polish-language version of MMBench dataset, underscoring the effectiveness of our data-efficient
                   approach.
                 </p>
@@ -522,118 +560,123 @@
                 <ol>
                   <li id="ref-1">
                     PLLuM: A Family of Polish Large Language Models -
-                    <a href="https://arxiv.org/abs/2511.03823">
                       arXiv:2511.03823
                     </a>
                   </li>
                   <li id="ref-2">
                     PLLuM Model -
-                    <a href="https://huggingface.co/CYFRAGOVPL/pllum-12b-nc-instruct-250715">
                       Hugging Face
                     </a>
                   </li>
                   <li id="ref-3">
                     LLaVA-NeXT -
-                    <a href="https://llava-vl.github.io/blog/2024-01-30-llava-next/">
                       Blog Post
                     </a>
                   </li>
                   <li id="ref-4">
                     SigLIP2 -
-                    <a href="https://arxiv.org/abs/2502.14786">
                       arXiv:2502.14786
                     </a>
                   </li>
                   <li id="ref-5">
                     ALLaVA -
-                    <a href="https://arxiv.org/abs/2402.11684">
                       arXiv:2402.11684
                     </a>
                   </li>
                   <li id="ref-6">
                     Visual Instruction Tuning (LLaVA) -
-                    <a href="https://arxiv.org/abs/2304.08485">
                       arXiv:2304.08485
                     </a>
                   </li>
                   <li id="ref-7">
                     Q-Instruct -
-                    <a href="https://arxiv.org/abs/2311.06783">
                       arXiv:2311.06783
                     </a>
                   </li>
                   <li id="ref-8">
                     LVIS-Instruct4V -
-                    <a href="https://arxiv.org/abs/2311.07574">
                       arXiv:2311.07574
                     </a>
                   </li>
                   <li id="ref-9">
                     A-OKVQA -
-                    <a href="https://arxiv.org/abs/2206.01718">
                       arXiv:2206.01718
                     </a>
                   </li>
                   <li id="ref-10">
                     SynthDoG -
-                    <a href="https://arxiv.org/abs/2111.15664">
                       arXiv:2111.15664
                     </a>
                   </li>
                   <li id="ref-11">
                     MS COCO -
-                    <a href="https://arxiv.org/abs/1405.0312">
                       arXiv:1405.0312
                     </a>
                   </li>
                   <li id="ref-12">
                     WIT Dataset -
-                    <a href="https://doi.org/10.1145/3404835.3463257">
                       ACM Digital Library
                     </a>
                   </li>
                   <li id="ref-13">
                     TallyQA -
-                    <a href="https://arxiv.org/abs/1810.12440">
                       arXiv:1810.12440
                     </a>
                   </li>
                   <li id="ref-14">
                     Tower+ Translation Model -
-                    <a href="https://huggingface.co/Unbabel/Tower-Plus-72B">
                       Hugging Face
                     </a>
                   </li>
                   <li id="ref-15">
                     COMET Metric -
-                    <a href="https://unbabel.github.io/COMET/html/index.html">
                       Documentation
                     </a>
                   </li>
                   <li id="ref-16">
                     LLaVA-Pretrain Dataset -
-                    <a href="https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain">
                       Hugging Face
                     </a>
                   </li>
                   <li id="ref-17">
                     MMBench -
-                    <a href="https://huggingface.co/spaces/opencompass/open_vlm_leaderboard">
                       OpenCompass Leaderboard
                     </a>
                   </li>
                   <li id="ref-18">
                     Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset -
-                    <a href="https://aclanthology.org/2022.emnlp-main.45/">
                       EMNLP 2022
                     </a>
                   </li>
                   <li id="ref-19">
                     Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) -
-                    <a href="https://arxiv.org/abs/2310.03744">
                       arXiv:2310.03744
                     </a>
                 </ol>
               </div>
             </div>
@@ -644,13 +687,23 @@
             <div class="column is-full-width">
               <h2 class="title is-3">BibTeX</h2>
               <pre><code>
-@misc{statkiewicz2025llavapllum,
-  title={LLaVA-PLLuM: Building an Open Polish Vision-Language Model},
-  author={Statkiewicz, Grzegorz and Dobrzeniecka, Alicja and
-          Krasnodębska, Aleksandra and Cygert, Sebastian and Kusa, Wojciech},
-  year={2025},
-  note={Blog post}
 }
                 </code></pre>
             </div>
           </div>
@@ -665,7 +718,7 @@
         <p>
           This website is adapted from <a href="https://github.com/nerfies/nerfies.github.io" target="_blank">Nerfies</a>, licensed
           under a
-          <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike
             4.0 International License</a>.
         </p>
       </div>
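The pairwise judging protocol described in the removed paragraphs above (two prompt formulations, each run with our model as caption A and as caption B, then averaged over the four judgments) could be sketched roughly as follows. This is an illustrative sketch only: the `judge` callable, its signature, and the helper names are assumptions, not the authors' evaluation code.

```python
from itertools import product
from statistics import mean
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature (an assumption for illustration):
# (image, caption_a, caption_b, a_first) -> "A" or "B".
# `a_first` selects which of the two prompt formulations is used.
Judge = Callable[[str, str, str, bool], str]


def pairwise_score(judge: Judge, image: str, ours: str, baseline: str) -> float:
    """Fraction of the four judgments (2 prompts x 2 assignments) won by our caption."""
    wins = []
    for a_first, ours_is_a in product((True, False), repeat=2):
        # Assign our caption to slot A or slot B.
        cap_a, cap_b = (ours, baseline) if ours_is_a else (baseline, ours)
        verdict = judge(image, cap_a, cap_b, a_first)
        # We win when the judge picks the slot that holds our caption.
        wins.append(1.0 if (verdict == "A") == ours_is_a else 0.0)
    return mean(wins)


def win_rate(judge: Judge, samples: Iterable[Tuple[str, str, str]]) -> float:
    """Average win rate in percent over (image, ours, baseline) samples."""
    return 100.0 * mean(pairwise_score(judge, *s) for s in samples)
```

With this setup, a judge that always answers "A" regardless of content scores exactly 0.5 per sample, which is why swapping the assignments helps cancel position bias.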

               <div class="publication-links">
                 <!-- PDF Link. -->
                 <span class="link-block">
+                  <a href="https://arxiv.org/abs/2602.14073"
+                  class="external-link button is-normal is-rounded is-dark">
                     <span class="icon">
                       <i class="ai ai-arxiv"></i>
                     </span>
+                    <span>arXiv</span>
                   </a>
                 </span>
+                <!-- Model Link. -->
                 <span class="link-block">
+                  <a href="https://huggingface.co/collections/NASK-PIB/llava-pllum"
                     target="_blank"
                     class="external-link button is-normal is-rounded is-dark">
                     <span class="icon">
+                      <i class="fas fa-share-square"></i>
                     </span>
+                    <span>Models</span>
                   </a>
                 </span>
+                <!-- Dataset Link. -->
+                <span class="link-block">
+                  <a href="https://huggingface.co/datasets/NASK-PIB/MMBench_V11_PL"
+                    target="_blank"
+                    class="external-link button is-normal is-rounded is-dark">
+                    <span class="icon">
+                      <i class="fas fa-database"></i>
+                    </span>
+                    <span>Dataset</span>
+                  </a>
+                </span>
               </div>
             </div>
           </div>
           <!-- Introduction. -->
           <div class="columns is-centered" id="introduction">
             <div class="column is-full-width">
+            <div class="notification is-info is-light">
+              <p><strong>🔥 Update Feb 2026:</strong> We added two new models, NASK-PIB/LLaVA-PLLuM-12b-nc-instruct and NASK-PIB/LLaVA-Bielik-11b-v2.6-instruct, and released our Polish translation of the MMBench dataset [<a href="https://huggingface.co/collections/NASK-PIB/llava-pllum">Hugging Face</a>].</p>
+            </div>
               <h2 class="title is-3">Introduction</h2>
               <div class="content has-text-justified">
                 <p>
                   </li>
                 </ul>
                 <p>
+                  We also train a model based on the Bielik-11B-v2.6 language model <a href="#ref-20">[20]</a> as an alternative backbone, to explore the impact of different LLMs on Polish multimodal performance.
+                  We train our models using automatic translation combined with manual filtering, resulting in
                   approximately 550 thousand samples for pretraining and 2 million samples for instruction fine-tuning.
                   The models
                   accurately describe images, incorporate Polish cultural context, and handle basic visual tasks such as
                   have been observed to improve fine-grained perception and OCR performance.
                 </p>
                 <p>
+                  As the language backbone, we use three different Polish-native, instruction-tuned LLMs: <strong>PLLuM-12B-nc-instruct-250715</strong>
+                  <a href="#ref-1">[1]</a>, <strong>PLLuM-12B-nc-instruct</strong>, and <strong>Bielik-11B-v2.6</strong> <a href="#ref-20">[20]</a>.
+                  For the vision tower, we replace
                   the CLIP-like encoder commonly used in LLaVA variants with
                   <strong>SigLIP2 So400m/14, 384px</strong> <a href="#ref-4">[4]</a>, selected for its strong
                   multilingual image-text alignment.
           <div class="columns is-centered" id="evaluation">
             <div class="column is-full-width">
               <h2 class="title is-3">Evaluation & Results</h2>
+              <div class="content has-text-justified">
+              <p>
               We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1)
               quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality
               in Polish.
+              </p>
+              </div>
               <h3 class="title is-4">MMBench v1.1</h3>
               <div class="content has-text-justified">
                 <p>
                     <tbody>
                       <tr>
                         <td>LLaVA-1.6-Mistral-7B</td>
+                        <td>68.18%</td>
+                        <td>76.54%</td>
                       </tr>
                       <tr>
                         <td>LLaVA-1.6-Vicuna-13B</td>
+                        <td>69.80%</td>
+                        <td>74.39%</td>
+                      </tr>
+                      <tr>
+                        <td>LLaVA-PLLuM-12b-nc-250715 (Ours)</td>
+                        <td>76.73%</td>
+                        <td>75.23%</td>
+                      </tr>
+                      <tr>
+                        <td>LLaVA-Bielik-11b-v2.6 (Ours)</td>
+                        <td>78.24%</td>
+                        <td>77.75%</td>
                       </tr>
                       <tr class="is-selected">
                         <td><strong>LLaVA-PLLuM-12b-nc (Ours)</strong></td>
+                        <td><strong>79.35%</strong> <span class="tag is-success">+9.55 pp</span></td>
+                        <td><strong>78.43%</strong></td>
                       </tr>
                       <tr class="has-background-light">
                         <td colspan="3" class="has-text-centered">
                           <em>Additional Open-Source Models (different architectures)</em>
                         </td>
                       </tr>
                       <tr>
+                        <td>Qwen2.5-VL-7B</td>
+                        <td>75.56%</td>
+                        <td>80.62%</td>
                       </tr>
                       <tr>
+                        <td>PaliGemma2-10B</td>
+                        <td>78.39%</td>
+                        <td>80.46%</td>
                       </tr>
                       <tr>
+                        <td>Pixtral-12B</td>
+                        <td>82.06%</td>
+                        <td>84.31%</td>
                       </tr>
                     </tbody>
                   </table>
                 </div>
                 <p>
+                  <strong>Key Finding:</strong> Our best model, LLaVA-PLLuM-12b-nc, achieves <strong>79.35%</strong> on the Polish MMBench
+                  v1.1 benchmark, a <strong>9.55 percentage point improvement</strong> over LLaVA-1.6-Vicuna-13B (69.80%),
+                  while maintaining strong English performance at 78.43%. This indicates improved
+                  recognition of Polish context and linguistic understanding. Among the other open-source models,
+                  it outperforms Qwen2.5-VL-7B (75.56%) and PaliGemma2-10B (78.39%) on the Polish benchmark,
+                  while Pixtral-12B (82.06%) remains ahead.
+                </p>
               </div>
+              <h3 class="title is-4">Open-ended Generation Evaluation</h3>
               <div class="content has-text-justified">
                 <p>
                   To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
                   Polish portion of the XM3600 dataset [<a href="#ref-18">18</a>].
                   The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
                   descriptions of images, making it a suitable testbed for generative multimodal performance.
+                  We benchmarked our models against three competitive open-source vision-language models of different
                   architectures: Qwen2.5-VL-7B-Instruct, Pixtral-12B, and PaliGemma-3B, complementing the MMBench
+                  evaluation. Because no Polish human-annotated standard for caption quality currently exists, we adopted a threefold evaluation strategy:
+                  (1) open-source LLM and VLM judges, (2) a closed-source VLM judge, and (3) human evaluation.
+                  Please refer to the <a href="https://arxiv.org/pdf/2602.14073" target="_blank">full paper</a> for complete details of the evaluation methodology and results.
                 </p>
                 <div class="table-container">
                   <table class="table is-bordered is-striped is-hoverable is-fullwidth">
                     <thead>
                       <tr>
+                        <th>Baseline</th>
+                        <th>LLaVA-PLLuM-12B-nc-250715</th>
+                        <th>LLaVA-PLLuM-12B-nc</th>
+                        <th>LLaVA-Bielik-11B-v2.6</th>
                       </tr>
                     </thead>
                     <tbody>
                       <tr>
+                        <td>LLaVA-1.6-Mistral-7B</td>
+                        <td>84.91%</td>
+                        <td><strong>85.81%</strong></td>
+                        <td>82.35%</td>
+                      </tr>
+                      <tr>
+                        <td>LLaVA-1.6-Vicuna-13B</td>
+                        <td>63.64%</td>
+                        <td><strong>66.71%</strong></td>
+                        <td>60.32%</td>
+                      </tr>
+                      <tr>
+                        <td>PaliGemma2-10B</td>
+                        <td>77.47%</td>
+                        <td><strong>77.53%</strong></td>
+                        <td>74.10%</td>
                       </tr>
                       <tr>
+                        <td>Pixtral-12B</td>
+                        <td>43.38%</td>
+                        <td>48.33%</td>
+                        <td>40.31%</td>
                       </tr>
                       <tr>
+                        <td>Qwen2.5-VL-7B</td>
+                        <td>42.69%</td>
+                        <td>43.15%</td>
+                        <td>34.76%</td>
                       </tr>
                     </tbody>
+                    <tfoot>
+                      <tr class="has-background-light">
+                        <td colspan="4" class="has-text-centered">
+                          <em>Preference rate (%) of each of our models (columns) over each baseline (rows), as judged by an LLM (Llama-3.3-70B-Instruct) on the XM3600 dataset for linguistic correctness of the descriptions.</em>
+                        </td>
+                      </tr>
+                    </tfoot>
                   </table>
                 </div>
               </div>
             </div>
           </div>
                   machine-translated datasets, without human correction or manual
                   annotation. Starting from the open-source LLaVA model family and equipping it with the PLLuM language
                   model, we managed to improve the VLM's ability to understand the Polish language as well as aspects of
+                  Polish cultural context. We show gains of 9.55 percentage points over the strongest LLaVA-based baseline on a
                   manually corrected Polish-language version of the MMBench dataset, underscoring the effectiveness of our data-efficient
                   approach.
                 </p>
                 <ol>
                   <li id="ref-1">
                     PLLuM: A Family of Polish Large Language Models -
+                    <a href="https://arxiv.org/abs/2511.03823" target="_blank">
                       arXiv:2511.03823
                     </a>
                   </li>
                   <li id="ref-2">
                     PLLuM Model -
+                    <a href="https://huggingface.co/CYFRAGOVPL/pllum-12b-nc-instruct-250715" target="_blank">
                       Hugging Face
                     </a>
                   </li>
                   <li id="ref-3">
                     LLaVA-NeXT -
+                    <a href="https://llava-vl.github.io/blog/2024-01-30-llava-next/" target="_blank">
                       Blog Post
                     </a>
                   </li>
                   <li id="ref-4">
                     SigLIP2 -
+                    <a href="https://arxiv.org/abs/2502.14786" target="_blank">
                       arXiv:2502.14786
                     </a>
                   </li>
                   <li id="ref-5">
                     ALLaVA -
+                    <a href="https://arxiv.org/abs/2402.11684" target="_blank">
                       arXiv:2402.11684
                     </a>
                   </li>
                   <li id="ref-6">
                     Visual Instruction Tuning (LLaVA) -
+                    <a href="https://arxiv.org/abs/2304.08485" target="_blank">
                       arXiv:2304.08485
                     </a>
                   </li>
                   <li id="ref-7">
                     Q-Instruct -
+                    <a href="https://arxiv.org/abs/2311.06783" target="_blank">
                       arXiv:2311.06783
                     </a>
                   </li>
                   <li id="ref-8">
                     LVIS-Instruct4V -
+                    <a href="https://arxiv.org/abs/2311.07574" target="_blank">
                       arXiv:2311.07574
                     </a>
                   </li>
                   <li id="ref-9">
                     A-OKVQA -
+                    <a href="https://arxiv.org/abs/2206.01718" target="_blank">
                       arXiv:2206.01718
                     </a>
                   </li>
                   <li id="ref-10">
                     SynthDoG -
+                    <a href="https://arxiv.org/abs/2111.15664" target="_blank">
                       arXiv:2111.15664
                     </a>
                   </li>
                   <li id="ref-11">
                     MS COCO -
+                    <a href="https://arxiv.org/abs/1405.0312" target="_blank">
                       arXiv:1405.0312
                     </a>
                   </li>
                   <li id="ref-12">
                     WIT Dataset -
+                    <a href="https://doi.org/10.1145/3404835.3463257" target="_blank">
                       ACM Digital Library
                     </a>
                   </li>
                   <li id="ref-13">
                     TallyQA -
+                    <a href="https://arxiv.org/abs/1810.12440" target="_blank">
                       arXiv:1810.12440
                     </a>
                   </li>
                   <li id="ref-14">
                     Tower+ Translation Model -
+                    <a href="https://huggingface.co/Unbabel/Tower-Plus-72B" target="_blank">
                       Hugging Face
                     </a>
                   </li>
                   <li id="ref-15">
                     COMET Metric -
+                    <a href="https://unbabel.github.io/COMET/html/index.html" target="_blank">
                       Documentation
                     </a>
                   </li>
                   <li id="ref-16">
                     LLaVA-Pretrain Dataset -
+                    <a href="https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain" target="_blank">
                       Hugging Face
                     </a>
                   </li>
                   <li id="ref-17">
                     MMBench -
+                    <a href="https://huggingface.co/spaces/opencompass/open_vlm_leaderboard" target="_blank">
                       OpenCompass Leaderboard
                     </a>
                   </li>
                   <li id="ref-18">
                     Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset -
+                    <a href="https://aclanthology.org/2022.emnlp-main.45/" target="_blank">
                       EMNLP 2022
                     </a>
                   </li>
                   <li id="ref-19">
                     Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) -
+                    <a href="https://arxiv.org/abs/2310.03744" target="_blank">
                       arXiv:2310.03744
                     </a>
+                  </li>
+                  <li id="ref-20">
+                    Bielik 11B v2 Technical Report -
+                    <a href="https://arxiv.org/abs/2505.02410" target="_blank">
+                      arXiv:2505.02410
+                    </a>
+                  </li>
                 </ol>
               </div>
             </div>
             <div class="column is-full-width">
               <h2 class="title is-3">BibTeX</h2>
               <pre><code>
+@inproceedings{statkiewicz2026annotation,
+  title     = {Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework},
+  author    = {Statkiewicz, Grzegorz and
+               Dobrzeniecka, Alicja and
+               Seweryn, Karolina and
+               Krasnodębska, Aleksandra and
+               Piosek, Karolina and
+               Bogusz, Katarzyna and
+               Cygert, Sebastian and
+               Kusa, Wojciech},
+  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop},
+  month     = mar,
+  year      = {2026},
+  address   = {Rabat, Morocco},
+  publisher = {Association for Computational Linguistics}
 }
                 </code></pre>
             </div>
           </div>
         <p>
           This website is adapted from <a href="https://github.com/nerfies/nerfies.github.io" target="_blank">Nerfies</a>, licensed
           under a
+          <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative Commons Attribution-ShareAlike
             4.0 International License</a>.
         </p>
       </div>
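The headline gain in the updated Key Finding is a difference between two accuracies, so it is measured in percentage points rather than as a relative percentage. A minimal sketch that recomputes it from the Polish MMBench v1.1 values in the table added by this diff (the dictionary and function names are illustrative assumptions):

```python
# Polish MMBench v1.1 accuracy (%), copied from the table in this diff.
polish_accuracy = {
    "LLaVA-1.6-Mistral-7B": 68.18,
    "LLaVA-1.6-Vicuna-13B": 69.80,
    "LLaVA-PLLuM-12b-nc-250715": 76.73,
    "LLaVA-Bielik-11b-v2.6": 78.24,
    "LLaVA-PLLuM-12b-nc": 79.35,
}


def gain_pp(model: str, baseline: str, scores=polish_accuracy) -> float:
    """Absolute gain in percentage points (difference of two accuracies)."""
    return round(scores[model] - scores[baseline], 2)


def relative_gain_pct(model: str, baseline: str, scores=polish_accuracy) -> float:
    """Relative gain in percent of the baseline accuracy."""
    return round(100.0 * (scores[model] - scores[baseline]) / scores[baseline], 2)


if __name__ == "__main__":
    # Absolute gain of the best model over the stronger LLaVA-1.6 baseline.
    print(gain_pp("LLaVA-PLLuM-12b-nc", "LLaVA-1.6-Vicuna-13B"))  # 9.55
```

The relative gain returned by `relative_gain_pct` is a different and larger number, which is why labelling the 9.55 figure as percentage points avoids ambiguity.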