WojciechKusa committed on
Commit
34cbc22
·
verified ·
1 Parent(s): 564134f

Upload index.html

Files changed (1)
  1. index.html +146 -93
index.html CHANGED
@@ -56,24 +56,35 @@
56
  <div class="publication-links">
57
  <!-- PDF Link. -->
58
  <span class="link-block">
59
- <a class="external-link button is-normal is-rounded is-dark" disabled>
 
60
  <span class="icon">
61
  <i class="ai ai-arxiv"></i>
62
  </span>
63
- <span>arXiv (soon)</span>
64
  </a>
65
  </span>
66
- <!-- Code Link. -->
67
  <span class="link-block">
68
- <a href="https://huggingface.co/NASK-PIB/LLaVA-PLLuM-12B-nc-instruct"
69
  target="_blank"
70
  class="external-link button is-normal is-rounded is-dark">
71
  <span class="icon">
72
- <i data-lucide="download"></i>
73
  </span>
74
- <span>Model</span>
75
  </a>
76
  </span>
77
  </div>
78
  </div>
79
  </div>
@@ -106,6 +117,9 @@
106
  <!-- Introduction. -->
107
  <div class="columns is-centered" id="introduction">
108
  <div class="column is-full-width">
109
  <h2 class="title is-3">Introduction</h2>
110
  <div class="content has-text-justified">
111
  <p>
@@ -133,7 +147,8 @@
133
  </li>
134
  </ul>
135
  <p>
136
- We trained our models using automatic translation combined with manual filtering, resulting in
 
137
  approximately 550 thousand samples for pretraining and 2 million samples for instruction fine-tuning.
138
  The models
139
  accurately describe images, incorporate Polish cultural context, and handle basic visual tasks such as
@@ -170,8 +185,9 @@
170
  have been observed to improve fine-grained perception and OCR performance.
171
  </p>
172
  <p>
173
- As the language backbone, we use <strong>PLLuM-12B-nc-instruct-250715</strong>
174
- <a href="#ref-1">[1]</a>, a Polish-native, instruction-tuned LLM. For the vision tower, we replace
 
175
  the CLIP-like encoder commonly used in LLaVA variants with
176
  <strong>SigLIP2 So400m/14, 384px</strong> <a href="#ref-4">[4]</a>, selected for its strong
177
  multilingual image-text alignment.
@@ -316,11 +332,13 @@
316
  <div class="columns is-centered" id="evaluation">
317
  <div class="column is-full-width">
318
  <h2 class="title is-3">Evaluation & Results</h2>
319
-
 
320
  We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1)
321
  quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality
322
  in Polish.
323
-
 
324
  <h3 class="title is-4">MMBench v1.1</h3>
325
  <div class="content has-text-justified">
326
  <p>
@@ -358,50 +376,65 @@
358
  <tbody>
359
  <tr>
360
  <td>LLaVA-1.6-Mistral-7B</td>
361
- <td>66.41%</td>
362
- <td>72.37%</td>
363
  </tr>
364
  <tr>
365
  <td>LLaVA-1.6-Vicuna-13B</td>
366
- <td>68.29%</td>
367
- <td>74.14%</td>
368
  </tr>
369
  <tr class="is-selected">
370
  <td><strong>LLaVA-PLLuM-12b-nc (Ours)</strong></td>
371
- <td><strong>73.89%</strong> <span class="tag is-success">+5.6%</span></td>
372
- <td><strong>73.89%</strong></td>
373
  </tr>
 
374
  <tr class="has-background-light">
375
  <td colspan="3" class="has-text-centered">
376
  <em>Additional Open-Source Models (different architectures)</em>
377
  </td>
378
  </tr>
 
379
  <tr>
380
- <td>PaliGemma2-10B</td>
381
- <td>77.63%</td>
382
- <td>79.59%</td>
383
  </tr>
384
  <tr>
385
- <td>Pixtral-12B</td>
386
- <td>79.04%</td>
387
- <td>81.52%</td>
388
  </tr>
389
  <tr>
390
- <td>Qwen2.5-VL-7B</td>
391
- <td>74.38%</td>
392
- <td>79.02%</td>
393
  </tr>
394
  </tbody>
395
  </table>
396
  </div>
397
  <p>
398
- <strong>Key Finding:</strong> Our model achieves <strong>+5.6% improvement</strong> on Polish
399
- benchmark compared to LLaVA-1.6-Vicuna-13B while maintaining comparable English performance,
400
- demonstrating significantly improved recognition of Polish context.
401
- </p>
402
  </div>
403
 
404
- <h3 class="title is-4">Model-as-a-Judge Evaluation</h3>
405
  <div class="content has-text-justified">
406
  <p>
407
  To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
@@ -409,58 +442,63 @@
409
  Polish portion of the XM3600 dataset [<a href="#ref-18">18</a>].
410
  The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
411
  descriptions of images, making it a suitable testbed for generative multimodal performance.
412
- </p>
413
- <p>
414
- We benchmarked our model against three competitive open-source vision-language models of different
415
  architectures: Qwen2.5-VL-7B-Instruct, Pixtral-12B, and PaliGemma-3B, complementing the MMBench
416
- evaluation.
417
- </p>
418
- <p>
419
- Because no Polish human-annotated standard for caption quality currently exists, we adopted an
420
- LLM-as-a-judge evaluation strategy using LLaVA-OneVision-72B, the strongest open-source VLM at the
421
- time of evaluation and capable of jointly processing the image and candidate captions.
422
- We used a pairwise comparison setup in which the judge is presented with an image and two captions and
423
- determines which description is better.
424
- Since prompt wording and input order can influence the outcome, we employed two prompt
425
- formulations—one presenting caption A before B and one reversing the order—and tested each with both
426
- model assignments (our model as A and as B).
427
- The resulting four judgments for each comparison were then averaged to obtain a stable final score.
428
- </p>
429
- <p>
430
- Together, these steps provide a controlled and replicable protocol for assessing Polish-language
431
- caption quality in the absence of human-annotated ground truth, while capturing the generative
432
- multimodal capabilities of the evaluated models.
433
  </p>
434
  <div class="table-container">
435
  <table class="table is-bordered is-striped is-hoverable is-fullwidth">
436
  <thead>
437
  <tr>
438
- <th>Comparison</th>
439
- <th>Vision-Language Model Judge Winrate</th>
 
 
440
  </tr>
441
  </thead>
442
  <tbody>
443
  <tr>
444
- <td>LLaVA-PLLuM-12b-nc vs PaliGemma-3B</td>
445
- <td><strong>95.2%</strong> vs 4.8%</td>
446
  </tr>
447
  <tr>
448
- <td>LLaVA-PLLuM-12b-nc vs Qwen2.5-VL-7B</td>
449
- <td><strong>62.7%</strong> vs 37.3%</td>
 
 
450
  </tr>
451
  <tr>
452
- <td>LLaVA-PLLuM-12b-nc vs Pixtral-12B</td>
453
- <td><strong>59.3%</strong> vs 40.7%</td>
 
 
454
  </tr>
455
  </tbody>
456
  </table>
457
  </div>
458
- <p>
459
- <strong>Key Finding:</strong> Across all comparisons, LLaVA-PLLuM is consistently preferred by the judge,
460
- indicating higher caption quality in Polish. Our qualitative analysis showed that LLaVA-PLLuM produces more
461
- grammatically correct sentences, maintains proper Polish morphology, and avoids inventing non-existent
462
- Polish words—a common failure mode observed in baseline models.
463
- </p>
464
  </div>
465
  </div>
466
  </div>
@@ -497,7 +535,7 @@
497
  machine-translated datasets, without human correction or manual
498
  annotation. Starting from the open-source LLaVA model family and equipping it with the PLLuM language
499
  model, we managed to improve the VLM's ability to understand the Polish language as well as aspects of
500
- Polish cultural context. We show gains of 5.6 percentage points over LLaVA-based baselines on a
501
  manually corrected Polish-language version of the MMBench dataset, underscoring the effectiveness of our data-efficient
502
  approach.
503
  </p>
@@ -522,118 +560,123 @@
522
  <ol>
523
  <li id="ref-1">
524
  PLLuM: A Family of Polish Large Language Models -
525
- <a href="https://arxiv.org/abs/2511.03823">
526
  arXiv:2511.03823
527
  </a>
528
  </li>
529
  <li id="ref-2">
530
  PLLuM Model -
531
- <a href="https://huggingface.co/CYFRAGOVPL/pllum-12b-nc-instruct-250715">
532
  Hugging Face
533
  </a>
534
  </li>
535
  <li id="ref-3">
536
  LLaVA-NeXT -
537
- <a href="https://llava-vl.github.io/blog/2024-01-30-llava-next/">
538
  Blog Post
539
  </a>
540
  </li>
541
  <li id="ref-4">
542
  SigLIP2 -
543
- <a href="https://arxiv.org/abs/2502.14786">
544
  arXiv:2502.14786
545
  </a>
546
  </li>
547
  <li id="ref-5">
548
  ALLaVA -
549
- <a href="https://arxiv.org/abs/2402.11684">
550
  arXiv:2402.11684
551
  </a>
552
  </li>
553
  <li id="ref-6">
554
  Visual Instruction Tuning (LLaVA) -
555
- <a href="https://arxiv.org/abs/2304.08485">
556
  arXiv:2304.08485
557
  </a>
558
  </li>
559
  <li id="ref-7">
560
  Q-Instruct -
561
- <a href="https://arxiv.org/abs/2311.06783">
562
  arXiv:2311.06783
563
  </a>
564
  </li>
565
  <li id="ref-8">
566
  LVIS-Instruct4V -
567
- <a href="https://arxiv.org/abs/2311.07574">
568
  arXiv:2311.07574
569
  </a>
570
  </li>
571
  <li id="ref-9">
572
  A-OKVQA -
573
- <a href="https://arxiv.org/abs/2206.01718">
574
  arXiv:2206.01718
575
  </a>
576
  </li>
577
  <li id="ref-10">
578
  SynthDoG -
579
- <a href="https://arxiv.org/abs/2111.15664">
580
  arXiv:2111.15664
581
  </a>
582
  </li>
583
  <li id="ref-11">
584
  MS COCO -
585
- <a href="https://arxiv.org/abs/1405.0312">
586
  arXiv:1405.0312
587
  </a>
588
  </li>
589
  <li id="ref-12">
590
  WIT Dataset -
591
- <a href="https://doi.org/10.1145/3404835.3463257">
592
  ACM Digital Library
593
  </a>
594
  </li>
595
  <li id="ref-13">
596
  TallyQA -
597
- <a href="https://arxiv.org/abs/1810.12440">
598
  arXiv:1810.12440
599
  </a>
600
  </li>
601
  <li id="ref-14">
602
  Tower+ Translation Model -
603
- <a href="https://huggingface.co/Unbabel/Tower-Plus-72B">
604
  Hugging Face
605
  </a>
606
  </li>
607
  <li id="ref-15">
608
  COMET Metric -
609
- <a href="https://unbabel.github.io/COMET/html/index.html">
610
  Documentation
611
  </a>
612
  </li>
613
  <li id="ref-16">
614
  LLaVA-Pretrain Dataset -
615
- <a href="https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain">
616
  Hugging Face
617
  </a>
618
  </li>
619
  <li id="ref-17">
620
  MMBench -
621
- <a href="https://huggingface.co/spaces/opencompass/open_vlm_leaderboard">
622
  OpenCompass Leaderboard
623
  </a>
624
  </li>
625
  <li id="ref-18">
626
  Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset -
627
- <a href="https://aclanthology.org/2022.emnlp-main.45/">
628
  EMNLP 2022
629
  </a>
630
  </li>
631
  <li id="ref-19">
632
  Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) -
633
- <a href="https://arxiv.org/abs/2310.03744">
634
  arXiv:2310.03744
635
  </a>
636
-
637
  </ol>
638
  </div>
639
  </div>
@@ -644,13 +687,23 @@
644
  <div class="column is-full-width">
645
  <h2 class="title is-3">BibTeX</h2>
646
  <pre><code>
647
- @misc{statkiewicz2025llavapllum,
648
- title={LLaVA-PLLuM: Building an Open Polish Vision-Language Model},
649
- author={Statkiewicz, Grzegorz and Dobrzeniecka, Alicja and
650
- Krasnodębska, Aleksandra and Cygert, Sebastian and Kusa, Wojciech},
651
- year={2025},
652
- note={Blog post}
653
  }
 
654
  </code></pre>
655
  </div>
656
  </div>
@@ -665,7 +718,7 @@
665
  <p>
666
  This website is adapted from <a href="https://github.com/nerfies/nerfies.github.io" target="_blank">Nerfies</a>, licensed
667
  under a
668
- <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike
669
  4.0 International License</a>.
670
  </p>
671
  </div>
 
56
  <div class="publication-links">
57
  <!-- PDF Link. -->
58
  <span class="link-block">
59
+ <a href="https://arxiv.org/abs/2602.14073"
60
+ class="external-link button is-normal is-rounded is-dark">
61
  <span class="icon">
62
  <i class="ai ai-arxiv"></i>
63
  </span>
64
+ <span>arXiv</span>
65
  </a>
66
  </span>
67
+ <!-- Model Link. -->
68
  <span class="link-block">
69
+ <a href="https://huggingface.co/collections/NASK-PIB/llava-pllum"
70
  target="_blank"
71
  class="external-link button is-normal is-rounded is-dark">
72
  <span class="icon">
73
+ <i class="fas fa-share-square"></i>
74
  </span>
75
+ <span>Models</span>
76
  </a>
77
  </span>
78
+ <!-- Dataset Link. -->
79
+ <span class="link-block">
80
+ <a href="https://huggingface.co/datasets/NASK-PIB/MMBench_V11_PL"
81
+ target="_blank"
82
+ class="external-link button is-normal is-rounded is-dark">
83
+ <span class="icon">
84
+ <i class="fas fa-database"></i>
85
+ </span>
86
+ <span>Dataset</span>
87
+ </a></span>
88
  </div>
89
  </div>
90
  </div>
 
117
  <!-- Introduction. -->
118
  <div class="columns is-centered" id="introduction">
119
  <div class="column is-full-width">
120
+ <div class="notification is-info is-light">
121
+ <p><strong>🔥Update Feb 2026:</strong> Added two new models—NASK-PIB/LLaVA-PLLuM-12b-nc-instruct and NASK-PIB/LLaVA-Bielik-11b-v2.6-instruct—and released our Polish translation of the MMBench dataset [<a href="https://huggingface.co/collections/NASK-PIB/llava-pllum">HuggingFace</a>]</p>
122
+ </div>
123
  <h2 class="title is-3">Introduction</h2>
124
  <div class="content has-text-justified">
125
  <p>
 
147
  </li>
148
  </ul>
149
  <p>
150
+ We also train a model based on the Bielik-11B-v2.6 language model <a href="#ref-20">[20]</a> as an alternative backbone, to explore the impact of different LLMs on Polish multimodal performance.
151
+ We train our models using automatic translation combined with manual filtering, resulting in
152
  approximately 550 thousand samples for pretraining and 2 million samples for instruction fine-tuning.
153
  The models
154
  accurately describe images, incorporate Polish cultural context, and handle basic visual tasks such as
 
185
  have been observed to improve fine-grained perception and OCR performance.
186
  </p>
187
  <p>
188
+ As language backbones, we use three Polish-native, instruction-tuned LLMs: <strong>PLLuM-12B-nc-instruct-250715</strong>
189
+ <a href="#ref-1">[1]</a>, <strong>PLLuM-12B-nc-instruct</strong>, and <strong>Bielik-11b-v2.6</strong> <a href="#ref-20">[20]</a>.
190
+ For the vision tower, we replace
191
  the CLIP-like encoder commonly used in LLaVA variants with
192
  <strong>SigLIP2 So400m/14, 384px</strong> <a href="#ref-4">[4]</a>, selected for its strong
193
  multilingual image-text alignment.
 
332
  <div class="columns is-centered" id="evaluation">
333
  <div class="column is-full-width">
334
  <h2 class="title is-3">Evaluation & Results</h2>
335
+ <div class="content has-text-justified">
336
+ <p>
337
  We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1)
338
  quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality
339
  in Polish.
340
+ </p>
341
+ </div>
342
  <h3 class="title is-4">MMBench v1.1</h3>
343
  <div class="content has-text-justified">
344
  <p>
 
376
  <tbody>
377
  <tr>
378
  <td>LLaVA-1.6-Mistral-7B</td>
379
+ <td>68.18%</td>
380
+ <td>76.54%</td>
381
  </tr>
382
  <tr>
383
  <td>LLaVA-1.6-Vicuna-13B</td>
384
+ <td>69.80%</td>
385
+ <td>74.39%</td>
386
+ </tr>
387
+ <tr>
388
+ <td>LLaVA-PLLuM-12b-nc-250715 (Ours)</td>
389
+ <td>76.73%</td>
390
+ <td>75.23%</td>
391
+ </tr>
392
+ <tr>
393
+ <td>LLaVA-Bielik-11b-v2.6 (Ours)</td>
394
+ <td>78.24%</td>
395
+ <td>77.75%</td>
396
  </tr>
397
  <tr class="is-selected">
398
  <td><strong>LLaVA-PLLuM-12b-nc (Ours)</strong></td>
399
+ <td><strong>79.35%</strong> <span class="tag is-success">+9.55 pp</span></td>
400
+ <td><strong>78.43%</strong></td>
401
  </tr>
402
+
403
  <tr class="has-background-light">
404
  <td colspan="3" class="has-text-centered">
405
  <em>Additional Open-Source Models (different architectures)</em>
406
  </td>
407
  </tr>
408
+
409
  <tr>
410
+ <td>Qwen2.5-VL-7B</td>
411
+ <td>75.56%</td>
412
+ <td>80.62%</td>
413
  </tr>
414
  <tr>
415
+ <td>PaliGemma2-10B</td>
416
+ <td>78.39%</td>
417
+ <td>80.46%</td>
418
  </tr>
419
  <tr>
420
+ <td>Pixtral-12B</td>
421
+ <td>82.06%</td>
422
+ <td>84.31%</td>
423
  </tr>
424
  </tbody>
425
  </table>
426
  </div>
427
  <p>
428
+ <strong>Key Finding:</strong> Our best model achieves <strong>79.35%</strong> on the Polish MMBench
429
+ v1.1 benchmark, a <strong>+9.55 percentage point improvement</strong> over LLaVA-1.6-Vicuna-13B (69.80%)
430
+ while maintaining strong English performance at 78.43%. This demonstrates improved
431
+ recognition of Polish context and linguistic understanding. Among the additional open-source models,
432
+ LLaVA-PLLuM outperforms Qwen2.5-VL-7B (75.56%) and PaliGemma2-10B (78.39%) on the Polish benchmark,
433
+ while Pixtral-12B (82.06%) remains ahead.
434
+ </p>
435
  </div>
436
 
437
+ <h3 class="title is-4">Open-ended Generation Evaluation</h3>
438
  <div class="content has-text-justified">
439
  <p>
440
  To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
 
442
  Polish portion of the XM3600 dataset [<a href="#ref-18">18</a>].
443
  The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
444
  descriptions of images, making it a suitable testbed for generative multimodal performance.
445
+ We benchmarked our models against three competitive open-source vision-language models of different
 
 
446
  architectures: Qwen2.5-VL-7B-Instruct, Pixtral-12B, and PaliGemma-3B, complementing the MMBench
447
+ evaluation. Because no Polish human-annotated standard for caption quality currently exists, we adopted a threefold evaluation strategy:
448
+ (1) open-source LLM and VLM judges, (2) a closed-source VLM judge, and (3) human evaluation.
449
+ Please refer to the <a href="https://arxiv.org/pdf/2602.14073" target="_blank">full paper</a> for complete details of the evaluation methodology and results.
450
  </p>
451
  <div class="table-container">
452
  <table class="table is-bordered is-striped is-hoverable is-fullwidth">
453
  <thead>
454
  <tr>
455
+ <th>Model</th>
456
+ <th>LLaVA-PLLuM-12B-nc-250715</th>
457
+ <th>LLaVA-PLLuM-12B-nc</th>
458
+ <th>LLaVA-Bielik-11B-v2.6</th>
459
  </tr>
460
  </thead>
461
  <tbody>
462
  <tr>
463
+ <td>LLaVA-1.6-Mistral-7B</td>
464
+ <td>84.91%</td>
465
+ <td><strong>85.81%</strong></td>
466
+ <td>82.35%</td>
467
+ </tr>
468
+ <tr>
469
+ <td>LLaVA-1.6-Vicuna-13B</td>
470
+ <td>63.64%</td>
471
+ <td><strong>66.71%</strong></td>
472
+ <td>60.32%</td>
473
+ </tr>
474
+ <tr>
475
+ <td>PaliGemma2-10B</td>
476
+ <td>77.47%</td>
477
+ <td><strong>77.53%</strong></td>
478
+ <td>74.10%</td>
479
  </tr>
480
  <tr>
481
+ <td>Pixtral-12B</td>
482
+ <td>43.38%</td>
483
+ <td>48.33%</td>
484
+ <td>40.31%</td>
485
  </tr>
486
  <tr>
487
+ <td>Qwen2.5-VL-7B</td>
488
+ <td>42.69%</td>
489
+ <td>43.15%</td>
490
+ <td>34.76%</td>
491
  </tr>
492
  </tbody>
493
+ <tfoot>
494
+ <tr class="has-background-light">
495
+ <td colspan="4" class="has-text-centered">
496
+ <em>Preference rate (%) of each of our models over the baseline in each row, as judged by an LLM (Llama-3.3-70B-Instruct) on the XM3600 dataset for linguistic correctness of descriptions.</em>
497
+ </td>
498
+ </tr>
499
+ </tfoot>
500
  </table>
501
  </div>
502
  </div>
503
  </div>
504
  </div>
 
535
  machine-translated datasets, without human correction or manual
536
  annotation. Starting from the open-source LLaVA model family and equipping it with the PLLuM language
537
  model, we managed to improve the VLM's ability to understand the Polish language as well as aspects of
538
+ Polish cultural context. We show gains of 9.55 percentage points over LLaVA-based baselines on a
539
  manually corrected Polish-language version of the MMBench dataset, underscoring the effectiveness of our data-efficient
540
  approach.
541
  </p>
 
560
  <ol>
561
  <li id="ref-1">
562
  PLLuM: A Family of Polish Large Language Models -
563
+ <a href="https://arxiv.org/abs/2511.03823" target="_blank">
564
  arXiv:2511.03823
565
  </a>
566
  </li>
567
  <li id="ref-2">
568
  PLLuM Model -
569
+ <a href="https://huggingface.co/CYFRAGOVPL/pllum-12b-nc-instruct-250715" target="_blank">
570
  Hugging Face
571
  </a>
572
  </li>
573
  <li id="ref-3">
574
  LLaVA-NeXT -
575
+ <a href="https://llava-vl.github.io/blog/2024-01-30-llava-next/" target="_blank">
576
  Blog Post
577
  </a>
578
  </li>
579
  <li id="ref-4">
580
  SigLIP2 -
581
+ <a href="https://arxiv.org/abs/2502.14786" target="_blank">
582
  arXiv:2502.14786
583
  </a>
584
  </li>
585
  <li id="ref-5">
586
  ALLaVA -
587
+ <a href="https://arxiv.org/abs/2402.11684" target="_blank">
588
  arXiv:2402.11684
589
  </a>
590
  </li>
591
  <li id="ref-6">
592
  Visual Instruction Tuning (LLaVA) -
593
+ <a href="https://arxiv.org/abs/2304.08485" target="_blank">
594
  arXiv:2304.08485
595
  </a>
596
  </li>
597
  <li id="ref-7">
598
  Q-Instruct -
599
+ <a href="https://arxiv.org/abs/2311.06783" target="_blank">
600
  arXiv:2311.06783
601
  </a>
602
  </li>
603
  <li id="ref-8">
604
  LVIS-Instruct4V -
605
+ <a href="https://arxiv.org/abs/2311.07574" target="_blank">
606
  arXiv:2311.07574
607
  </a>
608
  </li>
609
  <li id="ref-9">
610
  A-OKVQA -
611
+ <a href="https://arxiv.org/abs/2206.01718" target="_blank">
612
  arXiv:2206.01718
613
  </a>
614
  </li>
615
  <li id="ref-10">
616
  SynthDoG -
617
+ <a href="https://arxiv.org/abs/2111.15664" target="_blank">
618
  arXiv:2111.15664
619
  </a>
620
  </li>
621
  <li id="ref-11">
622
  MS COCO -
623
+ <a href="https://arxiv.org/abs/1405.0312" target="_blank">
624
  arXiv:1405.0312
625
  </a>
626
  </li>
627
  <li id="ref-12">
628
  WIT Dataset -
629
+ <a href="https://doi.org/10.1145/3404835.3463257" target="_blank">
630
  ACM Digital Library
631
  </a>
632
  </li>
633
  <li id="ref-13">
634
  TallyQA -
635
+ <a href="https://arxiv.org/abs/1810.12440" target="_blank">
636
  arXiv:1810.12440
637
  </a>
638
  </li>
639
  <li id="ref-14">
640
  Tower+ Translation Model -
641
+ <a href="https://huggingface.co/Unbabel/Tower-Plus-72B" target="_blank">
642
  Hugging Face
643
  </a>
644
  </li>
645
  <li id="ref-15">
646
  COMET Metric -
647
+ <a href="https://unbabel.github.io/COMET/html/index.html" target="_blank">
648
  Documentation
649
  </a>
650
  </li>
651
  <li id="ref-16">
652
  LLaVA-Pretrain Dataset -
653
+ <a href="https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain" target="_blank">
654
  Hugging Face
655
  </a>
656
  </li>
657
  <li id="ref-17">
658
  MMBench -
659
+ <a href="https://huggingface.co/spaces/opencompass/open_vlm_leaderboard" target="_blank">
660
  OpenCompass Leaderboard
661
  </a>
662
  </li>
663
  <li id="ref-18">
664
  Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset -
665
+ <a href="https://aclanthology.org/2022.emnlp-main.45/" target="_blank">
666
  EMNLP 2022
667
  </a>
668
  </li>
669
  <li id="ref-19">
670
  Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) -
671
+ <a href="https://arxiv.org/abs/2310.03744" target="_blank">
672
  arXiv:2310.03744
673
  </a>
674
+ <li id="ref-20">
675
+ Bielik 11B v2 Technical Report -
676
+ <a href="https://arxiv.org/abs/2505.02410" target="_blank">
677
+ arXiv:2505.02410
678
+ </a>
679
+ </li>
680
  </ol>
681
  </div>
682
  </div>
 
687
  <div class="column is-full-width">
688
  <h2 class="title is-3">BibTeX</h2>
689
  <pre><code>
690
+ @inproceedings{statkiewicz2026annotation,
691
+ title = {Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework},
692
+ author = {Statkiewicz, Grzegorz and
693
+ Dobrzeniecka, Alicja and
694
+ Seweryn, Karolina and
695
+ Krasnodębska, Aleksandra and
696
+ Piosek, Karolina and
697
+ Bogusz, Katarzyna and
698
+ Cygert, Sebastian and
699
+ Kusa, Wojciech},
700
+ booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop},
701
+ month = mar,
702
+ year = {2026},
703
+ address = {Rabat, Morocco},
704
+ publisher = {Association for Computational Linguistics}
705
  }
706
+
707
  </code></pre>
708
  </div>
709
  </div>
 
718
  <p>
719
  This website is adapted from <a href="https://github.com/nerfies/nerfies.github.io" target="_blank">Nerfies</a>, licensed
720
  under a
721
+ <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative Commons Attribution-ShareAlike
722
  4.0 International License</a>.
723
  </p>
724
  </div>
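
Review note: the headline gain in the updated Key Finding is the difference between two MMBench v1.1 Polish accuracies, so it is an absolute (percentage point) gain rather than a relative one. The following is a minimal Python sketch for rechecking it, assuming the accuracies are copied correctly from the updated results table in this diff. The names `SCORES`, `gain_pp`, and `gain_relative` are hypothetical helpers introduced here for illustration and are not part of the commit.

```python
# MMBench v1.1 accuracies (%) as (Polish, English), copied from the
# updated results table in this diff. Assumption: values transcribed as shown.
SCORES = {
    "LLaVA-1.6-Mistral-7B": (68.18, 76.54),
    "LLaVA-1.6-Vicuna-13B": (69.80, 74.39),
    "LLaVA-PLLuM-12b-nc-250715": (76.73, 75.23),
    "LLaVA-Bielik-11b-v2.6": (78.24, 77.75),
    "LLaVA-PLLuM-12b-nc": (79.35, 78.43),
}

POLISH, ENGLISH = 0, 1  # column indices into the tuples above


def gain_pp(model: str, baseline: str, column: int = POLISH) -> float:
    """Absolute gain in percentage points (hypothetical helper)."""
    return round(SCORES[model][column] - SCORES[baseline][column], 2)


def gain_relative(model: str, baseline: str, column: int = POLISH) -> float:
    """Relative gain, in percent of the baseline score (hypothetical helper)."""
    base = SCORES[baseline][column]
    return round(100.0 * (SCORES[model][column] - base) / base, 2)


if __name__ == "__main__":
    best, ref = "LLaVA-PLLuM-12b-nc", "LLaVA-1.6-Vicuna-13B"
    print("Polish gain (pp):", gain_pp(best, ref))
    print("Polish gain (relative %):", gain_relative(best, ref))
    print("English gain (pp):", gain_pp(best, ref, ENGLISH))
```

Under these assumptions the absolute Polish gain is 9.55 percentage points, while the relative gain is noticeably larger (roughly 13.7%), which is why stating the unit explicitly avoids ambiguity.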