Upload index.html
index.html CHANGED +146 -93
|
@@ -56,24 +56,35 @@
|
|
| 56 |
<div class="publication-links">
|
| 57 |
<!-- PDF Link. -->
|
| 58 |
<span class="link-block">
|
| 59 |
-
<a
|
|
|
|
| 60 |
<span class="icon">
|
| 61 |
<i class="ai ai-arxiv"></i>
|
| 62 |
</span>
|
| 63 |
-
<span>arXiv
|
| 64 |
</a>
|
| 65 |
</span>
|
| 66 |
-
<!--
|
| 67 |
<span class="link-block">
|
| 68 |
-
<a href="https://huggingface.co/NASK-PIB/
|
| 69 |
target="_blank"
|
| 70 |
class="external-link button is-normal is-rounded is-dark">
|
| 71 |
<span class="icon">
|
| 72 |
-
<i
|
| 73 |
</span>
|
| 74 |
-
<span>
|
| 75 |
</a>
|
| 76 |
</span>
|
| 77 |
</div>
|
| 78 |
</div>
|
| 79 |
</div>
|
|
@@ -106,6 +117,9 @@
|
|
| 106 |
<!-- Introduction. -->
|
| 107 |
<div class="columns is-centered" id="introduction">
|
| 108 |
<div class="column is-full-width">
|
|
|
|
|
|
|
|
|
|
| 109 |
<h2 class="title is-3">Introduction</h2>
|
| 110 |
<div class="content has-text-justified">
|
| 111 |
<p>
|
|
@@ -133,7 +147,8 @@
|
|
| 133 |
</li>
|
| 134 |
</ul>
|
| 135 |
<p>
|
| 136 |
-
We
|
|
|
|
| 137 |
approximately 550 thousand samples for pretraining and 2 million samples for instruction fine-tuning.
|
| 138 |
The models
|
| 139 |
accurately describe images, incorporate Polish cultural context, and handle basic visual tasks such as
|
|
@@ -170,8 +185,9 @@
|
|
| 170 |
have been observed to improve fine-grained perception and OCR performance.
|
| 171 |
</p>
|
| 172 |
<p>
|
| 173 |
-
As the language backbone, we use <strong>PLLuM-12B-nc-instruct-250715</strong>
|
| 174 |
-
<a href="#ref-1">[1]</a>,
|
|
|
|
| 175 |
the CLIP-like encoder commonly used in LLaVA variants with
|
| 176 |
<strong>SigLIP2 So400m/14, 384px</strong> <a href="#ref-4">[4]</a>, selected for its strong
|
| 177 |
multilingual image-text alignment.
|
|
@@ -316,11 +332,13 @@
|
|
| 316 |
<div class="columns is-centered" id="evaluation">
|
| 317 |
<div class="column is-full-width">
|
| 318 |
<h2 class="title is-3">Evaluation & Results</h2>
|
| 319 |
-
|
|
|
|
| 320 |
We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1)
|
| 321 |
quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality
|
| 322 |
in Polish.
|
| 323 |
-
|
|
|
|
| 324 |
<h3 class="title is-4">MMBench v1.1</h3>
|
| 325 |
<div class="content has-text-justified">
|
| 326 |
<p>
|
|
@@ -358,50 +376,65 @@
|
|
| 358 |
<tbody>
|
| 359 |
<tr>
|
| 360 |
<td>LLaVA-1.6-Mistral-7B</td>
|
| 361 |
-
<td>
|
| 362 |
-
<td>
|
| 363 |
</tr>
|
| 364 |
<tr>
|
| 365 |
<td>LLaVA-1.6-Vicuna-13B</td>
|
| 366 |
-
<td>
|
| 367 |
-
<td>74.
|
| 368 |
</tr>
|
| 369 |
<tr class="is-selected">
|
| 370 |
<td><strong>LLaVA-PLLuM-12b-nc (Ours)</strong></td>
|
| 371 |
-
<td><strong>
|
| 372 |
-
<td><strong>
|
| 373 |
</tr>
|
|
|
|
| 374 |
<tr class="has-background-light">
|
| 375 |
<td colspan="3" class="has-text-centered">
|
| 376 |
<em>Additional Open-Source Models (different architectures)</em>
|
| 377 |
</td>
|
| 378 |
</tr>
|
|
|
|
| 379 |
<tr>
|
| 380 |
-
<td>
|
| 381 |
-
<td>
|
| 382 |
-
<td>
|
| 383 |
</tr>
|
| 384 |
<tr>
|
| 385 |
-
<td>
|
| 386 |
-
<td>
|
| 387 |
-
<td>
|
| 388 |
</tr>
|
| 389 |
<tr>
|
| 390 |
-
<td>
|
| 391 |
-
<td>
|
| 392 |
-
<td>
|
| 393 |
</tr>
|
| 394 |
</tbody>
|
| 395 |
</table>
|
| 396 |
</div>
|
| 397 |
<p>
|
| 398 |
-
|
| 399 |
-
|
| 400 |
-
|
| 401 |
-
|
|
|
|
|
|
|
|
|
|
| 402 |
</div>
|
| 403 |
|
| 404 |
-
<h3 class="title is-4">
|
| 405 |
<div class="content has-text-justified">
|
| 406 |
<p>
|
| 407 |
To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
|
|
@@ -409,58 +442,63 @@
|
|
| 409 |
Polish portion of the XM3600 dataset [<a href="#ref-18">18</a>].
|
| 410 |
The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
|
| 411 |
descriptions of images, making it a suitable testbed for generative multimodal performance.
|
| 412 |
-
|
| 413 |
-
<p>
|
| 414 |
-
We benchmarked our model against three competitive open-source vision-language models of different
|
| 415 |
architectures: Qwen2.5-VL-7B-Instruct, Pixtral-12B, and PaliGemma-3B, complementing the MMBench
|
| 416 |
-
evaluation.
|
| 417 |
-
|
| 418 |
-
|
| 419 |
-
Because no Polish human-annotated standard for caption quality currently exists, we adopted an
|
| 420 |
-
LLM-as-a-judge evaluation strategy using LLaVA-OneVision-72B, the strongest open-source VLM at the
|
| 421 |
-
time of evaluation and capable of jointly processing the image and candidate captions.
|
| 422 |
-
We used a pairwise comparison setup in which the judge is presented with an image and two captions and
|
| 423 |
-
determines which description is better.
|
| 424 |
-
Since prompt wording and input order can influence the outcome, we employed two prompt
|
| 425 |
-
formulations—one presenting caption A before B and one reversing the order—and tested each with both
|
| 426 |
-
model assignments (our model as A and as B).
|
| 427 |
-
The resulting four judgments for each comparison were then averaged to obtain a stable final score.
|
| 428 |
-
</p>
|
| 429 |
-
<p>
|
| 430 |
-
Together, these steps provide a controlled and replicable protocol for assessing Polish-language
|
| 431 |
-
caption quality in the absence of human-annotated ground truth, while capturing the generative
|
| 432 |
-
multimodal capabilities of the evaluated models.
|
| 433 |
</p>
|
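The pairwise protocol described in the removed paragraph (two prompt formulations, each tested with both model assignments, four judgments averaged) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the `Judge` interface, `preference_score`, the prompt names, and both toy judges are hypothetical stand-ins for a call to the judge VLM.

```python
from itertools import product
from typing import Callable, Sequence

# Hypothetical judge interface: given an image reference, captions labelled
# A and B, and a prompt template, return "A" or "B" for the better caption.
Judge = Callable[[str, str, str, str], str]


def preference_score(judge: Judge, image: str, ours: str, baseline: str,
                     prompts: Sequence[str]) -> float:
    """Fraction of judgments won by our caption across prompts x assignments."""
    outcomes = []
    for prompt, ours_is_a in product(prompts, (True, False)):
        caption_a, caption_b = (ours, baseline) if ours_is_a else (baseline, ours)
        verdict = judge(image, caption_a, caption_b, prompt)
        # Our caption wins when the chosen label matches the label it was given.
        ours_won = (verdict == "A") == ours_is_a
        outcomes.append(1.0 if ours_won else 0.0)
    return sum(outcomes) / len(outcomes)


def always_a_judge(image: str, caption_a: str, caption_b: str, prompt: str) -> str:
    """Toy judge with pure label bias: it always answers "A"."""
    return "A"


def longer_caption_judge(image: str, caption_a: str, caption_b: str, prompt: str) -> str:
    """Toy judge that picks the longer caption, whatever its label."""
    return "A" if len(caption_a) > len(caption_b) else "B"


# Placeholder prompt names: one lists caption A first, the other caption B first.
PROMPTS = ("a_before_b", "b_before_a")

biased = preference_score(always_a_judge, "img.jpg", "ours", "base", PROMPTS)
content = preference_score(longer_caption_judge, "img.jpg",
                           "a longer caption from our model", "short", PROMPTS)
print(biased, content)
```

With two prompts and two assignments the loop yields four judgments per comparison. A judge that only follows labels ends up at 0.5, which is why averaging over both assignments is useful: label bias cancels out while a content-based preference survives.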
| 434 |
<div class="table-container">
|
| 435 |
<table class="table is-bordered is-striped is-hoverable is-fullwidth">
|
| 436 |
<thead>
|
| 437 |
<tr>
|
| 438 |
-
<th>
|
| 439 |
-
<th>
|
|
|
|
|
|
|
| 440 |
</tr>
|
| 441 |
</thead>
|
| 442 |
<tbody>
|
| 443 |
<tr>
|
| 444 |
-
<td>LLaVA-
|
| 445 |
-
<td>
|
| 446 |
</tr>
|
| 447 |
<tr>
|
| 448 |
-
<td>
|
| 449 |
-
<td>
|
|
|
|
|
|
|
| 450 |
</tr>
|
| 451 |
<tr>
|
| 452 |
-
<td>
|
| 453 |
-
<td>
|
|
|
|
|
|
|
| 454 |
</tr>
|
| 455 |
</tbody>
|
| 456 |
</table>
|
| 457 |
</div>
|
| 458 |
-
<p>
|
| 459 |
-
<strong>Key Finding:</strong> Across all comparisons, LLaVA-PLLuM is consistently preferred by the judge,
|
| 460 |
-
indicating higher caption quality in Polish. Our qualitative analysis showed that LLaVA-PLLuM produces more
|
| 461 |
-
grammatically correct sentences, maintains proper Polish morphology, and avoids inventing non-existent
|
| 462 |
-
Polish words—a common failure mode observed in baseline models.
|
| 463 |
-
</p>
|
| 464 |
</div>
|
| 465 |
</div>
|
| 466 |
</div>
|
|
@@ -497,7 +535,7 @@
|
|
| 497 |
machine-translated datasets, without human correction or manual
|
| 498 |
annotation. Starting from the open-source LLaVA model family and equipping it with the PLLuM language
|
| 499 |
model, we managed to improve the VLM's ability to understand the Polish language as well as aspects of
|
| 500 |
-
Polish cultural context. We show gains of
|
| 501 |
manually corrected Polish-language version of the MMBench dataset, underscoring the effectiveness of our data-efficient
|
| 502 |
approach.
|
| 503 |
</p>
|
|
@@ -522,118 +560,123 @@
|
|
| 522 |
<ol>
|
| 523 |
<li id="ref-1">
|
| 524 |
PLLuM: A Family of Polish Large Language Models -
|
| 525 |
-
<a href="https://arxiv.org/abs/2511.03823">
|
| 526 |
arXiv:2511.03823
|
| 527 |
</a>
|
| 528 |
</li>
|
| 529 |
<li id="ref-2">
|
| 530 |
PLLuM Model -
|
| 531 |
-
<a href="https://huggingface.co/CYFRAGOVPL/pllum-12b-nc-instruct-250715">
|
| 532 |
Hugging Face
|
| 533 |
</a>
|
| 534 |
</li>
|
| 535 |
<li id="ref-3">
|
| 536 |
LLaVA-NeXT -
|
| 537 |
-
<a href="https://llava-vl.github.io/blog/2024-01-30-llava-next/">
|
| 538 |
Blog Post
|
| 539 |
</a>
|
| 540 |
</li>
|
| 541 |
<li id="ref-4">
|
| 542 |
SigLIP2 -
|
| 543 |
-
<a href="https://arxiv.org/abs/2502.14786">
|
| 544 |
arXiv:2502.14786
|
| 545 |
</a>
|
| 546 |
</li>
|
| 547 |
<li id="ref-5">
|
| 548 |
ALLaVA -
|
| 549 |
-
<a href="https://arxiv.org/abs/2402.11684">
|
| 550 |
arXiv:2402.11684
|
| 551 |
</a>
|
| 552 |
</li>
|
| 553 |
<li id="ref-6">
|
| 554 |
Visual Instruction Tuning (LLaVA) -
|
| 555 |
-
<a href="https://arxiv.org/abs/2304.08485">
|
| 556 |
arXiv:2304.08485
|
| 557 |
</a>
|
| 558 |
</li>
|
| 559 |
<li id="ref-7">
|
| 560 |
Q-Instruct -
|
| 561 |
-
<a href="https://arxiv.org/abs/2311.06783">
|
| 562 |
arXiv:2311.06783
|
| 563 |
</a>
|
| 564 |
</li>
|
| 565 |
<li id="ref-8">
|
| 566 |
LVIS-Instruct4V -
|
| 567 |
-
<a href="https://arxiv.org/abs/2311.07574">
|
| 568 |
arXiv:2311.07574
|
| 569 |
</a>
|
| 570 |
</li>
|
| 571 |
<li id="ref-9">
|
| 572 |
A-OKVQA -
|
| 573 |
-
<a href="https://arxiv.org/abs/2206.01718">
|
| 574 |
arXiv:2206.01718
|
| 575 |
</a>
|
| 576 |
</li>
|
| 577 |
<li id="ref-10">
|
| 578 |
SynthDoG -
|
| 579 |
-
<a href="https://arxiv.org/abs/2111.15664">
|
| 580 |
arXiv:2111.15664
|
| 581 |
</a>
|
| 582 |
</li>
|
| 583 |
<li id="ref-11">
|
| 584 |
MS COCO -
|
| 585 |
-
<a href="https://arxiv.org/abs/1405.0312">
|
| 586 |
arXiv:1405.0312
|
| 587 |
</a>
|
| 588 |
</li>
|
| 589 |
<li id="ref-12">
|
| 590 |
WIT Dataset -
|
| 591 |
-
<a href="https://doi.org/10.1145/3404835.3463257">
|
| 592 |
ACM Digital Library
|
| 593 |
</a>
|
| 594 |
</li>
|
| 595 |
<li id="ref-13">
|
| 596 |
TallyQA -
|
| 597 |
-
<a href="https://arxiv.org/abs/1810.12440">
|
| 598 |
arXiv:1810.12440
|
| 599 |
</a>
|
| 600 |
</li>
|
| 601 |
<li id="ref-14">
|
| 602 |
Tower+ Translation Model -
|
| 603 |
-
<a href="https://huggingface.co/Unbabel/Tower-Plus-72B">
|
| 604 |
Hugging Face
|
| 605 |
</a>
|
| 606 |
</li>
|
| 607 |
<li id="ref-15">
|
| 608 |
COMET Metric -
|
| 609 |
-
<a href="https://unbabel.github.io/COMET/html/index.html">
|
| 610 |
Documentation
|
| 611 |
</a>
|
| 612 |
</li>
|
| 613 |
<li id="ref-16">
|
| 614 |
LLaVA-Pretrain Dataset -
|
| 615 |
-
<a href="https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain">
|
| 616 |
Hugging Face
|
| 617 |
</a>
|
| 618 |
</li>
|
| 619 |
<li id="ref-17">
|
| 620 |
MMBench -
|
| 621 |
-
<a href="https://huggingface.co/spaces/opencompass/open_vlm_leaderboard">
|
| 622 |
OpenCompass Leaderboard
|
| 623 |
</a>
|
| 624 |
</li>
|
| 625 |
<li id="ref-18">
|
| 626 |
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset -
|
| 627 |
-
<a href="https://aclanthology.org/2022.emnlp-main.45/">
|
| 628 |
EMNLP 2022
|
| 629 |
</a>
|
| 630 |
</li>
|
| 631 |
<li id="ref-19">
|
| 632 |
Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) -
|
| 633 |
-
<a href="https://arxiv.org/abs/2310.03744">
|
| 634 |
arXiv:2310.03744
|
| 635 |
</a>
|
| 636 |
-
|
| 637 |
</ol>
|
| 638 |
</div>
|
| 639 |
</div>
|
|
@@ -644,13 +687,23 @@
|
|
| 644 |
<div class="column is-full-width">
|
| 645 |
<h2 class="title is-3">BibTeX</h2>
|
| 646 |
<pre><code>
|
| 647 |
-
@
|
| 648 |
-
title={
|
| 649 |
-
author={Statkiewicz, Grzegorz and
|
| 650 |
-
|
| 651 |
-
|
| 652 |
-
|
| 653 |
}
|
|
|
|
| 654 |
</code></pre>
|
| 655 |
</div>
|
| 656 |
</div>
|
|
@@ -665,7 +718,7 @@
|
|
| 665 |
<p>
|
| 666 |
This website is adapted from <a href="https://github.com/nerfies/nerfies.github.io" target="_blank">Nerfies</a>, licensed
|
| 667 |
under a
|
| 668 |
-
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike
|
| 669 |
4.0 International License</a>.
|
| 670 |
</p>
|
| 671 |
</div>
|
|
|
|
| 56 |
<div class="publication-links">
|
| 57 |
<!-- PDF Link. -->
|
| 58 |
<span class="link-block">
|
| 59 |
+
<a href="https://arxiv.org/abs/2602.14073"
|
| 60 |
+
class="external-link button is-normal is-rounded is-dark">
|
| 61 |
<span class="icon">
|
| 62 |
<i class="ai ai-arxiv"></i>
|
| 63 |
</span>
|
| 64 |
+
<span>arXiv</span>
|
| 65 |
</a>
|
| 66 |
</span>
|
| 67 |
+
<!-- Model Link. -->
|
| 68 |
<span class="link-block">
|
| 69 |
+
<a href="https://huggingface.co/collections/NASK-PIB/llava-pllum"
|
| 70 |
target="_blank"
|
| 71 |
class="external-link button is-normal is-rounded is-dark">
|
| 72 |
<span class="icon">
|
| 73 |
+
<i class="fas fa-share-square"></i>
|
| 74 |
</span>
|
| 75 |
+
<span>Models</span>
|
| 76 |
</a>
|
| 77 |
</span>
|
| 78 |
+
<!-- Dataset Link. -->
|
| 79 |
+
<span class="link-block">
|
| 80 |
+
<a href="https://huggingface.co/datasets/NASK-PIB/MMBench_V11_PL"
|
| 81 |
+
target="_blank"
|
| 82 |
+
class="external-link button is-normal is-rounded is-dark">
|
| 83 |
+
<span class="icon">
|
| 84 |
+
<i class="fas fa-database"></i>
|
| 85 |
+
</span>
|
| 86 |
+
<span>Dataset</span>
|
| 87 |
+
</a></span>
|
| 88 |
</div>
|
| 89 |
</div>
|
| 90 |
</div>
|
|
|
|
| 117 |
<!-- Introduction. -->
|
| 118 |
<div class="columns is-centered" id="introduction">
|
| 119 |
<div class="column is-full-width">
|
| 120 |
+
<div class="notification is-info is-light">
|
| 121 |
+
<p><strong>🔥Update Feb 2026:</strong> Added two new models—NASK-PIB/LLaVA-PLLuM-12b-nc-instruct and NASK-PIB/LLaVA-Bielik-11b-v2.6-instruct—and released our Polish translation of the MMBench dataset [<a href="https://huggingface.co/collections/NASK-PIB/llava-pllum">HuggingFace</a>]</p>
|
| 122 |
+
</div>
|
| 123 |
<h2 class="title is-3">Introduction</h2>
|
| 124 |
<div class="content has-text-justified">
|
| 125 |
<p>
|
|
|
|
| 147 |
</li>
|
| 148 |
</ul>
|
| 149 |
<p>
|
| 150 |
+
We also train a model based on the Bielik-11B-v2.6 language model <a href="#ref-20">[20]</a> as an alternative backbone to explore the impact of different LLMs on Polish multimodal performance.
|
| 151 |
+
We train our models using automatic translation combined with manual filtering, resulting in
|
| 152 |
approximately 550 thousand samples for pretraining and 2 million samples for instruction fine-tuning.
|
| 153 |
The models
|
| 154 |
accurately describe images, incorporate Polish cultural context, and handle basic visual tasks such as
|
|
|
|
| 185 |
have been observed to improve fine-grained perception and OCR performance.
|
| 186 |
</p>
|
| 187 |
<p>
|
| 188 |
+
As the language backbone, we use three different Polish-native, instruction-tuned LLMs: <strong>PLLuM-12B-nc-instruct-250715</strong>
|
| 189 |
+
<a href="#ref-1">[1]</a>, <strong>PLLuM-12B-nc-instruct</strong>, and <strong>Bielik-11B-v2.6</strong> <a href="#ref-20">[20]</a>.
|
| 190 |
+
For the vision tower, we replace
|
| 191 |
the CLIP-like encoder commonly used in LLaVA variants with
|
| 192 |
<strong>SigLIP2 So400m/14, 384px</strong> <a href="#ref-4">[4]</a>, selected for its strong
|
| 193 |
multilingual image-text alignment.
|
|
|
|
| 332 |
<div class="columns is-centered" id="evaluation">
|
| 333 |
<div class="column is-full-width">
|
| 334 |
<h2 class="title is-3">Evaluation & Results</h2>
|
| 335 |
+
<div class="content has-text-justified">
|
| 336 |
+
<p>
|
| 337 |
We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1)
|
| 338 |
quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality
|
| 339 |
in Polish.
|
| 340 |
+
</p>
|
| 341 |
+
</div>
|
| 342 |
<h3 class="title is-4">MMBench v1.1</h3>
|
| 343 |
<div class="content has-text-justified">
|
| 344 |
<p>
|
|
|
|
| 376 |
<tbody>
|
| 377 |
<tr>
|
| 378 |
<td>LLaVA-1.6-Mistral-7B</td>
|
| 379 |
+
<td>68.18%</td>
|
| 380 |
+
<td>76.54%</td>
|
| 381 |
</tr>
|
| 382 |
<tr>
|
| 383 |
<td>LLaVA-1.6-Vicuna-13B</td>
|
| 384 |
+
<td>69.80%</td>
|
| 385 |
+
<td>74.39%</td>
|
| 386 |
+
</tr>
|
| 387 |
+
<tr>
|
| 388 |
+
<td>LLaVA-PLLuM-12b-nc-250715 (Ours)</td>
|
| 389 |
+
<td>76.73%</td>
|
| 390 |
+
<td>75.23%</td>
|
| 391 |
+
</tr>
|
| 392 |
+
<tr>
|
| 393 |
+
<td>LLaVA-Bielik-11b-v2.6 (Ours)</td>
|
| 394 |
+
<td>78.24%</td>
|
| 395 |
+
<td>77.75%</td>
|
| 396 |
</tr>
|
| 397 |
<tr class="is-selected">
|
| 398 |
<td><strong>LLaVA-PLLuM-12b-nc (Ours)</strong></td>
|
| 399 |
+
<td><strong>79.35%</strong> <span class="tag is-success">+9.55 pp</span></td>
|
| 400 |
+
<td><strong>78.43%</strong></td>
|
| 401 |
</tr>
|
| 402 |
+
|
| 403 |
<tr class="has-background-light">
|
| 404 |
<td colspan="3" class="has-text-centered">
|
| 405 |
<em>Additional Open-Source Models (different architectures)</em>
|
| 406 |
</td>
|
| 407 |
</tr>
|
| 408 |
+
|
| 409 |
<tr>
|
| 410 |
+
<td>Qwen2.5-VL-7B</td>
|
| 411 |
+
<td>75.56%</td>
|
| 412 |
+
<td>80.62%</td>
|
| 413 |
</tr>
|
| 414 |
<tr>
|
| 415 |
+
<td>PaliGemma2-10B</td>
|
| 416 |
+
<td>78.39%</td>
|
| 417 |
+
<td>80.46%</td>
|
| 418 |
</tr>
|
| 419 |
<tr>
|
| 420 |
+
<td>Pixtral-12B</td>
|
| 421 |
+
<td>82.06%</td>
|
| 422 |
+
<td>84.31%</td>
|
| 423 |
</tr>
|
| 424 |
</tbody>
|
| 425 |
</table>
|
| 426 |
</div>
|
| 427 |
<p>
|
| 428 |
+
<strong>Key Finding:</strong> Our best model achieves <strong>79.35%</strong> on the Polish MMBench
|
| 429 |
+
v1.1 benchmark, representing a <strong>+9.55 percentage-point improvement</strong> over LLaVA-1.6-Vicuna-13B (69.80%)
|
| 430 |
+
while maintaining strong English performance at 78.43%. This demonstrates improved
|
| 431 |
+
recognition of Polish context and linguistic understanding. When compared to other open-source models,
|
| 432 |
+
LLaVA-PLLuM scores higher on the Polish benchmark than Qwen2.5-VL-7B (75.56%)
|
| 433 |
+
and PaliGemma2-10B (78.39%), although Pixtral-12B (82.06%) remains ahead.
|
| 434 |
+
</p>
|
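The gain quoted in the Key Finding is an absolute difference between two accuracies, so it is a percentage-point figure and not a relative percentage. A small sketch of the arithmetic, using the two Polish MMBench v1.1 scores from the table; the function and variable names are illustrative only, and the relative-gain number is derived here rather than reported by the authors.

```python
# Scores on the Polish MMBench v1.1 split, copied from the results table.
polish_scores = {
    "LLaVA-1.6-Vicuna-13B": 69.80,
    "LLaVA-PLLuM-12b-nc (Ours)": 79.35,
}


def percentage_point_gain(new: float, old: float) -> float:
    """Absolute difference between two accuracies that are given in percent."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, expressed in percent."""
    return (new - old) / old * 100.0


baseline = polish_scores["LLaVA-1.6-Vicuna-13B"]
ours = polish_scores["LLaVA-PLLuM-12b-nc (Ours)"]

pp_gain = round(percentage_point_gain(ours, baseline), 2)
rel_gain = round(relative_gain(ours, baseline), 2)

print(f"gain: {pp_gain} percentage points ({rel_gain}% relative)")
```

Keeping the two quantities in separate functions makes it harder to mix them up when the table is extended with further models.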
| 435 |
</div>
|
| 436 |
|
| 437 |
+
<h3 class="title is-4">Open-ended Generation Evaluation</h3>
|
| 438 |
<div class="content has-text-justified">
|
| 439 |
<p>
|
| 440 |
To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
|
|
|
|
| 442 |
Polish portion of the XM3600 dataset [<a href="#ref-18">18</a>].
|
| 443 |
The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
|
| 444 |
descriptions of images, making it a suitable testbed for generative multimodal performance.
|
| 445 |
+
We benchmarked our models against three competitive open-source vision-language models of different
|
|
|
|
|
|
|
| 446 |
architectures: Qwen2.5-VL-7B-Instruct, Pixtral-12B, and PaliGemma2-10B, complementing the MMBench
|
| 447 |
+
evaluation. Because no Polish human-annotated standard for caption quality currently exists, we adopted a threefold evaluation strategy:
|
| 448 |
+
(1) open-source LLM and VLM judges, (2) a closed-source VLM judge, and (3) human evaluation.
|
| 449 |
+
Please refer to the <a href="https://arxiv.org/pdf/2602.14073" target="_blank">full paper</a> for complete details of the evaluation methodology and results.
|
| 450 |
</p>
|
| 451 |
<div class="table-container">
|
| 452 |
<table class="table is-bordered is-striped is-hoverable is-fullwidth">
|
| 453 |
<thead>
|
| 454 |
<tr>
|
| 455 |
+
<th>Baseline</th>
|
| 456 |
+
<th>LLaVA-PLLuM-12B-nc-250715</th>
|
| 457 |
+
<th>LLaVA-PLLuM-12B-nc</th>
|
| 458 |
+
<th>LLaVA-Bielik-11B-v2.6</th>
|
| 459 |
</tr>
|
| 460 |
</thead>
|
| 461 |
<tbody>
|
| 462 |
<tr>
|
| 463 |
+
<td>LLaVA-1.6-Mistral-7B</td>
|
| 464 |
+
<td>84.91%</td>
|
| 465 |
+
<td><strong>85.81%</strong></td>
|
| 466 |
+
<td>82.35%</td>
|
| 467 |
+
</tr>
|
| 468 |
+
<tr>
|
| 469 |
+
<td>LLaVA-1.6-Vicuna-13B</td>
|
| 470 |
+
<td>63.64%</td>
|
| 471 |
+
<td><strong>66.71%</strong></td>
|
| 472 |
+
<td>60.32%</td>
|
| 473 |
+
</tr>
|
| 474 |
+
<tr>
|
| 475 |
+
<td>PaliGemma2-10B</td>
|
| 476 |
+
<td>77.47%</td>
|
| 477 |
+
<td><strong>77.53%</strong></td>
|
| 478 |
+
<td>74.10%</td>
|
| 479 |
</tr>
|
| 480 |
<tr>
|
| 481 |
+
<td>Pixtral-12B</td>
|
| 482 |
+
<td>43.38%</td>
|
| 483 |
+
<td>48.33%</td>
|
| 484 |
+
<td>40.31%</td>
|
| 485 |
</tr>
|
| 486 |
<tr>
|
| 487 |
+
<td>Qwen2.5-VL-7B</td>
|
| 488 |
+
<td>42.69%</td>
|
| 489 |
+
<td>43.15%</td>
|
| 490 |
+
<td>34.76%</td>
|
| 491 |
</tr>
|
| 492 |
</tbody>
|
| 493 |
+
<tfoot>
|
| 494 |
+
<tr class="has-background-light">
|
| 495 |
+
<td colspan="4" class="has-text-centered">
|
| 496 |
+
<em>Preference rate (%) of our models over each baseline on the XM3600 dataset, as judged by an LLM (Llama-3.3-70B-Instruct) for the linguistic correctness of the descriptions.</em>
|
| 497 |
+
</td>
|
| 498 |
+
</tr>
|
| 499 |
+
</tfoot>
|
| 500 |
</table>
|
| 501 |
</div>
|
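One way to read the preference table: a rate above 50% means the judge preferred our model's description more often than the baseline's. The sketch below applies that reading to the values copied from the table. The 50% threshold interpretation and the helper name `preferred_over` are assumptions made for illustration, not part of the published evaluation.

```python
# Preference rates (%) copied from the table. Each row is a baseline; the
# tuple holds the rates for our three models, in the column order of the table.
OUR_MODELS = (
    "LLaVA-PLLuM-12B-nc-250715",
    "LLaVA-PLLuM-12B-nc",
    "LLaVA-Bielik-11B-v2.6",
)

preference = {
    "LLaVA-1.6-Mistral-7B": (84.91, 85.81, 82.35),
    "LLaVA-1.6-Vicuna-13B": (63.64, 66.71, 60.32),
    "PaliGemma2-10B": (77.47, 77.53, 74.10),
    "Pixtral-12B": (43.38, 48.33, 40.31),
    "Qwen2.5-VL-7B": (42.69, 43.15, 34.76),
}


def preferred_over(rates: dict[str, tuple[float, ...]], column: int) -> list[str]:
    """Baselines against which the model in `column` wins more than half the time."""
    return [name for name, row in rates.items() if row[column] > 50.0]


for column, model in enumerate(OUR_MODELS):
    wins = preferred_over(preference, column)
    print(f"{model}: preferred over {len(wins)} of {len(preference)} baselines")
```

Under this reading, all three of our models are preferred over the two LLaVA-1.6 baselines and PaliGemma2-10B, while Pixtral-12B and Qwen2.5-VL-7B stay below the 50% mark.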
| 502 |
</div>
|
| 503 |
</div>
|
| 504 |
</div>
|
|
|
|
| 535 |
machine-translated datasets, without human correction or manual
|
| 536 |
annotation. Starting from the open-source LLaVA model family and equipping it with the PLLuM language
|
| 537 |
model, we managed to improve the VLM's ability to understand the Polish language as well as aspects of
|
| 538 |
+
Polish cultural context. We show gains of 9.5 percentage points over LLaVA-based baselines on a
|
| 539 |
manually corrected Polish-language version of the MMBench dataset, underscoring the effectiveness of our data-efficient
|
| 540 |
approach.
|
| 541 |
</p>
|
|
|
|
| 560 |
<ol>
|
| 561 |
<li id="ref-1">
|
| 562 |
PLLuM: A Family of Polish Large Language Models -
|
| 563 |
+
<a href="https://arxiv.org/abs/2511.03823" target="_blank">
|
| 564 |
arXiv:2511.03823
|
| 565 |
</a>
|
| 566 |
</li>
|
| 567 |
<li id="ref-2">
|
| 568 |
PLLuM Model -
|
| 569 |
+
<a href="https://huggingface.co/CYFRAGOVPL/pllum-12b-nc-instruct-250715" target="_blank">
|
| 570 |
Hugging Face
|
| 571 |
</a>
|
| 572 |
</li>
|
| 573 |
<li id="ref-3">
|
| 574 |
LLaVA-NeXT -
|
| 575 |
+
<a href="https://llava-vl.github.io/blog/2024-01-30-llava-next/" target="_blank">
|
| 576 |
Blog Post
|
| 577 |
</a>
|
| 578 |
</li>
|
| 579 |
<li id="ref-4">
|
| 580 |
SigLIP2 -
|
| 581 |
+
<a href="https://arxiv.org/abs/2502.14786" target="_blank">
|
| 582 |
arXiv:2502.14786
|
| 583 |
</a>
|
| 584 |
</li>
|
| 585 |
<li id="ref-5">
|
| 586 |
ALLaVA -
|
| 587 |
+
<a href="https://arxiv.org/abs/2402.11684" target="_blank">
|
| 588 |
arXiv:2402.11684
|
| 589 |
</a>
|
| 590 |
</li>
|
| 591 |
<li id="ref-6">
|
| 592 |
Visual Instruction Tuning (LLaVA) -
|
| 593 |
+
<a href="https://arxiv.org/abs/2304.08485" target="_blank">
|
| 594 |
arXiv:2304.08485
|
| 595 |
</a>
|
| 596 |
</li>
|
| 597 |
<li id="ref-7">
|
| 598 |
Q-Instruct -
|
| 599 |
+
<a href="https://arxiv.org/abs/2311.06783" target="_blank">
|
| 600 |
arXiv:2311.06783
|
| 601 |
</a>
|
| 602 |
</li>
|
| 603 |
<li id="ref-8">
|
| 604 |
LVIS-Instruct4V -
|
| 605 |
+
<a href="https://arxiv.org/abs/2311.07574" target="_blank">
|
| 606 |
arXiv:2311.07574
|
| 607 |
</a>
|
| 608 |
</li>
|
| 609 |
<li id="ref-9">
|
| 610 |
A-OKVQA -
|
| 611 |
+
<a href="https://arxiv.org/abs/2206.01718" target="_blank">
|
| 612 |
arXiv:2206.01718
|
| 613 |
</a>
|
| 614 |
</li>
|
| 615 |
<li id="ref-10">
|
| 616 |
SynthDoG -
|
| 617 |
+
<a href="https://arxiv.org/abs/2111.15664" target="_blank">
|
| 618 |
arXiv:2111.15664
|
| 619 |
</a>
|
| 620 |
</li>
|
| 621 |
<li id="ref-11">
|
| 622 |
MS COCO -
|
| 623 |
+
<a href="https://arxiv.org/abs/1405.0312" target="_blank">
|
| 624 |
arXiv:1405.0312
|
| 625 |
</a>
|
| 626 |
</li>
|
| 627 |
<li id="ref-12">
|
| 628 |
WIT Dataset -
|
| 629 |
+
<a href="https://doi.org/10.1145/3404835.3463257" target="_blank">
|
| 630 |
ACM Digital Library
|
| 631 |
</a>
|
| 632 |
</li>
|
| 633 |
<li id="ref-13">
|
| 634 |
TallyQA -
|
| 635 |
+
<a href="https://arxiv.org/abs/1810.12440" target="_blank">
|
| 636 |
arXiv:1810.12440
|
| 637 |
</a>
|
| 638 |
</li>
|
| 639 |
<li id="ref-14">
|
| 640 |
Tower+ Translation Model -
|
| 641 |
+
<a href="https://huggingface.co/Unbabel/Tower-Plus-72B" target="_blank">
|
| 642 |
Hugging Face
|
| 643 |
</a>
|
| 644 |
</li>
|
| 645 |
<li id="ref-15">
|
| 646 |
COMET Metric -
|
| 647 |
+
<a href="https://unbabel.github.io/COMET/html/index.html" target="_blank">
|
| 648 |
Documentation
|
| 649 |
</a>
|
| 650 |
</li>
|
| 651 |
<li id="ref-16">
|
| 652 |
LLaVA-Pretrain Dataset -
|
| 653 |
+
<a href="https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain" target="_blank">
|
| 654 |
Hugging Face
|
| 655 |
</a>
|
| 656 |
</li>
|
| 657 |
<li id="ref-17">
|
| 658 |
MMBench -
|
| 659 |
+
<a href="https://huggingface.co/spaces/opencompass/open_vlm_leaderboard" target="_blank">
|
| 660 |
OpenCompass Leaderboard
|
| 661 |
</a>
|
| 662 |
</li>
|
| 663 |
<li id="ref-18">
|
| 664 |
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset -
|
| 665 |
+
<a href="https://aclanthology.org/2022.emnlp-main.45/" target="_blank">
|
| 666 |
EMNLP 2022
|
| 667 |
</a>
|
| 668 |
</li>
|
| 669 |
<li id="ref-19">
|
| 670 |
Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) -
|
| 671 |
+
<a href="https://arxiv.org/abs/2310.03744" target="_blank">
|
| 672 |
arXiv:2310.03744
|
| 673 |
</a>
|
| 674 |
+
</li><li id="ref-20">
|
| 675 |
+
Bielik 11B v2 Technical Report -
|
| 676 |
+
<a href="https://arxiv.org/abs/2505.02410" target="_blank">
|
| 677 |
+
arXiv:2505.02410
|
| 678 |
+
</a>
|
| 679 |
+
</li>
|
| 680 |
</ol>
|
| 681 |
</div>
|
| 682 |
</div>
|
|
|
|
| 687 |
<div class="column is-full-width">
|
| 688 |
<h2 class="title is-3">BibTeX</h2>
|
| 689 |
<pre><code>
|
| 690 |
+
@inproceedings{statkiewicz2026annotation,
|
| 691 |
+
title = {Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework},
|
| 692 |
+
author = {Statkiewicz, Grzegorz and
|
| 693 |
+
Dobrzeniecka, Alicja and
|
| 694 |
+
Seweryn, Karolina and
|
| 695 |
+
Krasnodębska, Aleksandra and
|
| 696 |
+
Piosek, Karolina and
|
| 697 |
+
Bogusz, Katarzyna and
|
| 698 |
+
Cygert, Sebastian and
|
| 699 |
+
Kusa, Wojciech},
|
| 700 |
+
booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop},
|
| 701 |
+
month = mar,
|
| 702 |
+
year = {2026},
|
| 703 |
+
address = {Rabat, Morocco},
|
| 704 |
+
publisher = {Association for Computational Linguistics}
|
| 705 |
}
|
| 706 |
+
|
| 707 |
</code></pre>
|
| 708 |
</div>
|
| 709 |
</div>
|
|
|
|
| 718 |
<p>
|
| 719 |
This website is adapted from <a href="https://github.com/nerfies/nerfies.github.io" target="_blank">Nerfies</a>, licensed
|
| 720 |
under a
|
| 721 |
+
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative Commons Attribution-ShareAlike
|
| 722 |
4.0 International License</a>.
|
| 723 |
</p>
|
| 724 |
</div>
|