Add pipeline tag and paper link #1
by nielsr (HF Staff) - opened
README.md CHANGED

```diff
@@ -1,5 +1,4 @@
 ---
-license: cc-by-4.0
 language:
 - cs
 - pl
@@ -7,6 +6,7 @@ language:
 - sl
 - en
 library_name: transformers
+license: cc-by-4.0
 tags:
 - translation
 - mt
@@ -16,24 +16,26 @@ tags:
 - multilingual
 - allegro
 - laniqo
+pipeline_tag: translation
 ---
 
 # MultiSlav BiDi Models
 
-
 <p align="center">
 <a href="https://ml.allegro.tech/"><img src="allegro-title.svg" alt="MLR @ Allegro.com"></a>
 </p>
 
-##
+## Multilingual BiDi MT Models
 
-___BiDi___ is a collection of Encoder-Decoder vanilla transformer models trained on sentence-level Machine Translation task.
+___BiDi___ is a collection of Encoder-Decoder vanilla transformer models trained on sentence-level Machine Translation task.
 Each model is supporting Bi-Directional translation.
 
 ___BiDi___ models are part of the [___MultiSlav___ collection](https://huggingface.co/collections/allegro/multislav-6793d6b6419e5963e759a683). More information will be available soon in our upcoming MultiSlav paper.
 
+Paper: [](https://hf.co/papers/2502.14509).
+
 Experiments were conducted under research project by [Machine Learning Research](https://ml.allegro.tech/) lab for [Allegro.com](https://ml.allegro.tech/).
-Big thanks to [laniqo.com](laniqo.com) for cooperation in the research.
+Big thanks to [laniqo.com](laniqo.com) for cooperation in the research.
 
 <p align="center">
 <img src="bi-di.svg">
@@ -45,7 +47,7 @@ ___BiDi-ces-pol___ is a bi-directional model supporting translation both __form
 
 ### Supported languages
 
-To use a ___BiDi___ model, you must provide the target language for translation.
+To use a ___BiDi___ model, you must provide the target language for translation.
 Target language tokens are represented as 3-letter ISO 639-3 language codes embedded in a format >>xxx<<.
 All accepted directions and their respective tokens are listed below.
 Note that, for each model only two directions are available.
@@ -108,7 +110,7 @@ Generated Czech output:
 
 ## Training
 
-[SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocab size 32k in total (16k per language). Tokenizer was trained on randomly sampled part of the training corpus.
+[SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocab size 32k in total (16k per language). Tokenizer was trained on randomly sampled part of the training corpus.
 During the training we used the [MarianNMT](https://marian-nmt.github.io/) framework.
 Base marian configuration used: [transfromer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
 All training parameters are listed in table below.
@@ -141,10 +143,10 @@ All training parameters are listed in table below.
 
 ## Training corpora
 
-The main research question was: "How does adding additional, related languages impact the quality of the model?" - we explored it in the Slavic language family.
-___BiDi___ models are our baseline before expanding the data-regime by using higher-level multilinguality.
+The main research question was: "How does adding additional, related languages impact the quality of the model?" - we explored it in the Slavic language family.
+___BiDi___ models are our baseline before expanding the data-regime by using higher-level multilinguality.
 
-Datasets were downloaded via [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library.
+Datasets were downloaded via [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library.
 The number of total examples post filtering and deduplication varies, depending on languages supported, see the table below.
 
 | **Language pair** | **Number of training examples** |
@@ -216,7 +218,7 @@ The datasets used (only applicable to specific directions):
 
 ## Evaluation
 
-Evaluation of the models was performed on [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
+Evaluation of the models was performed on [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
 The table below compares performance of the open-source models and all applicable models from our collection.
 Metric used: Unbabel/wmt22-comet-da.
 
@@ -250,8 +252,6 @@ The model is licensed under CC BY 4.0, which allows for commercial use.
 
 ## Citation
 TO BE UPDATED SOON 🤗
 
-
-
 ## Contact Options
 
 Authors:
@@ -260,4 +260,4 @@ Authors:
 
 Please don't hesitate to contact authors if you have any questions or suggestions:
 - e-mail: artur.kot@allegro.com or mikolaj.koszowski@allegro.com
-- LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
+- LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
```
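The card being edited here says a ___BiDi___ model needs the target language supplied as a `>>xxx<<` ISO 639-3 token. A minimal sketch of that convention is below; the helper only formats the input string, and the commented `transformers` call (including the `allegro/BiDi-ces-pol` model id) is an assumption for illustration, not something stated in this PR:

```python
# Sketch of the ">>xxx<<" target-language-token convention from the model card.
# Actual translation would go through the Hugging Face `transformers` pipeline,
# e.g. (model id is an assumed example):
#
#   from transformers import pipeline
#   mt = pipeline("translation", model="allegro/BiDi-ces-pol")
#   mt(with_target_token("Allegro to internetowa platforma e-commerce.", "ces"))

def with_target_token(text: str, target_lang: str) -> str:
    """Prefix `text` with a >>xxx<< ISO 639-3 target-language token."""
    return f">>{target_lang}<< {text}"

# Polish source sentence, Czech (ces) as the requested target direction.
print(with_target_token("Allegro to internetowa platforma e-commerce.", "ces"))
```

Note that each ___BiDi___ model accepts only its own two direction tokens, so the token must match one of the pair's languages.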