Add pipeline tag and paper link #1
by nielsr (HF Staff) - opened
README.md CHANGED

```diff
@@ -1,5 +1,4 @@
 ---
-license: cc-by-4.0
 language:
 - cs
 - pl
@@ -7,6 +6,7 @@ language:
 - sl
 - en
 library_name: transformers
+license: cc-by-4.0
 tags:
 - translation
 - mt
@@ -16,24 +16,26 @@ tags:
 - multilingual
 - allegro
 - laniqo
+pipeline_tag: translation
 ---
 
 # MultiSlav BiDi Models
 
-
 <p align="center">
 <a href="https://ml.allegro.tech/"><img src="allegro-title.svg" alt="MLR @ Allegro.com"></a>
 </p>
 
-##
+## Multilingual BiDi MT Models
 
-___BiDi___ is a collection of Encoder-Decoder vanilla transformer models trained on sentence-level Machine Translation task.
+___BiDi___ is a collection of Encoder-Decoder vanilla transformer models trained on sentence-level Machine Translation task.
 Each model is supporting Bi-Directional translation.
 
 ___BiDi___ models are part of the [___MultiSlav___ collection](https://huggingface.co/collections/allegro/multislav-6793d6b6419e5963e759a683). More information will be available soon in our upcoming MultiSlav paper.
 
+Paper: [](https://hf.co/papers/2502.14509).
+
 Experiments were conducted under research project by [Machine Learning Research](https://ml.allegro.tech/) lab for [Allegro.com](https://ml.allegro.tech/).
-Big thanks to [laniqo.com](laniqo.com) for cooperation in the research.
+Big thanks to [laniqo.com](laniqo.com) for cooperation in the research.
 
 <p align="center">
 <img src="bi-di.svg">
@@ -45,7 +47,7 @@ ___BiDi-ces-pol___ is a bi-directional model supporting translation both __form
 
 ### Supported languages
 
-To use a ___BiDi___ model, you must provide the target language for translation.
+To use a ___BiDi___ model, you must provide the target language for translation.
 Target language tokens are represented as 3-letter ISO 639-3 language codes embedded in a format >>xxx<<.
 All accepted directions and their respective tokens are listed below.
 Note that, for each model only two directions are available.
@@ -108,7 +110,7 @@ Generated Czech output:
 
 ## Training
 
-[SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocab size 32k in total (16k per language). Tokenizer was trained on randomly sampled part of the training corpus.
+[SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocab size 32k in total (16k per language). Tokenizer was trained on randomly sampled part of the training corpus.
 During the training we used the [MarianNMT](https://marian-nmt.github.io/) framework.
 Base marian configuration used: [transfromer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
 All training parameters are listed in table below.
@@ -141,10 +143,10 @@ All training parameters are listed in table below.
 
 ## Training corpora
 
-The main research question was: "How does adding additional, related languages impact the quality of the model?" - we explored it in the Slavic language family.
-___BiDi___ models are our baseline before expanding the data-regime by using higher-level multilinguality.
+The main research question was: "How does adding additional, related languages impact the quality of the model?" - we explored it in the Slavic language family.
+___BiDi___ models are our baseline before expanding the data-regime by using higher-level multilinguality.
 
-Datasets were downloaded via [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library.
+Datasets were downloaded via [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library.
 The number of total examples post filtering and deduplication varies, depending on languages supported, see the table below.
 
 | **Language pair** | **Number of training examples** |
@@ -216,7 +218,7 @@ The datasets used (only applicable to specific directions):
 
 ## Evaluation
 
-Evaluation of the models was performed on [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
+Evaluation of the models was performed on [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
 The table below compares performance of the open-source models and all applicable models from our collection.
 Metric used: Unbabel/wmt22-comet-da.
 
@@ -250,8 +252,6 @@ The model is licensed under CC BY 4.0, which allows for commercial use.
 
 ## Citation
 TO BE UPDATED SOON 🤗
 
-
-
 ## Contact Options
 
 Authors:
@@ -260,4 +260,4 @@ Authors:
 
 Please don't hesitate to contact authors if you have any questions or suggestions:
 - e-mail: artur.kot@allegro.com or mikolaj.koszowski@allegro.com
-- LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
+- LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
```
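The card being edited here says a ___BiDi___ model needs the target language supplied as a `>>xxx<<` ISO 639-3 token. A minimal sketch of that convention is below; the helper only formats the input string, and the commented `transformers` call (including the `allegro/BiDi-ces-pol` model id) is an assumption for illustration, not something stated in this PR:

```python
# Sketch of the ">>xxx<<" target-language-token convention from the model card.
# Actual translation would go through the Hugging Face `transformers` pipeline,
# e.g. (model id is an assumed example):
#
#   from transformers import pipeline
#   mt = pipeline("translation", model="allegro/BiDi-ces-pol")
#   mt(with_target_token("Allegro to internetowa platforma e-commerce.", "ces"))

def with_target_token(text: str, target_lang: str) -> str:
    """Prefix `text` with a >>xxx<< ISO 639-3 target-language token."""
    return f">>{target_lang}<< {text}"

# Polish source sentence, Czech (ces) as the requested target direction.
print(with_target_token("Allegro to internetowa platforma e-commerce.", "ces"))
```

Note that each ___BiDi___ model accepts only its own two direction tokens, so the token must match one of the pair's languages.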