Add pipeline tag and paper link

#1 opened by nielsr (HF Staff)

Files changed (1):
1. README.md +14 -14

README.md CHANGED
@@ -1,5 +1,4 @@
 ---
-license: cc-by-4.0
 language:
 - cs
 - pl
@@ -7,6 +6,7 @@ language:
 - sl
 - en
 library_name: transformers
+license: cc-by-4.0
 tags:
 - translation
 - mt
@@ -16,24 +16,26 @@ tags:
 - multilingual
 - allegro
 - laniqo
+pipeline_tag: translation
 ---
 
 # MultiSlav BiDi Models
 
-
 <p align="center">
 <a href="https://ml.allegro.tech/"><img src="allegro-title.svg" alt="MLR @ Allegro.com"></a>
 </p>
 
-## Multilingual BiDi MT Models
+## Multilingual BiDi MT Models
 
-___BiDi___ is a collection of Encoder-Decoder vanilla transformer models trained on sentence-level Machine Translation task.
+___BiDi___ is a collection of Encoder-Decoder vanilla transformer models trained on sentence-level Machine Translation task.
 Each model is supporting Bi-Directional translation.
 
 ___BiDi___ models are part of the [___MultiSlav___ collection](https://huggingface.co/collections/allegro/multislav-6793d6b6419e5963e759a683). More information will be available soon in our upcoming MultiSlav paper.
 
+Paper: [](https://hf.co/papers/2502.14509).
+
 Experiments were conducted under research project by [Machine Learning Research](https://ml.allegro.tech/) lab for [Allegro.com](https://ml.allegro.tech/).
-Big thanks to [laniqo.com](laniqo.com) for cooperation in the research.
+Big thanks to [laniqo.com](laniqo.com) for cooperation in the research.
 
 <p align="center">
 <img src="bi-di.svg">
@@ -45,7 +47,7 @@ ___BiDi-ces-pol___ is a bi-directional model supporting translation both __form
 
 ### Supported languages
 
-To use a ___BiDi___ model, you must provide the target language for translation.
+To use a ___BiDi___ model, you must provide the target language for translation.
 Target language tokens are represented as 3-letter ISO 639-3 language codes embedded in a format >>xxx<<.
 All accepted directions and their respective tokens are listed below.
 Note that, for each model only two directions are available.
@@ -108,7 +110,7 @@ Generated Czech output:
 
 ## Training
 
-[SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocab size 32k in total (16k per language). Tokenizer was trained on randomly sampled part of the training corpus.
+[SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocab size 32k in total (16k per language). Tokenizer was trained on randomly sampled part of the training corpus.
 During the training we used the [MarianNMT](https://marian-nmt.github.io/) framework.
 Base marian configuration used: [transfromer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
 All training parameters are listed in table below.
@@ -141,10 +143,10 @@ All training parameters are listed in table below.
 
 ## Training corpora
 
-The main research question was: "How does adding additional, related languages impact the quality of the model?" - we explored it in the Slavic language family.
-___BiDi___ models are our baseline before expanding the data-regime by using higher-level multilinguality.
+The main research question was: "How does adding additional, related languages impact the quality of the model?" - we explored it in the Slavic language family.
+___BiDi___ models are our baseline before expanding the data-regime by using higher-level multilinguality.
 
-Datasets were downloaded via [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library.
+Datasets were downloaded via [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library.
 The number of total examples post filtering and deduplication varies, depending on languages supported, see the table below.
 
 | **Language pair** | **Number of training examples** |
@@ -216,7 +218,7 @@ The datasets used (only applicable to specific directions):
 
 ## Evaluation
 
-Evaluation of the models was performed on [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
+Evaluation of the models was performed on [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
 The table below compares performance of the open-source models and all applicable models from our collection.
 Metric used: Unbabel/wmt22-comet-da.
 
@@ -250,8 +252,6 @@ The model is licensed under CC BY 4.0, which allows for commercial use.
 ## Citation
 TO BE UPDATED SOON 🤗
 
-
-
 ## Contact Options
 
 Authors:
@@ -260,4 +260,4 @@ Authors:
 
 Please don't hesitate to contact authors if you have any questions or suggestions:
 - e-mail: artur.kot@allegro.com or mikolaj.koszowski@allegro.com
-- LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
+- LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
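For context, the README hunks above describe a `>>xxx<<` target-language token that must prefix the input, and the PR adds `pipeline_tag: translation` so the model surfaces under the transformers translation pipeline. A minimal sketch of that token convention follows; the repo id `allegro/BiDi-ces-pol` in the commented pipeline call is an assumption for illustration, and only the prefixing helper is implied directly by the README text.

```python
# Sketch of the >>xxx<< target-language token convention described in the
# README diff. BiDi-ces-pol supports exactly two directions, so only two
# ISO 639-3 target codes are accepted here.
SUPPORTED = {"ces", "pol"}


def with_target_token(text: str, target_lang: str) -> str:
    """Prepend the target-language token expected by BiDi models."""
    if target_lang not in SUPPORTED:
        raise ValueError(f"unsupported target language: {target_lang}")
    return f">>{target_lang}<< {text}"


print(with_target_token("Allegro to internetowa platforma e-commerce.", "ces"))
# → >>ces<< Allegro to internetowa platforma e-commerce.

# With transformers (not executed here; the repo id is hypothetical):
# from transformers import pipeline
# translator = pipeline("translation", model="allegro/BiDi-ces-pol")
# translator(with_target_token("Allegro to internetowa platforma e-commerce.", "ces"))
```

The helper fails fast on a direction the checkpoint does not support, mirroring the README's note that each model accepts only two directions.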