Upload folder using huggingface_hub

Files changed (14)

.ipynb_checkpoints/README-checkpoint.md +106 -0
README.md +106 -3
config.json +10 -0
eole-config.yaml +98 -0
eole-model/config.json +132 -0
eole-model/en.spm.model +3 -0
eole-model/model.00.safetensors +3 -0
eole-model/pl.spm.model +3 -0
eole-model/vocab.json +0 -0
model.bin +3 -0
source_vocabulary.json +0 -0
src.spm.model +3 -0
target_vocabulary.json +0 -0
tgt.spm.model +3 -0

.ipynb_checkpoints/README-checkpoint.md ADDED

	@@ -0,0 +1,106 @@

+---
+language:
+- en
+- pl
+tags:
+- translation
+license: cc-by-4.0
+datasets:
+- quickmt/quickmt-train.pl-en
+model-index:
+- name: quickmt-pl-en
+  results:
+  - task:
+      name: Translation pol-eng
+      type: translation
+      args: pol-eng
+    dataset:
+      name: flores101-devtest
+      type: flores_101
+      args: pol_Latn eng_Latn devtest
+    metrics:
+    - name: BLEU
+      type: bleu
+      value: 27.46
+    - name: CHRF
+      type: chrf
+      value: 57.18
+    - name: COMET
+      type: comet
+      value: 85.04
+---
+# `quickmt-pl-en` Neural Machine Translation Model
+`quickmt-pl-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `pl` into `en`.
+## Try it on our Hugging Face Space
+Give it a try before downloading here: https://huggingface.co/spaces/quickmt/QuickMT-Demo
+## Model Information
+* Trained using [`eole`](https://github.com/eole-nlp/eole)
+* 195M-parameter transformer 'big' with 8 encoder layers and 2 decoder layers
+* Separate 20k SentencePiece vocabularies for source and target
+* Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.pl-en/tree/main
+See the `eole` model configuration (`eole-config.yaml`) in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
+## Usage with `quickmt`
+For GPU inference, install the NVIDIA CUDA toolkit first.
+Next, install the `quickmt` Python library and download the model:
+```bash
+git clone https://github.com/quickmt/quickmt.git
+pip install ./quickmt/
+quickmt-model-download quickmt/quickmt-pl-en ./quickmt-pl-en
+```
+Finally, use the model in Python:
+```python
+from quickmt import Translator
+# Auto-detects GPU, set to "cpu" to force CPU inference
+t = Translator("./quickmt-pl-en/", device="auto")
+# Translate - set beam size to 1 for faster speed (but lower quality)
+sample_text = 'Dr Ehud Ur, będący profesorem medycyny na Uniwersytecie Dalhousie w Halifaxie w Nowej Szkocji oraz przewodniczącym oddziału klinicznego i naukowego Kanadyjskiego Stowarzyszenia Cukrzycy, przestrzegł, iż badania nadal dopiero się zaczynają.'
+t(sample_text, beam_size=5)
+```
+> 'Dr. Ehud Ur, a professor of medicine at Dalhousie University in Halifax, Nova Scotia and chairman of the clinical and scientific division of the Canadian Diabetes Association, warned that research is still just beginning.'
+```python
+# Get alternative translations by sampling
+# You can pass any CTranslate2 `translate_batch` arguments
+t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+```
+> 'Professor of Medicine at Dalhous University Halifax in Nova Scotia, MD and Chair of the Canadian Diabetes Association’s Clinical and Scientific Division, cautioned that research is just beginning.'
+The model is in `ctranslate2` format and the tokenizers are `sentencepiece` models, so you can use `ctranslate2` directly instead of going through `quickmt`. It should also be possible to use this model with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format for use with `eole` is also provided.
+## Metrics
+`bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("pol_Latn"->"eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (a larger batch size can be faster).
+|                                  |   bleu |   chrf2 |   comet22 |   Time (s) |
+|:---------------------------------|-------:|--------:|----------:|-----------:|
+| quickmt/quickmt-pl-en            |  27.46 |   57.18 |     85.04 |       1.46 |
+| Helsinki-NLP/opus-mt-pl-en       |  25.55 |   55.39 |     83.8  |       4.01 |
+| facebook/nllb-200-distilled-600M |  29.28 |   57.11 |     84.65 |      21.61 |
+| facebook/nllb-200-distilled-1.3B |  30.99 |   58.77 |     86.04 |      37.64 |
+| facebook/m2m100_418M             |  22.12 |   52.51 |     80.41 |      17.99 |
+| facebook/m2m100_1.2B             |  27.13 |   56.36 |     84.48 |      35.01 |

README.md CHANGED

@@ -1,3 +1,106 @@
----
-license: cc-by-4.0
----

+---
+language:
+- en
+- pl
+tags:
+- translation
+license: cc-by-4.0
+datasets:
+- quickmt/quickmt-train.pl-en
+model-index:
+- name: quickmt-pl-en
+  results:
+  - task:
+      name: Translation pol-eng
+      type: translation
+      args: pol-eng
+    dataset:
+      name: flores101-devtest
+      type: flores_101
+      args: pol_Latn eng_Latn devtest
+    metrics:
+    - name: BLEU
+      type: bleu
+      value: 27.46
+    - name: CHRF
+      type: chrf
+      value: 57.18
+    - name: COMET
+      type: comet
+      value: 85.04
+---
+# `quickmt-pl-en` Neural Machine Translation Model
+`quickmt-pl-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `pl` into `en`.
+## Try it on our Hugging Face Space
+Give it a try before downloading here: https://huggingface.co/spaces/quickmt/QuickMT-Demo
+## Model Information
+* Trained using [`eole`](https://github.com/eole-nlp/eole)
+* 195M-parameter transformer 'big' with 8 encoder layers and 2 decoder layers
+* Separate 20k SentencePiece vocabularies for source and target
+* Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.pl-en/tree/main
+See the `eole` model configuration (`eole-config.yaml`) in this repository for further details, and the `eole-model` folder for the raw `eole` (PyTorch) model.
+## Usage with `quickmt`
+For GPU inference, install the NVIDIA CUDA toolkit first.
+Next, install the `quickmt` Python library and download the model:
+```bash
+git clone https://github.com/quickmt/quickmt.git
+pip install ./quickmt/
+quickmt-model-download quickmt/quickmt-pl-en ./quickmt-pl-en
+```
+Finally, use the model in Python:
+```python
+from quickmt import Translator
+# Auto-detects GPU, set to "cpu" to force CPU inference
+t = Translator("./quickmt-pl-en/", device="auto")
+# Translate - set beam size to 1 for faster speed (but lower quality)
+sample_text = 'Dr Ehud Ur, będący profesorem medycyny na Uniwersytecie Dalhousie w Halifaxie w Nowej Szkocji oraz przewodniczącym oddziału klinicznego i naukowego Kanadyjskiego Stowarzyszenia Cukrzycy, przestrzegł, iż badania nadal dopiero się zaczynają.'
+t(sample_text, beam_size=5)
+```
+> 'Dr. Ehud Ur, a professor of medicine at Dalhousie University in Halifax, Nova Scotia and chairman of the clinical and scientific division of the Canadian Diabetes Association, warned that research is still just beginning.'
+```python
+# Get alternative translations by sampling
+# You can pass any CTranslate2 `translate_batch` arguments
+t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+```
+> 'Professor of Medicine at Dalhous University Halifax in Nova Scotia, MD and Chair of the Canadian Diabetes Association’s Clinical and Scientific Division, cautioned that research is just beginning.'
+The model is in `ctranslate2` format and the tokenizers are `sentencepiece` models, so you can use `ctranslate2` directly instead of going through `quickmt`. It should also be possible to use this model with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`. A model in safetensors format for use with `eole` is also provided.
+## Metrics
+`bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("pol_Latn"->"eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) on an RTX 4070s GPU with batch size 32 (a larger batch size can be faster).
+|                                  |   bleu |   chrf2 |   comet22 |   Time (s) |
+|:---------------------------------|-------:|--------:|----------:|-----------:|
+| quickmt/quickmt-pl-en            |  27.46 |   57.18 |     85.04 |       1.46 |
+| Helsinki-NLP/opus-mt-pl-en       |  25.55 |   55.39 |     83.8  |       4.01 |
+| facebook/nllb-200-distilled-600M |  29.28 |   57.11 |     84.65 |      21.61 |
+| facebook/nllb-200-distilled-1.3B |  30.99 |   58.77 |     86.04 |      37.64 |
+| facebook/m2m100_418M             |  22.12 |   52.51 |     80.41 |      17.99 |
+| facebook/m2m100_1.2B             |  27.13 |   56.36 |     84.48 |      35.01 |
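
To put the "Time (s)" column of the table above in perspective, the sketch below converts each timing into sentences per second and into a slowdown factor relative to `quickmt/quickmt-pl-en`. The timings and the 1012-sentence count are copied from the README; the names `TIMES_S`, `sentences_per_second` and `slowdown_vs` are illustrative and not part of any library.

```python
# Wall-clock seconds to translate flores-devtest, copied from the README table.
TIMES_S = {
    "quickmt/quickmt-pl-en": 1.46,
    "Helsinki-NLP/opus-mt-pl-en": 4.01,
    "facebook/nllb-200-distilled-600M": 21.61,
    "facebook/nllb-200-distilled-1.3B": 37.64,
    "facebook/m2m100_418M": 17.99,
    "facebook/m2m100_1.2B": 35.01,
}
N_SENTENCES = 1012  # size of the flores devtest set, as stated in the README


def sentences_per_second(model: str) -> float:
    """Throughput implied by the reported wall-clock time."""
    return N_SENTENCES / TIMES_S[model]


def slowdown_vs(model: str, reference: str = "quickmt/quickmt-pl-en") -> float:
    """How many times longer `model` took than `reference`."""
    return TIMES_S[model] / TIMES_S[reference]


if __name__ == "__main__":
    for name in TIMES_S:
        print(f"{name:34s} {sentences_per_second(name):7.1f} sent/s "
              f"{slowdown_vs(name):5.1f}x")
```

These figures only restate the table: they were measured on one GPU at batch size 32, so they are indicative rather than a general benchmark.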

config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "add_source_bos": false,
+  "add_source_eos": false,
+  "bos_token": "<s>",
+  "decoder_start_token": "<s>",
+  "eos_token": "</s>",
+  "layer_norm_epsilon": 1e-06,
+  "multi_query_attention": false,
+  "unk_token": "<unk>"
+}

eole-config.yaml ADDED

	@@ -0,0 +1,98 @@

+## IO
+save_data: data
+overwrite: True
+seed: 1234
+report_every: 100
+valid_metrics: ["BLEU"]
+tensorboard: true
+tensorboard_log_dir: tensorboard
+### Vocab
+src_vocab: pl.eole.vocab
+tgt_vocab: en.eole.vocab
+src_vocab_size: 20000
+tgt_vocab_size: 20000
+vocab_size_multiple: 8
+share_vocab: false
+n_sample: 0
+data:
+    corpus_1:
+        # path_src: hf://quickmt/quickmt-train.pl-en/pl
+        # path_tgt: hf://quickmt/quickmt-train.pl-en/en
+        # path_sco: hf://quickmt/quickmt-train.pl-en/sco
+        path_src: train.pl
+        path_tgt: train.en
+    valid:
+        path_src: valid.pl
+        path_tgt: valid.en
+transforms: [sentencepiece, filtertoolong]
+transforms_configs:
+  sentencepiece:
+    src_subword_model: "pl.spm.model"
+    tgt_subword_model: "en.spm.model"
+  filtertoolong:
+    src_seq_length: 256
+    tgt_seq_length: 256
+training:
+    # Run configuration
+    model_path: quickmt-pl-en-eole-model
+    #train_from: model
+    keep_checkpoint: 4
+    train_steps: 100000
+    save_checkpoint_steps: 5000
+    valid_steps: 5000
+    # Train on a single GPU
+    world_size: 1
+    gpu_ranks: [0]
+    # Batching 10240
+    batch_type: "tokens"
+    batch_size: 8000
+    valid_batch_size: 4096
+    batch_size_multiple: 8
+    accum_count: [10]
+    accum_steps: [0]
+    # Optimizer & Compute
+    compute_dtype: "fp16"
+    optim: "adamw"
+    #use_amp: False
+    learning_rate: 2.0
+    warmup_steps: 4000
+    decay_method: "noam"
+    adam_beta2: 0.998
+    # Data loading
+    bucket_size: 128000
+    num_workers: 4
+    prefetch_factor: 32
+    # Hyperparams
+    dropout_steps: [0]
+    dropout: [0.1]
+    attention_dropout: [0.1]
+    max_grad_norm: 0
+    label_smoothing: 0.1
+    average_decay: 0.0001
+    param_init_method: xavier_uniform
+    normalization: "tokens"
+model:
+    architecture: "transformer"
+    share_embeddings: false
+    share_decoder_embeddings: false
+    hidden_size: 1024
+    encoder:
+        layers: 8
+    decoder:
+        layers: 2
+    heads: 8
+    transformer_ff: 4096
+    embeddings:
+        word_vec_size: 1024
+        position_encoding_type: "SinusoidalInterleaved"
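
The `model` block above (hidden size 1024, 8 encoder and 2 decoder layers, feed-forward size 4096, unshared 20k vocabularies) can be sanity-checked against the "195M parameter" figure in the README. The sketch below is a rough count under textbook assumptions that are not confirmed by this repository: every linear layer has a bias, every layer norm has a scale and a bias, each stack ends with one extra layer norm, and sinusoidal position encodings carry no trainable parameters. The layout `eole` actually uses may differ slightly.

```python
def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer (bias is an assumption)."""
    return n_in * n_out + n_out


def layer_norm(d: int) -> int:
    """Scale and bias vectors."""
    return 2 * d


def attention(d: int) -> int:
    """Query, key, value and output projections."""
    return 4 * linear(d, d)


def feed_forward(d: int, ff: int) -> int:
    return linear(d, ff) + linear(ff, d)


def encoder_layer(d: int, ff: int) -> int:
    return attention(d) + feed_forward(d, ff) + 2 * layer_norm(d)


def decoder_layer(d: int, ff: int) -> int:
    # Self-attention plus cross-attention, hence two attention blocks.
    return 2 * attention(d) + feed_forward(d, ff) + 3 * layer_norm(d)


def estimate_parameters(src_vocab=20000, tgt_vocab=20000, d=1024, ff=4096,
                        enc_layers=8, dec_layers=2) -> int:
    embeddings = src_vocab * d + tgt_vocab * d   # share_embeddings: false
    generator = linear(d, tgt_vocab)             # share_decoder_embeddings: false
    encoder = enc_layers * encoder_layer(d, ff) + layer_norm(d)
    decoder = dec_layers * decoder_layer(d, ff) + layer_norm(d)
    return embeddings + generator + encoder + decoder


if __name__ == "__main__":
    total = estimate_parameters()
    print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to roughly 195.8M, which is consistent with the README. Most of the weight sits in the 8 encoder layers, while the 2-layer decoder stays small, which is presumably what keeps autoregressive decoding fast.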

eole-model/config.json ADDED

	@@ -0,0 +1,132 @@

+{
+  "share_vocab": false,
+  "n_sample": 0,
+  "tensorboard_log_dir_dated": "tensorboard/Sep-15_20-41-35",
+  "seed": 1234,
+  "src_vocab": "pl.eole.vocab",
+  "valid_metrics": [
+    "BLEU"
+  ],
+  "tgt_vocab": "en.eole.vocab",
+  "vocab_size_multiple": 8,
+  "tensorboard_log_dir": "tensorboard",
+  "tgt_vocab_size": 20000,
+  "tensorboard": true,
+  "save_data": "data",
+  "overwrite": true,
+  "report_every": 100,
+  "src_vocab_size": 20000,
+  "transforms": [
+    "sentencepiece",
+    "filtertoolong"
+  ],
+  "training": {
+    "batch_size_multiple": 8,
+    "average_decay": 0.0001,
+    "compute_dtype": "torch.float16",
+    "save_checkpoint_steps": 5000,
+    "world_size": 1,
+    "accum_count": [
+      10
+    ],
+    "valid_batch_size": 4096,
+    "attention_dropout": [
+      0.1
+    ],
+    "gpu_ranks": [
+      0
+    ],
+    "valid_steps": 5000,
+    "optim": "adamw",
+    "train_steps": 100000,
+    "accum_steps": [
+      0
+    ],
+    "bucket_size": 128000,
+    "prefetch_factor": 32,
+    "dropout": [
+      0.1
+    ],
+    "warmup_steps": 4000,
+    "batch_size": 8000,
+    "normalization": "tokens",
+    "model_path": "quickmt-pl-en-eole-model",
+    "learning_rate": 2.0,
+    "label_smoothing": 0.1,
+    "dropout_steps": [
+      0
+    ],
+    "param_init_method": "xavier_uniform",
+    "decay_method": "noam",
+    "adam_beta2": 0.998,
+    "max_grad_norm": 0.0,
+    "num_workers": 0,
+    "batch_type": "tokens",
+    "keep_checkpoint": 4
+  },
+  "transforms_configs": {
+    "sentencepiece": {
+      "src_subword_model": "${MODEL_PATH}/pl.spm.model",
+      "tgt_subword_model": "${MODEL_PATH}/en.spm.model"
+    },
+    "filtertoolong": {
+      "src_seq_length": 256,
+      "tgt_seq_length": 256
+    }
+  },
+  "model": {
+    "share_embeddings": false,
+    "hidden_size": 1024,
+    "architecture": "transformer",
+    "heads": 8,
+    "position_encoding_type": "SinusoidalInterleaved",
+    "transformer_ff": 4096,
+    "share_decoder_embeddings": false,
+    "decoder": {
+      "hidden_size": 1024,
+      "tgt_word_vec_size": 1024,
+      "layers": 2,
+      "heads": 8,
+      "decoder_type": "transformer",
+      "position_encoding_type": "SinusoidalInterleaved",
+      "transformer_ff": 4096,
+      "n_positions": null
+    },
+    "encoder": {
+      "hidden_size": 1024,
+      "encoder_type": "transformer",
+      "layers": 8,
+      "heads": 8,
+      "position_encoding_type": "SinusoidalInterleaved",
+      "transformer_ff": 4096,
+      "src_word_vec_size": 1024,
+      "n_positions": null
+    },
+    "embeddings": {
+      "position_encoding_type": "SinusoidalInterleaved",
+      "tgt_word_vec_size": 1024,
+      "src_word_vec_size": 1024,
+      "word_vec_size": 1024
+    }
+  },
+  "data": {
+    "corpus_1": {
+      "path_align": null,
+      "path_src": "train.pl",
+      "path_tgt": "train.en",
+      "transforms": [
+        "sentencepiece",
+        "filtertoolong"
+      ]
+    },
+    "valid": {
+      "path_align": null,
+      "path_src": "valid.pl",
+      "path_tgt": "valid.en",
+      "transforms": [
+        "sentencepiece",
+        "filtertoolong"
+      ]
+    }
+  }
+}
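
The `training` values in this config (`"decay_method": "noam"`, `"learning_rate": 2.0`, `"warmup_steps": 4000`, `"batch_size": 8000` with `"batch_type": "tokens"`, and `"accum_count": [10]`) are easier to read once turned into numbers. The sketch below uses the standard Noam schedule from the original Transformer paper, scaled by `learning_rate`, on the assumption that `eole` follows the same formula as other OpenNMT-style toolkits; the function names are illustrative.

```python
def noam_lr(step: int, learning_rate: float = 2.0, hidden_size: int = 1024,
            warmup_steps: int = 4000) -> float:
    """Noam schedule: linear warmup, then inverse square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * hidden_size ** -0.5 * scale


def tokens_per_update(batch_size: int = 8000, accum_count: int = 10,
                      world_size: int = 1) -> int:
    """Upper bound on tokens contributing to one optimizer step."""
    return batch_size * accum_count * world_size


if __name__ == "__main__":
    for step in (1000, 4000, 20000, 100000):
        print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
    print("tokens per update (at most):", tokens_per_update())
```

With these settings the learning rate peaks at about 9.9e-4 at step 4000 and decays to about 2.0e-4 by step 100000. Each update sees at most 80,000 tokens, since `batch_size` is a token budget per batch rather than an exact count.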

eole-model/en.spm.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:83fa4a21ff8359b8827f033977c0e563fd0786ba84bbc93d4a1e22a0dd81ee7f
+size 585058

eole-model/model.00.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d2700218a5e8c84d753274d2be97700d8d086094a8ecd009dbcb91a8b7043917
+size 823882912

eole-model/pl.spm.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b6ad874c44613df56fb48be31fb64b6b5fc96a14a7888e802b0f91dbf98a8836
+size 605240

eole-model/vocab.json ADDED

The diff for this file is too large to render. See raw diff

model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2e939c36e6aa38dfa320166a55b3adbfa17bcfd918a560bfb8b3ba6991ac16c2
+size 401699775

source_vocabulary.json ADDED

The diff for this file is too large to render. See raw diff

src.spm.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b6ad874c44613df56fb48be31fb64b6b5fc96a14a7888e802b0f91dbf98a8836
+size 605240

target_vocabulary.json ADDED

The diff for this file is too large to render. See raw diff

tgt.spm.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:83fa4a21ff8359b8827f033977c0e563fd0786ba84bbc93d4a1e22a0dd81ee7f
+size 585058
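
The binary files in this commit appear in the diff only as Git LFS pointers (a `version` line, an `oid sha256:...` line and a `size` line). As a minimal sketch of how a downloaded file could be checked against such a pointer, the code below parses the pointer text and compares size and digest using only the standard library. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the pointer text is the one shown above for `tgt.spm.model` with the leading `+` diff markers removed.

```python
import hashlib
import io

# Pointer for tgt.spm.model, copied from the diff above.
TGT_SPM_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:83fa4a21ff8359b8827f033977c0e563fd0786ba84bbc93d4a1e22a0dd81ee7f
size 585058
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(fileobj, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Stream a binary file object and compare its size and hash to the pointer."""
    hasher = hashlib.new(pointer["algo"])
    total = 0
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        hasher.update(chunk)
        total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(TGT_SPM_POINTER)
    print(pointer["algo"], pointer["size"])
    # With the real file on disk (path is an assumption):
    # with open("quickmt-pl-en/tgt.spm.model", "rb") as f:
    #     print(matches_pointer(f, pointer))
    # Self-contained demonstration with in-memory bytes:
    demo = {"algo": "sha256", "size": 3,
            "digest": hashlib.sha256(b"abc").hexdigest()}
    print(matches_pointer(io.BytesIO(b"abc"), demo))
```

The same comparison also explains a detail visible in the pointers: `tgt.spm.model` and `eole-model/en.spm.model` share one `oid`, as do `src.spm.model` and `eole-model/pl.spm.model`, so each pair is the same file stored under two names.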