---
language: eo
license: mit
---

# EsperBERTo: A RoBERTa-like model for Esperanto

This is a RoBERTa-like model trained from scratch on the Esperanto language.

## Model description

The model has 6 layers, a hidden size of 768, 12 attention heads, and a total of 84 million parameters. It is based on the RoBERTa architecture. The tokenizer is a byte-level Byte-Pair Encoding (BPE) tokenizer trained from scratch on the same Esperanto corpus (a training sketch is given at the end of this card).

- **Model:** RoBERTa-like
- **Layers:** 6
- **Hidden size:** 768
- **Heads:** 12
- **Parameters:** 84M
- **Tokenizer:** Byte-level BPE
- **Vocabulary size:** 52,000

## Training data

The model was trained on the Esperanto portion of the OSCAR corpus (`oscar.eo.txt`), which is approximately 3 GB of text.

## Training procedure

The model was trained for one epoch on the OSCAR corpus using the `Trainer` API from the `transformers` library. Training was performed on a single GPU.

### Hyperparameters

- `output_dir`: `"./EsperBERTo"`
- `overwrite_output_dir`: `True`
- `num_train_epochs`: 1
- `per_gpu_train_batch_size`: 64
- `save_steps`: 10_000
- `save_total_limit`: 2
- `prediction_loss_only`: `True`

The final training loss was `6.1178`. A sketch of how these hyperparameters map onto the `Trainer` API is given at the end of this card.

## Evaluation results

The model was not evaluated on a downstream task in the training notebook, but its capabilities can be probed with the `fill-mask` pipeline.

Example 1:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

fill_mask("La suno <mask>.")
```

Output:

```
[{'score': 0.013023526407778263, 'token': 316, 'token_str': ' estas', 'sequence': 'La suno estas.'},
 {'score': 0.008523152209818363, 'token': 607, 'token_str': ' min', 'sequence': 'La suno min.'},
 {'score': 0.007405377924442291, 'token': 2575, 'token_str': ' okuloj', 'sequence': 'La suno okuloj.'},
 {'score': 0.007219308987259865, 'token': 1635, 'token_str': ' tago', 'sequence': 'La suno tago.'},
 {'score': 0.006888304837048054, 'token': 394, 'token_str': ' estis', 'sequence': 'La suno estis.'}]
```

Example 2:

```python
fill_mask("Jen la komenco de bela <mask>.")
```

Output:

```
[{'score': 0.016247423365712166, 'token': 1635, 'token_str': ' tago', 'sequence': 'Jen la komenco de bela tago.'},
 {'score': 0.009718689136207104, 'token': 1021, 'token_str': ' tempo', 'sequence': 'Jen la komenco de bela tempo.'},
 {'score': 0.007543196901679039, 'token': 2257, 'token_str': ' kongreso', 'sequence': 'Jen la komenco de bela kongreso.'},
 {'score': 0.0071307034231722355, 'token': 1161, 'token_str': ' vivo', 'sequence': 'Jen la komenco de bela vivo.'},
 {'score': 0.006644904613494873, 'token': 758, 'token_str': ' jaroj', 'sequence': 'Jen la komenco de bela jaroj.'}]
```

## Intended uses & limitations

This model is intended to be a general-purpose language model for Esperanto. It can be used for masked language modeling and can be fine-tuned for various downstream tasks such as:

- Text classification
- Token classification (part-of-speech tagging, named entity recognition)
- Question answering

Since the model was trained for only one epoch on a relatively small dataset, its performance may be limited. For better results on a specific task, fine-tuning on a relevant dataset is recommended, as sketched below.
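As an illustration of what such fine-tuning might look like, here is a minimal sketch for sequence classification. Everything beyond the checkpoint itself is an assumption: the output directory, the hyperparameters, and the two-sentence toy dataset are placeholders for a real labeled Esperanto dataset.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the pretrained checkpoint; the classification head on top is
# randomly initialized and learned during fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("./EsperBERTo")
model = AutoModelForSequenceClassification.from_pretrained(
    "./EsperBERTo", num_labels=2
)

# Toy placeholder data ("I love Esperanto." / "I hate rain."); substitute
# a real labeled dataset here.
data = Dataset.from_dict({
    "text": ["Mi amas Esperanton.", "Mi malamas pluvon."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=64
    )

train_dataset = data.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./EsperBERTo-finetuned",  # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```

The same pattern applies to the other tasks listed above by swapping the head, e.g. `AutoModelForTokenClassification` for part-of-speech tagging or NER.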
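For reference, the byte-level BPE tokenizer described in the model description could be trained from scratch along these lines with the `tokenizers` library; the `min_frequency` value and the special-token list are assumptions following the standard RoBERTa setup.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the raw Esperanto corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumed frequency cutoff; not stated in this card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the model directory.
tokenizer.save_model("./EsperBERTo")
```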
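The pre-training run itself, reconstructed from the hyperparameters listed above, might then look roughly like the following. This is a sketch of the standard `Trainer` masked-language-modeling recipe, not the notebook's exact code: `max_position_embeddings`, `block_size`, and the masking probability are assumptions, and `per_device_train_batch_size` is the current name of the `per_gpu_train_batch_size` argument listed above.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Architecture matching the model description: 6 layers, hidden size 768,
# 12 attention heads, 52k vocabulary (~84M parameters in total).
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,  # assumed, following the RoBERTa convention
    num_hidden_layers=6,
    num_attention_heads=12,
    type_vocab_size=1,
)
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", model_max_length=512)
model = RobertaForMaskedLM(config=config)

# One example per line of the raw corpus. LineByLineTextDataset is
# deprecated in recent transformers releases in favor of the datasets
# library, but matches the era of the original notebook.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,  # assumed maximum sequence length per example
)

# Dynamic masking for the MLM objective (15% is the standard rate).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,  # listed above under its older name
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
```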