Moi!

I used sentence pairs from https://tatoeba.org/ to fine-tune an NLLB model for Gronings. Consider this an early beta version!

I am a linguist and a speaker of Gronings, so I carried out the evaluation by expert eyeball. The GitHub repo contains some BLEU and ChrF score plots, but I haven't thoroughly investigated performance through those metrics and am hesitant to claim any particular general translation quality for this version. The model produces something that is recognizable as Gronings when the input language is Dutch. I found that interesting enough for a proof of concept, so I decided to publish.
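For anyone who wants to compute such scores themselves, here is a minimal sketch using the sacrebleu library (my choice for this illustration, not necessarily the exact setup in the repo; the sentence lists are placeholders, not the actual evaluation data):

from sacrebleu.metrics import BLEU, CHRF

# Placeholder data: model outputs plus one stream of reference translations.
hypotheses = ["a model translation", "another model translation"]
references = [["a reference translation", "another reference translation"]]

print(BLEU().corpus_score(hypotheses, references))  # corpus-level BLEU
print(CHRF().corpus_score(hypotheses, references))  # corpus-level ChrF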

The model's hyperparameters are not yet optimal, so I am planning to upload an even better version in the future.

Update 10 September 2025: I've updated the code to the latest version of transformers, so the model can immediately be used by anyone, with no tokenizer black magic needed. I also added about 500 more parallel nld-gos sentence pairs to the training data. Only the additional Gronings language token needs to be added to the tokenizer at initialization (as in the snippet below); after that, everything should work.

Update 21 November 2025: I added another ~450 sentences and shortened training a little bit to avoid overfitting.

Update 28 November 2025: Added another ~90 sentences. Upon further inspection, and inspired by https://www.youtube.com/watch?v=z64a7USuGX0, I decided to train for much longer, and this version seems to perform quite a bit better than the previous one.

Here is a minimal example code snippet to get the model up and running:
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'

# Load the fine-tuned model and tokenizer. The only non-standard step is
# registering the extra Gronings language token 'gos_Latn'.
# force_download avoids picking up a stale cached tokenizer.
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True,
                                          additional_special_tokens=["gos_Latn"])

def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding='longest',
        truncation=True,
        max_length=120
    )
    result = model.generate(
        **inputs.to(model.device),
        # Force the first generated token to be the target language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Heuristic cap on output length: 16 tokens plus 1.5x the input length.
        max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1]),
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

# Dutch for "This is a test sentence to check whether the code works."
translate("Dit is een testzin om te kijken of de code werkt.")

See https://github.com/tom9358/nllb-tryout for everything (code, more documentation, and references) except the model itself and the training data.

Check out the dedicated Hugging Face Space to try out the model!

Alternatively, another way to try out the model (rather slow, but free and accessible to everyone): https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M

The code there is also a minimal example of how to use this model.

Don't hesitate to contact me if anything comes up!
