Thanks for the great model
I have been testing this model recently, and I've found that it may be the best open-source model available for fluent Finnish generation, compared to Poro 2, EuroLLM 22B Preview, and Gemma 3 27B. I've tested it on English->Finnish and French->Finnish translation of fiction. The other models, besides making clear mistakes, also produce a lot of unnatural or anglicism-laden language, which is almost completely absent from this model. On some texts this model rivals large commercial models like GPT-5. Excellent work!
The only issue is that translation quality is currently not consistent. One generation may be extremely good, while the next one from the same source text might contain broken grammar. Setting the temperature to 0 did not help; instead, it introduced new problems. I'll have to test whether few-shot prompting can fix these issues; otherwise, a translation-focused finetune is likely required.
Below is an example translation for those interested.
Prompt:
ENGLISH:
After leaving Vienna, and long before you come to Budapest, the Danube enters a region of singular loneliness and desolation, where its waters spread away on all sides regardless of a main channel, and the country becomes a swamp for miles upon miles, covered by a vast sea of low willow-bushes. On the big maps this deserted area is painted in a fluffy blue, growing fainter in color as it leaves the banks, and across it may be seen in large straggling letters the word Sumpfe, meaning marshes.
In high flood this great acreage of sand, shingle-beds, and willow-grown islands is almost topped by the water, but in normal seasons the bushes bend and rustle in the free winds, showing their silver leaves to the sunshine in an ever-moving plain of bewildering beauty. These willows never attain to the dignity of trees; they have no rigid trunks; they remain humble bushes, with rounded tops and soft outline, swaying on slender stems that answer to the least pressure of the wind; supple as grasses, and so continually shifting that they somehow give the impression that the entire plain is moving and alive. For the wind sends waves rising and falling over the whole surface, waves of leaves instead of waves of water, green swells like the sea, too, until the branches turn and lift, and then silvery white as their underside turns to the sun.
Happy to slip beyond the control of the stern banks, the Danube here wanders about at will among the intricate network of channels intersecting the islands everywhere with broad avenues down which the waters pour with a shouting sound; making whirlpools, eddies, and foaming rapids; tearing at the sandy banks; carrying away masses of shore and willow-clumps; and forming new islands innumerably which shift daily in size and shape and possess at best an impermanent life, since the flood-time obliterates their very existence.
FINNISH (OFFICIAL TRANSLATION):
Completion:
Wienistä lähdettyään ja ennen Budapestiin tuloa Tonava kulkee pitkän matkaa yksinäistä ja autiota seutua, jossa sen vedet hajaantuvat joka suunnalle välittämättä pääuomasta ja jossa maa muuttuu monien mailien alalla rämeeksi, jota peittää laaja pajukko. Suurissa kartoissa tämä autio seutu on väritetty pehmeällä sinisellä, joka vaalenee kohti rantoja, ja sen yli on kirjoitettu isoilla harvoilla kirjaimilla sana Sumpfe, joka merkitsee rämettä.
Kun tulva on korkeimmillaan, tämä suuri hiekan, soran ja pajukkoisten saarten alue on melkein kokonaan veden peitossa, mutta normaaleina aikoina pajut kumartuvat ja kahisevat vapaassa tuulessa, näyttäen hopeanhohtoiset lehtensä auringolle alati liikkuvana, hämmästyttävän kauniina tasankona. Nämä pajut eivät koskaan saavuta puun arvokkuutta; niissä ei ole jäykkiä runkoja; ne jäävät nöyriksi pensaiksi, joiden pyöreät latvat ja pehmeä muoto huojuvat hoikilla varsilla, jotka reagoivat pienimpäänkin tuulen henkäykseen; ne ovat joustavia kuin ruohot ja niin jatkuvassa liikkeessä, että ne antavat jotenkin sen vaikutelman, että koko tasanko liikkuu ja elää. Tuuli nimittäin saa lehtien aallot kohoamaan ja laskemaan koko tasangolla, lehtiaaltoja siinä missä vesiaaltojakin, ja vihreitä aaltoja kuin meren, kunnes oksat kääntyvät ja nousevat, ja silloin niiden alapinta kääntyy aurinkoon päin, hopeisen valkoisena.
Onnellinen päästyään ankarien rantojen otteesta, Tonava vaeltaa täällä vapaasti mielihalujensa mukaan saarien välisen mutkikkaan kanavien verkoston halki, joka on täynnä leveitä katuja, joita pitkin vedet virtaavat äänekkäästi; se synnyttää pyörteitä, virtauksia ja kuohuvia koskia; raatelee hiekkarantoja ja pajuja; vie mukanaan suuria määriä rantoja ja pajuja; ja muodostaa lukemattomia uusia saaria, jotka muuttuvat päivittäin kooltaan ja muodoltaan ja joilla on korkeintaan väliaikainen elämä, sillä tulva-aika pyyhkii pois niiden olemassaolon.
Source: https://en.wikisource.org/wiki/The_Willows_(Blackwood)/I
Hey! I am really happy to hear that!
We haven't adapted the model for translation yet! Only about 5% of its training data consisted of parallel texts.
The data format during pretraining differed from the one you are using, so few-shot prompting may help.
We will publish a dedicated translation model as soon as it is ready.
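In the meantime, here is a minimal sketch of the kind of few-shot prompt this suggests, reusing the ENGLISH / FINNISH (OFFICIAL TRANSLATION) delimiters from the example above. Whether these delimiters match the pretraining format is an assumption, and the exemplar pair is purely illustrative:

```python
# Hypothetical few-shot translation prompt. The exemplar pair below is
# illustrative only, not from the model's training data.
EXEMPLAR_EN = "The house stood alone at the edge of the forest."
EXEMPLAR_FI = "Talo seisoi yksin metsän reunassa."

def build_prompt(source_text: str) -> str:
    return (
        f"ENGLISH:\n{EXEMPLAR_EN}\n"
        f"FINNISH (OFFICIAL TRANSLATION):\n{EXEMPLAR_FI}\n\n"
        f"ENGLISH:\n{source_text}\n"
        "FINNISH (OFFICIAL TRANSLATION):\n"
    )
```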
Just finished a gentle all-layer SFT (instruction Q&A plus RAG) without adapters, using FlashAttention 2 and ZeRO-3.
BF16, effective batch size 24, sequence length 4096, gradient clipping at 1.0, weight decay 0.01, and a 2e-5 learning rate on a cosine schedule with 3% warmup.
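For reference, a minimal sketch of those hyperparameters as Hugging Face `TrainingArguments`. The output directory, DeepSpeed config filename, and the per-device/accumulation split of the effective batch are my assumptions, not the exact setup:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="tildeopen-sft",      # assumed name
    bf16=True,                       # BF16 training
    per_device_train_batch_size=2,   # 2 * 3 accum * 4 GPUs = 24 effective (assumed split)
    gradient_accumulation_steps=3,
    learning_rate=2e-5,              # peak LR
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,               # 3% warmup
    weight_decay=0.01,
    max_grad_norm=1.0,               # gradient clipping at 1.0
    deepspeed="ds_zero3.json",       # ZeRO-3 config file (assumed name)
)
# Sequence length (4096) is enforced at tokenization/collation time, and
# FlashAttention 2 is enabled when loading the model, e.g.
# AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
```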
Since SFT is delicate work, my quest for flat minima emphasized regularization and keeping things a bit noisy.
I wrote a custom collator for proper user-query masking in ChatML and introduced additional special tokens to mark the template.
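A minimal sketch of such a collator, assuming each example carries a ChatML-style `messages` list and that the ChatML markers are registered as special tokens; the field name is an assumption, and the extra template-marking tokens are omitted for brevity:

```python
# Sketch only: masks everything except assistant turns so the loss is
# computed on responses, not on user queries or system prompts.
# Assumes tokenizer.pad_token_id is set.
from dataclasses import dataclass
import torch
from transformers import PreTrainedTokenizerBase

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss

@dataclass
class ChatMLMaskingCollator:
    tokenizer: PreTrainedTokenizerBase
    max_length: int = 4096

    def __call__(self, batch):
        input_ids, labels = [], []
        for example in batch:
            ids, labs = [], []
            for turn in example["messages"]:
                text = f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n"
                turn_ids = self.tokenizer(text, add_special_tokens=False)["input_ids"]
                ids.extend(turn_ids)
                if turn["role"] == "assistant":
                    labs.extend(turn_ids)                        # learn on answers
                else:
                    labs.extend([IGNORE_INDEX] * len(turn_ids))  # mask user/system
            input_ids.append(torch.tensor(ids[: self.max_length]))
            labels.append(torch.tensor(labs[: self.max_length]))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True,
            padding_value=self.tokenizer.pad_token_id)
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=IGNORE_INDEX)
        return {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": input_ids.ne(self.tokenizer.pad_token_id).long(),
        }
```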
A quick benchmark (ROUGE-L, plus BERTScore computed with XLM-RoBERTa-large) against the EuroBlocks instruction set with Chinese and English removed (this checkpoint did not see that part of the set); a scoring sketch follows the numbers:
Qwen2.5-32B-Instruct: ROUGE-L=0.1842, BERTScore=0.7236
gemma-3-27b-it: ROUGE-L=0.1798, BERTScore=0.7049
EuroLLM-22B-Instruct-Preview: ROUGE-L=0.0719, BERTScore=0.6904
checkpoint-final (Tilde-Open): ROUGE-L=0.2792, BERTScore=0.7502
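A hedged sketch of how numbers like these can be produced with the `evaluate` library; this is not the exact harness, and the prediction/reference lists are placeholders:

```python
import evaluate

predictions = ["model output ..."]   # placeholder: generated responses
references  = ["gold response ..."]  # placeholder: reference responses

# ROUGE-L (aggregated F-measure)
rouge_l = evaluate.load("rouge").compute(
    predictions=predictions, references=references)["rougeL"]

# BERTScore with xlm-roberta-large as the scoring model
bs = evaluate.load("bertscore").compute(
    predictions=predictions, references=references,
    model_type="xlm-roberta-large")
bert_f1 = sum(bs["f1"]) / len(bs["f1"])

print(f"ROUGE-L={rouge_l:.4f}, BERTScore={bert_f1:.4f}")
```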
My trained TildeOpen checkpoint is, by comparison, less verbose: roughly 1:1 in length against the reference, similar to Gemma.
Qwen and EuroLLM, however, run at roughly 1:3; in production such output is undesirable, since it burns compute and time (a rough sketch of the ratio measure follows).
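```python
# Rough sketch of the length-ratio measure implied above, with whitespace
# tokenization as a simple stand-in for the model tokenizer.
def verbosity_ratio(hypotheses, references):
    ratios = [len(h.split()) / max(len(r.split()), 1)
              for h, r in zip(hypotheses, references)]
    return sum(ratios) / len(ratios)

# ~1.0: output length matches the reference; ~3.0: three times as verbose.
```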
On a side note, EuroLLM-22B-Instruct-Preview's poor performance was a bit of a surprise here, since the benchmark was run on a set that partly overlapped its training data; it was the only model that had seen the benchmark data during training. It should have excelled here, yet it failed.
I will run a bigger benchmark set and publish the results.
I trained on a shuffled set with a static seed, and tried to craft a really balanced training set so as not to damage the base model (a mixing sketch follows the stats below):
📊 Individual Dataset Statistics:

| Dataset | Examples | Sampling | Avg Tokens | Total Tokens |
|---|---|---|---|---|
| UltraChat | 41,573 | 20% | 1437 | 59,724,145 |
| Alpaca | 10,352 | 20% | 246 | 2,545,857 |
| Latvian Wikipedia QA | 22,262 | 20% (filtered) | 748 | 16,654,959 |
| EuroBlocks SFT Synthetic | 85,071 | 20% | 689 | 58,622,426 |
| RAGBench (all subsets) | 21,982 | 30% | 1199 | 26,359,165 |

📈 [SUMMARY] Token Count Breakdown:
Combined total (all datasets): 163,906,552 tokens (163.91M)
Eval split removed (0.5%): 819,533 tokens (0.82M)
Training tokens (after eval): 163,087,019 tokens
🎯 TRAINING SET: 163.09M tokens

🌍 [LANGUAGE DISTRIBUTION] Token Breakdown by Language:

| Language | Tokens | Share |
|---|---|---|
| EN (English) | 117,666,791 | 71.79% |
| LV (Latvian) | 16,654,959 | 10.16% |
| ZH (Chinese) | 10,098,948 | 6.16% |
| PT (Portuguese, Portugal) | 3,009,301 | 1.84% |
| IT (Italian) | 2,332,476 | 1.42% |
| ES (Spanish) | 2,057,867 | 1.26% |
| HI (Hindi) | 1,968,372 | 1.20% |
| FR (French) | 1,842,637 | 1.12% |
| DE (German) | 1,367,261 | 0.83% |
| NL (Dutch) | 1,059,474 | 0.65% |
| JA (Japanese) | 886,780 | 0.54% |
| UK (Ukrainian) | 780,526 | 0.48% |
| SV (Swedish) | 664,574 | 0.41% |
| HU (Hungarian) | 561,298 | 0.34% |
| PL (Polish) | 560,617 | 0.34% |
| CS (Czech) | 550,494 | 0.34% |
| RU (Russian) | 544,199 | 0.33% |
| KO (Korean) | 339,178 | 0.21% |
| RO (Romanian) | 324,630 | 0.20% |
| FI (Finnish) | 215,229 | 0.13% |
| EL (Greek) | 140,537 | 0.09% |
| SK (Slovak) | 115,015 | 0.07% |
| NO (Norwegian) | 110,252 | 0.07% |
| SL (Slovenian) | 32,326 | 0.02% |
| ET (Estonian) | 22,799 | 0.01% |

Total languages: 25
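A sketch of the kind of deterministic mixing described above, using the `datasets` library. The hub ids, seed value, and formatters are illustrative assumptions (two entries shown for brevity); the sampling fractions mirror the table:

```python
from datasets import load_dataset, concatenate_datasets

SEED = 42  # "static seed"; the actual value is an assumption

def chatml(messages):
    # Flatten a message list into ChatML text.
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
                   for m in messages)

recipe = [
    ("HuggingFaceH4/ultrachat_200k", "train_sft", 0.20,
     lambda ex: {"text": chatml(ex["messages"])}),
    ("tatsu-lab/alpaca", "train", 0.20,
     lambda ex: {"text": chatml([
         {"role": "user", "content": (ex["instruction"] + "\n" + ex["input"]).strip()},
         {"role": "assistant", "content": ex["output"]},
     ])}),
]

parts = []
for name, split, fraction, formatter in recipe:
    ds = load_dataset(name, split=split)
    n = int(len(ds) * fraction)
    sampled = ds.shuffle(seed=SEED).select(range(n))  # deterministic sample
    parts.append(sampled.map(formatter, remove_columns=sampled.column_names))

mixed = concatenate_datasets(parts).shuffle(seed=SEED)
splits = mixed.train_test_split(test_size=0.005, seed=SEED)  # 0.5% eval split
train_ds, eval_ds = splits["train"], splits["test"]
```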
Thanks for your effort!
Any suggestions?