Commit History

Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)
cf3eebc
verified

almaghrabima committed on

Promote v0.3.1 to main (4-domain SOTA at 100k vocab)
f6c72b6
verified

almaghrabima committed on

Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2%
6479992
verified

almaghrabima committed on

Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2%
e4a23ce
verified

almaghrabima committed on

Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2%
3afc36b
verified

almaghrabima committed on

Update README.md
9804b73
verified

almaghrabima committed on

Link to public SARFTokenizer-benchmark-eval dataset for reproducibility
b776880
verified

almaghrabima committed on

README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships
5d4f9d7
verified

almaghrabima committed on

README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships
15b06b6
verified

almaghrabima committed on

Update README.md
75ff490
verified

almaghrabima committed on

Update README.md
6464c31
verified

almaghrabima committed on

Update README: Colab-ready code, benchmark, troubleshooting
579b42f
verified

almaghrabima committed on

v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab
d33a32a
verified

almaghrabima committed on

v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab
9608763
verified

almaghrabima committed on

v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab
1c23cd4
verified

almaghrabima committed on

v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab
edcc327
verified

almaghrabima committed on

v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab
92c3237
verified

almaghrabima committed on

v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab
3c1bd73
verified

almaghrabima committed on

Drop Fanar-2-Diwan from benchmark (unfair Arabic-maximalist peer)
85df039
verified

almaghrabima committed on

Drop Fanar-2-Diwan from benchmark (unfair Arabic-maximalist peer)
5785206
verified

almaghrabima committed on

Add benchmark_results.json: v0.1 benchmark vs 11 tokenizers
73dc8b5
verified

almaghrabima committed on

Add BENCHMARK.md: v0.1 benchmark vs 11 tokenizers
9b3c7ca
verified

almaghrabima committed on

Clean up stale files from prior repo state
79f3761
verified

almaghrabima committed on

v0.1: initial SARFTokenizer release
54c90ec
verified

almaghrabima committed on

Delete morfessor_models
f2c4997
verified

almaghrabima committed on

Delete lexicons
16e88db
verified

almaghrabima committed on

Upload lexicons/lexicon_en.txt with huggingface_hub
eef9b5a
verified

almaghrabima committed on

Upload lexicons/lexicon_ar.txt with huggingface_hub
453d9e0
verified

almaghrabima committed on

Upload morfessor_models/morf_map_reverse.json with huggingface_hub
2772bff
verified

almaghrabima committed on

Upload morfessor_models/morf_map.json with huggingface_hub
5fee9e3
verified

almaghrabima committed on

Upload morfessor_models/morfessor_en.bin with huggingface_hub
77f39b7
verified

almaghrabima committed on

Upload morfessor_models/morfessor_ar.bin with huggingface_hub
7309795
verified

almaghrabima committed on

Update README.md
c94c9a8
verified

almaghrabima committed on

Upload benchmark_pypi_full.py with huggingface_hub
4e2b74e
verified

almaghrabima committed on

Delete benchmark_pypi.py with huggingface_hub
e86ee8e
verified

almaghrabima committed on

Upload tokenizer_config.json with huggingface_hub
dfd4850
verified

almaghrabima committed on

Upload special_tokens_map.json with huggingface_hub
f6b77da
verified

almaghrabima committed on

Upload tokenizer.json with huggingface_hub
a5801cf
verified

almaghrabima committed on

Update README.md
344b20c
verified

almaghrabima committed on

Update README.md
2d1f910
verified

almaghrabima committed on

Update README.md
d3c23ec
verified

almaghrabima committed on

Update benchmark results with new tokenizers (Falcon-H1, ALLaM, Hala, Mistral)
55db6a1
verified

almaghrabima committed on

Upload benchmark_parallel_results.json with huggingface_hub
a88f8a7
verified

almaghrabima committed on

Upload test_comprehensive_results.json with huggingface_hub
1e8911f
verified

almaghrabima committed on

Upload BENCHMARK_REPORT.md with huggingface_hub
57f81bb
verified

almaghrabima committed on

Upload test_comprehensive_million.py with huggingface_hub
c24518d
verified

almaghrabima committed on

Upload benchmark_tiktoken_style.py with huggingface_hub
9770614
verified

almaghrabima committed on

Upload README.md with huggingface_hub
e180732
verified

almaghrabima committed on

Upload README.md with huggingface_hub
37ba042
verified

almaghrabima committed on

Upload README.md with huggingface_hub
6aa17b2
verified

almaghrabima committed on