Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) cf3eebc verified almaghrabima committed 4 days ago
Promote v0.3.1 to main (4-domain SOTA at 100k vocab) f6c72b6 verified almaghrabima committed 4 days ago
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% 6479992 verified almaghrabima committed 6 days ago
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% e4a23ce verified almaghrabima committed 6 days ago
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% 3afc36b verified almaghrabima committed 6 days ago
Link to public SARFTokenizer-benchmark-eval dataset for reproducibility b776880 verified almaghrabima committed 7 days ago
README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships 5d4f9d7 verified almaghrabima committed 7 days ago
README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships 15b06b6 verified almaghrabima committed 7 days ago
Update README: Colab-ready code, benchmark, troubleshooting 579b42f verified almaghrabima committed 7 days ago
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab d33a32a verified almaghrabima committed 7 days ago
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 9608763 verified almaghrabima committed 7 days ago
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 1c23cd4 verified almaghrabima committed 7 days ago
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab edcc327 verified almaghrabima committed 7 days ago
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 92c3237 verified almaghrabima committed 7 days ago
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 3c1bd73 verified almaghrabima committed 7 days ago
Drop Fanar-2-Diwan from benchmark (unfair Arabic-maximalist peer) 85df039 verified almaghrabima committed 8 days ago
Drop Fanar-2-Diwan from benchmark (unfair Arabic-maximalist peer) 5785206 verified almaghrabima committed 8 days ago
Add benchmark_results.json: v0.1 benchmark vs 11 tokenizers 73dc8b5 verified almaghrabima committed 8 days ago
Add BENCHMARK.md: v0.1 benchmark vs 11 tokenizers 9b3c7ca verified almaghrabima committed 8 days ago
Upload morfessor_models/morf_map_reverse.json with huggingface_hub 2772bff verified almaghrabima committed on Feb 8
Upload morfessor_models/morf_map.json with huggingface_hub 5fee9e3 verified almaghrabima committed on Feb 8
Upload morfessor_models/morfessor_en.bin with huggingface_hub 77f39b7 verified almaghrabima committed on Feb 8
Upload morfessor_models/morfessor_ar.bin with huggingface_hub 7309795 verified almaghrabima committed on Feb 8
Update benchmark results with new tokenizers (Falcon-H1, ALLaM, Hala, Mistral) 55db6a1 verified almaghrabima committed on Feb 4
Upload benchmark_parallel_results.json with huggingface_hub a88f8a7 verified almaghrabima committed on Feb 4
Upload test_comprehensive_results.json with huggingface_hub 1e8911f verified almaghrabima committed on Feb 4
Upload test_comprehensive_million.py with huggingface_hub c24518d verified almaghrabima committed on Feb 4
Upload benchmark_tiktoken_style.py with huggingface_hub 9770614 verified almaghrabima committed on Feb 4