almaghrabima/SARFTokenizer

@@ -5,10 +5,11 @@ High-performance Arabic tokenizer with morphology and parity awareness. Built wi
 ## Features
 - **Arabic-Optimized**: Designed specifically for Arabic and morphologically-rich languages
-- **Fast**: Rust core with Python bindings (~30,000 operations/sec)
-- **Accurate**: 100% roundtrip accuracy on 300,000+ test samples
 - **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
 - **Unicode Support**: Full support for Arabic diacritics and mixed scripts
 ## Installation
@@ -72,6 +73,52 @@ Comparison with state-of-the-art tokenizers (5 runs, 5000 samples each).
 - SARF tokenizers have the lowest Arabic fertility (1.7 tokens/word vs 2.8+ for others)
 - Morpheme-aware encoding significantly improves Arabic tokenization efficiency
 ### Reproduce Benchmark Results
 Datasets:
@@ -82,9 +129,14 @@ Datasets:
 # Install dependencies
 pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub
-# Download and run benchmark
-wget https://huggingface.co/almaghrabima/SARFTokenizer/raw/main/benchmark_pypi.py
 python benchmark_pypi.py
 ```
 ## Requirements

 ## Features
 - **Arabic-Optimized**: Designed specifically for Arabic and morphologically-rich languages
+- **Fast**: Rust core with Python bindings (up to 43,000+ texts/sec with parallel processing)
+- **Accurate**: 100% roundtrip accuracy on 999,999 test samples
 - **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
 - **Unicode Support**: Full support for Arabic diacritics and mixed scripts
+- **Parallel Processing**: Scales across threads (4.4x speedup from 1 to 8 threads)
 ## Installation
 - SARF tokenizers have the lowest Arabic fertility (1.7 tokens/word vs 2.8+ for others)
 - Morpheme-aware encoding significantly improves Arabic tokenization efficiency
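Fertility here means tokens per word. As a rough illustration of how such a figure can be computed, the sketch below divides the total token count by the number of whitespace-delimited words. The whitespace word definition and the `toy_tokenize` stand-in are assumptions made for this example; they are not the SARFTokenizer API or the benchmark's own method.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def toy_tokenize(text: str, max_len: int = 4) -> List[str]:
    # Hypothetical stand-in tokenizer: cut every word into chunks of at
    # most max_len characters. Swap in a real tokenizer's encode call here.
    return [
        word[i:i + max_len]
        for word in text.split()
        for i in range(0, len(word), max_len)
    ]


samples = ["والكتاب على الطاولة", "the book is on the table"]
print(f"fertility: {fertility(samples, toy_tokenize):.2f} tokens/word")
```

A lower value means fewer tokens are spent per word, which is the sense in which 1.7 compares favorably with 2.8+.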
+### Throughput Benchmark (1M samples, 680 MB)
+Comparison with tiktoken and HF tokenizers on 1,000,000 documents:
+| Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
+|-----------|----------|-----------|-----------|-----------|
+| **SARFTokenizer** | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | **13.72 MB/s** |
+| tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s |
+| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s |
+| HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s |
+**Key findings:**
+- **SARFTokenizer outperforms tiktoken at 8 threads** (13.72 MB/s vs 8.47-10.60 MB/s); tiktoken is faster at 1 to 4 threads, and HF tokenizers is fastest at 8 threads (17.47 MB/s)
+- **Parallel scaling**: 4.4x speedup from 1 to 8 threads (3.14 to 13.72 MB/s)
+- tiktoken throughput peaks at 4 threads and drops at 8 threads
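The scaling figures can be checked directly against the throughput table. The sketch below copies the table values and derives the speedup over one thread, the parallel efficiency (speedup divided by thread count), and the thread count at which each tokenizer peaks. The function names are illustrative and not part of any benchmark script.

```python
# Throughput in MB/s, copied from the table above, keyed by thread count.
THROUGHPUT_MB_S = {
    "SARFTokenizer": {1: 3.14, 2: 5.57, 4: 9.00, 8: 13.72},
    "tiktoken (o200k)": {1: 6.23, 2: 10.55, 4: 14.90, 8: 10.60},
    "tiktoken (cl100k)": {1: 7.99, 2: 11.68, 4: 12.02, 8: 8.47},
    "HF tokenizers": {1: 1.88, 2: 3.97, 4: 9.27, 8: 17.47},
}


def speedup(rates: dict, threads: int) -> float:
    """Throughput at `threads` relative to the single-thread run."""
    return rates[threads] / rates[1]


def parallel_efficiency(rates: dict, threads: int) -> float:
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return speedup(rates, threads) / threads


def peak_threads(rates: dict) -> int:
    """Thread count with the highest measured throughput."""
    return max(rates, key=rates.get)


for name, rates in THROUGHPUT_MB_S.items():
    print(
        f"{name}: {speedup(rates, 8):.2f}x at 8 threads, "
        f"efficiency {parallel_efficiency(rates, 8):.0%}, "
        f"peak at {peak_threads(rates)} threads"
    )
```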
+### Million-Scale Roundtrip Accuracy
+Tested on 999,999 samples from real-world data:
+| Category | Samples | Success | Accuracy |
+|----------|---------|---------|----------|
+| Arabic | 333,333 | 333,333 | **100.00%** |
+| English | 333,333 | 333,333 | **100.00%** |
+| Mixed | 333,333 | 333,333 | **100.00%** |
+| **TOTAL** | **999,999** | **999,999** | **100.00%** |
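A roundtrip check of this kind encodes each sample, decodes the result, and counts how many samples come back byte-for-byte identical. The sketch below shows the idea with generic `encode`/`decode` callables. The UTF-8 byte tokenizer is only a stand-in so the example runs without extra dependencies; it is not the SARFTokenizer API, and `test_comprehensive_million.py` may be structured differently.

```python
from typing import Callable, Iterable, List, Tuple


def roundtrip_accuracy(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> Tuple[float, List[str]]:
    """Return the share of texts where decode(encode(text)) == text,
    together with the texts that failed."""
    failures: List[str] = []
    total = 0
    for text in texts:
        total += 1
        if decode(encode(text)) != text:
            failures.append(text)
    accuracy = (total - len(failures)) / total if total else 0.0
    return accuracy, failures


# Stand-in tokenizer (assumption for this example): raw UTF-8 bytes as ids.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


samples = ["مرحبا بالعالم", "hello world", "السلام عليكم, hello!"]
accuracy, failures = roundtrip_accuracy(samples, byte_encode, byte_decode)
print(f"{accuracy:.2%} roundtrip accuracy, {len(failures)} failures")
```

Returning the failing texts alongside the ratio makes it easier to inspect which inputs a lossy tokenizer alters.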
+### Edge Case Tests (58/58 Passed)
+All 12 edge case categories pass with 100% success:
+| Category | Tests | Status |
+|----------|-------|--------|
+| Unicode Normalization | 6 | PASS |
+| Zero-Width Characters | 6 | PASS |
+| Unicode Whitespace | 6 | PASS |
+| Grapheme Clusters | 6 | PASS |
+| Apostrophes | 4 | PASS |
+| Dashes | 4 | PASS |
+| Decimal Separators | 3 | PASS |
+| URLs/Emails | 4 | PASS |
+| File Paths | 3 | PASS |
+| Code Identifiers | 4 | PASS |
+| Mixed Scripts/RTL | 6 | PASS |
+| Robustness | 6 | PASS |
 ### Reproduce Benchmark Results
 Datasets:
 # Install dependencies
 pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub
+# Run parity benchmark (vs GPT-4o, Gemma, etc.)
 python benchmark_pypi.py
+# Run throughput benchmark (vs tiktoken)
+python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8
+# Run comprehensive tests (roundtrip + edge cases)
+python test_comprehensive_million.py --samples 1000000 --report
 ```
 ## Requirements