TJ-1.0
TJ-1.0 is the flagship instruction-tuned language model of the TajikGPT platform, developed by SoulLab. It is the first commercially deployed large language model with native support for the Tajik language, offering a balanced combination of quality, speed, and multilingual capability.
Note: TJ-1.0 is available via API only and is not available for download or local deployment.
Model Details
| Property | Value |
|---|---|
| Developer | SoulLab |
| Model type | Instruction-tuned Causal Language Model |
| Architecture | Decoder-only Transformer with Grouped Query Attention (GQA) |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| Tokenizer | Byte-Pair Encoding (BPE), extended vocabulary for Tajik Cyrillic & Latin |
| Fine-tuning | Supervised Fine-Tuning (SFT) + RLHF (Reinforcement Learning from Human Feedback) |
| Context window | 128,000 tokens |
| Max output tokens | 8,192 tokens |
| Knowledge cutoff | Q3 2024 |
| Languages | Tajik (tg), Russian (ru), English (en), plus 50+ additional languages |
| License | Proprietary — TajikGPT Terms |
| Training hardware | NVIDIA A100 80GB, bf16 precision, PyTorch |
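Grouped Query Attention mainly pays off at inference time: the key/value cache scales with the number of KV heads rather than the number of query heads, which matters at a 128K-token context. A rough back-of-the-envelope sketch, using hypothetical layer and head counts (the card does not publish TJ-1.0's internal dimensions):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: keys + values, one entry per layer,
    KV head, head dimension, and cached token (bf16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dimensions, for illustration only.
SEQ, LAYERS, HEAD_DIM = 128_000, 32, 128
mha = kv_cache_bytes(SEQ, LAYERS, n_kv_heads=32, head_dim=HEAD_DIM)  # standard multi-head
gqa = kv_cache_bytes(SEQ, LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)   # GQA, 4 query heads per group

print(f"MHA cache: {mha / 2**30:.1f} GiB, GQA cache: {gqa / 2**30:.1f} GiB")
# → MHA cache: 62.5 GiB, GQA cache: 15.6 GiB
```

With these (assumed) dimensions, grouping 32 query heads into 8 KV groups shrinks the cache fourfold, which is what makes a 128K context practical on A100-class hardware.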
Training Data
TJ-1.0 was trained on a curated multilingual corpus with a strong emphasis on Tajik-language content — the first dataset of this scale built specifically for Tajik NLP.
| Source Category | Description | Approx. Share |
|---|---|---|
| Tajik Web Corpus | News, blogs, forums, government portals in Tajik (Cyrillic & Latin) | 28% |
| Tajik Literature & Culture | Books, poetry, historical texts, folklore | 12% |
| Tajik Legislation | Laws, decrees, official government documents | 8% |
| Multilingual Web | High-quality filtered web data (Russian, English, and others) | 32% |
| Instruction & Dialogue | Human-written and synthetic instruction-following data | 14% |
| Code | Source code across major programming languages | 6% |
- Total corpus size: ~2 trillion tokens
- Data freshness: content up to Q3 2024
- Processing: deduplication, quality filtering, language identification, and PII removal applied to all sources
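The processing steps above can be sketched as a small pipeline. The filters below (hash-based exact deduplication, a crude Cyrillic-ratio language check, regex email scrubbing) are illustrative stand-ins, not SoulLab's production pipeline:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub_pii(text):
    """Toy PII removal: mask email addresses."""
    return EMAIL_RE.sub("[EMAIL]", text)

def dedupe(docs):
    """Exact deduplication via content hashing."""
    seen, out = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(doc)
    return out

def looks_tajik(text):
    """Crude language ID: majority of letters are Cyrillic. The Cyrillic
    block covers Tajik-specific letters too (ӣ, ӯ, ҷ, ҳ, қ, ғ)."""
    letters = [c for c in text if c.isalpha()]
    cyr = [c for c in letters if "\u0400" <= c <= "\u04FF"]
    return bool(letters) and len(cyr) / len(letters) > 0.5

corpus = [
    "Салом! Навис ба ман: user@example.com",  # Tajik: "Hello! Write to me: ..."
    "Салом! Навис ба ман: user@example.com",  # exact duplicate
    "hello world",                            # not Tajik
]
clean = [scrub_pii(d) for d in dedupe(corpus) if looks_tajik(d)]
# clean == ["Салом! Навис ба ман: [EMAIL]"]
```

A production pipeline would use near-duplicate detection (e.g. MinHash), a trained language classifier, and broader PII coverage, but the staging is the same.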
Intended Use
Recommended Use Cases
- Multilingual chat and Q&A in Tajik, Russian, English, and 50+ other languages
- Document summarization and translation
- Creative writing, content creation and copywriting
- Education, tutoring and homework help
- Business communication and professional correspondence
- Data analysis, extraction and summarization
- Code generation and debugging
Out-of-Scope Use Cases
- Generation of illegal, harmful, or deceptive content
- Medical diagnosis or legal advice without professional oversight
- Surveillance or targeting of individuals
- Automated high-stakes decision-making without human review
- Any use violating the TajikGPT Terms of Service
Evaluation / Benchmarks
All benchmarks were evaluated using standard few-shot settings unless otherwise noted.
General Benchmarks
| Benchmark | Score | # Shots | Metric |
|---|---|---|---|
| MMLU (Massive Multitask Language Understanding) | 72.1% | 5-shot | Accuracy |
| MT-Bench (Multi-turn instruction following) | 7.1 / 10 | 0-shot | GPT-4 Judge |
| HumanEval (Code generation) | 58.3% | 0-shot | pass@1 |
| HellaSwag (Commonsense reasoning) | 81.4% | 10-shot | Accuracy |
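HumanEval's pass@1 metric is the probability that a single generated program passes all unit tests. When n samples are drawn per problem and c of them pass, the standard unbiased estimator (from the original HumanEval evaluation protocol, not specific to TJ-1.0) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the chance that at least one of k sampled programs is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill a sample of k
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the per-problem pass rate c / n.
print(pass_at_k(10, 6, 1))  # → 0.6
```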
Tajik Language Benchmarks
These are the first published benchmarks for Tajik-language LLM evaluation.
| Benchmark | Score | Description |
|---|---|---|
| TajikQA | 78.4% | Open-domain Q&A in Tajik language |
| TajikTranslate | 81.2 BLEU | Tajik ↔ Russian translation |
| TajikInstruct | 74.6% | Instruction following in Tajik |
How to Use
TJ-1.0 is available via the TajikGPT API. Install the SDK or use the REST API directly.
```bash
pip install tajikgpt
```
Python SDK
```python
from tajikgpt import TajikGPT

client = TajikGPT(api_key="sk-tj-your-key")

response = client.chat.completions.create(
    model="tj-1.0",
    messages=[
        # Russian: "You are a helpful assistant."
        {"role": "system", "content": "Ты полезный помощник."},
        # Tajik: "Explain in Tajik: what is a neural network?"
        {"role": "user", "content": "Ба забони тоҷикӣ шарҳ деҳ: нейронӣ шабака чист?"},
    ],
)
print(response.choices[0].message.content)
```
REST API
```bash
curl -X POST https://tajikgpt.com/api/tj/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tj-your-key" \
  -d '{
    "model": "tj-1.0",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```
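The same request can be built without the SDK using only Python's standard library; the endpoint and fields mirror the curl call above. The network call itself is left commented out, since sending it requires a valid API key:

```python
import json
import urllib.request

payload = {
    "model": "tj-1.0",
    "messages": [{"role": "user", "content": "Hello! What can you do?"}],
    "max_tokens": 1024,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "https://tajikgpt.com/api/tj/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer sk-tj-your-key",
    },
    method="POST",
)
# body = json.load(urllib.request.urlopen(req))  # uncomment with a real key
# print(body["choices"][0]["message"]["content"])
```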
Limitations
- Dialectal Tajik: The model performs best on standard literary Tajik (Cyrillic). Regional dialects and Latin-script Tajik may show reduced quality.
- Hallucinations: Like all LLMs, TJ-1.0 may generate plausible-sounding but factually incorrect information. Always verify critical facts.
- Knowledge cutoff: The model has no knowledge of events after Q3 2024.
- Mathematical reasoning: Complex multi-step calculations may produce errors. Use dedicated tools for precise math.
- Low-resource languages: While 50+ languages are supported, quality varies significantly for lower-resource languages.
- Long context degradation: Performance on tasks requiring reasoning over very long documents (>64K tokens) may degrade.
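One practical mitigation for the long-context caveat is map-reduce summarization: split the document into chunks well under the window, summarize each chunk, then summarize the summaries. A minimal splitter, approximating token count by whitespace word count (the real BPE tokenizer will differ, so leave generous headroom):

```python
def chunk_words(text: str, max_tokens: int = 4000) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated
    words. Word count only approximates the BPE token count."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

doc = "саҳифа " * 10_000           # stand-in for a long Tajik document
chunks = chunk_words(doc, max_tokens=4000)
print(len(chunks))                 # → 3 (4000 + 4000 + 2000 words)
```

Each chunk can then be sent to the API in its own request, keeping every call comfortably inside the reliable part of the context window.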
Responsible AI & Safety
- RLHF: The model was fine-tuned using human preference data to align with helpful, harmless, and honest behavior.
- Red Teaming: Internal adversarial testing was conducted to identify failure modes in Tajik, Russian, and English.
- Content Filtering: The TajikGPT API includes a multi-layer content filtering system that operates independently of the model.
- Bias: Training data reflects the diversity of web content and may contain societal biases. Users should apply critical judgment when using outputs for sensitive decisions.
- Privacy: The training data was processed with PII (personally identifiable information) removal pipelines.
Model Family
| Model | Context | Max Output | Specialty | Tier |
|---|---|---|---|---|
| TJ-1.0 Mini | 128K | 4,096 | Fast & lightweight | Free |
| TJ-1.0 | 128K | 8,192 | Balanced — general purpose | Free |
| TJ-1.0 Pro | 128K | 16,384 | Advanced + Vision | Plus |
| TJ-1.0 Ultra | 128K | 32,768 | Top performance | Plus |
| TJ-Coder | 131K | 32,768 | Code specialist | Free |
| TJ-Image 1.0 | — | — | Text-to-Image | Free |
Links
- Platform: tajikgpt.com
- API Docs: tajikgpt.com/docs
- Python SDK: pypi.org/project/tajikgpt
- Live Demo: HuggingFace Space
- Developer: SoulLab
Citation
If you use TJ-1.0 in research or build products on top of it, please cite:
```bibtex
@misc{tajikgpt2024tj10,
  title        = {TJ-1.0: A Multilingual Large Language Model with Native Tajik Language Support},
  author       = {SoulLab},
  year         = {2024},
  howpublished = {\url{https://tajikgpt.com}},
  note         = {Proprietary model, available via API at https://tajikgpt.com}
}
```
Built with care for Tajikistan and Central Asia. Developed by SoulLab.