TJ-1.0

TJ-1.0 is the flagship instruction-tuned language model of the TajikGPT platform, developed by SoulLab. It is the first commercially deployed large language model with native support for the Tajik language, offering a balanced combination of quality, speed, and multilingual capability.

Note: TJ-1.0 is available only through the API. It cannot be downloaded or deployed locally.

Model Details

Property	Value
Developer	SoulLab
Model type	Instruction-tuned Causal Language Model
Architecture	Decoder-only Transformer with Grouped Query Attention (GQA)
Positional Encoding	Rotary Position Embedding (RoPE)
Tokenizer	Byte-Pair Encoding (BPE), extended vocabulary for Tajik Cyrillic & Latin
Fine-tuning	Supervised Fine-Tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)
Context window	128,000 tokens
Max output tokens	8,192 tokens
Knowledge cutoff	Q3 2024
Languages	Tajik (tg), Russian (ru), English (en), and 50+ other languages
License	Proprietary — TajikGPT Terms
Training hardware	NVIDIA A100 80GB, bf16 precision, PyTorch

Training Data

TJ-1.0 was trained on a curated multilingual corpus with a strong emphasis on Tajik-language content — the first dataset of this scale built specifically for Tajik NLP.

Source Category	Description	Approx. Share
Tajik Web Corpus	News, blogs, forums, government portals in Tajik (Cyrillic & Latin)	28%
Tajik Literature & Culture	Books, poetry, historical texts, folklore	12%
Tajik Legislation	Laws, decrees, official government documents	8%
Multilingual Web	High-quality filtered web data (Russian, English, and others)	32%
Instruction & Dialogue	Human-written and synthetic instruction-following data	14%
Code	Source code across major programming languages	6%

Total corpus size: ~2 trillion tokens
Data freshness: Content up to Q3 2024
Processing: Deduplication, quality filtering, language identification, and PII removal were applied to all sources.
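
To make the shares in the table above concrete, the following sketch converts each percentage into an approximate token count, taking the stated total of about 2 trillion tokens at face value. The figures it prints are rough by construction, since both the total and the shares are approximate. The "Tajik-specific" sum covers only the three categories labeled Tajik; other categories may also contain Tajik text.

```python
# Approximate shares copied from the Training Data table.
SHARES_PERCENT = {
    "Tajik Web Corpus": 28,
    "Tajik Literature & Culture": 12,
    "Tajik Legislation": 8,
    "Multilingual Web": 32,
    "Instruction & Dialogue": 14,
    "Code": 6,
}

# "~2 trillion tokens" from the model card; treated as exact for the arithmetic.
TOTAL_TOKENS = 2_000_000_000_000


def tokens_per_source(shares_percent, total_tokens):
    """Convert percentage shares into approximate token counts."""
    return {name: total_tokens * pct // 100 for name, pct in shares_percent.items()}


counts = tokens_per_source(SHARES_PERCENT, TOTAL_TOKENS)

# Sum only the categories explicitly labeled as Tajik.
tajik_total = sum(n for name, n in counts.items() if name.startswith("Tajik"))

for name, n in counts.items():
    print(f"{name}: ~{n / 1e9:.0f}B tokens")
print(f"Tajik-labeled sources combined: ~{tajik_total / 1e9:.0f}B tokens")
```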

Intended Use

Recommended Use Cases

Multilingual chat and Q&A in Tajik, Russian, English, and 50+ other languages
Document summarization and translation
Creative writing, content creation and copywriting
Education, tutoring and homework help
Business communication and professional correspondence
Data analysis, extraction and summarization
Code generation and debugging

Out-of-Scope Use Cases

Generation of illegal, harmful, or deceptive content
Medical diagnosis or legal advice without professional oversight
Surveillance or targeting of individuals
Automated high-stakes decision-making without human review
Any use violating the TajikGPT Terms of Service

Evaluation / Benchmarks

Benchmarks were evaluated in standard few-shot or zero-shot settings. The shot count for each general benchmark is listed in the table below.

General Benchmarks

Benchmark	Score	# Shots	Metric
MMLU (Massive Multitask Language Understanding)	72.1%	5-shot	Accuracy
MT-Bench (Multi-turn instruction following)	7.1 / 10	0-shot	GPT-4 Judge
HumanEval (Code generation)	58.3%	0-shot	pass@1
HellaSwag (Commonsense reasoning)	81.4%	10-shot	Accuracy
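
For readers unfamiliar with the pass@1 metric reported for HumanEval, the sketch below shows the unbiased pass@k estimator commonly used with that benchmark: with n samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. This card does not state how many samples were drawn per problem or which estimator was used, so this is an illustration of the metric, not a description of the evaluation setup. The per-problem numbers in the example are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results, not real TJ-1.0 data.
example = [(10, 10), (10, 5), (10, 0), (10, 1)]
print(f"pass@1 = {benchmark_pass_at_k(example, 1):.3f}")
```

With k = 1 the estimator reduces to the fraction of passing samples per problem, c / n.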

Tajik Language Benchmarks

These are the first published benchmarks for Tajik-language LLM evaluation.

Benchmark	Score	Description
TajikQA	78.4%	Open-domain Q&A in Tajik language
TajikTranslate	81.2 BLEU	Tajik ↔ Russian translation
TajikInstruct	74.6%	Instruction following in Tajik

How to Use

TJ-1.0 is available via the TajikGPT API. Install the SDK or use the REST API directly.

pip install tajikgpt

Python SDK

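from tajikgpt import TajikGPT

# Replace the placeholder with your own API key.
client = TajikGPT(api_key="sk-tj-your-key")

response = client.chat.completions.create(
    model="tj-1.0",
    messages=[
        # System prompt (Russian): "You are a helpful assistant."
        {"role": "system", "content": "Ты полезный помощник."},
        # User prompt (Tajik): "Explain in Tajik: what is a neural network?"
        {"role": "user", "content": "Ба забони тоҷикӣ шарҳ деҳ: шабакаи нейронӣ чист?"}
    ]
)
# Print the assistant's reply from the first returned choice.
print(response.choices[0].message.content)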

REST API

curl -X POST https://tajikgpt.com/api/tj/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tj-your-key" \
  -d '{
    "model": "tj-1.0",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
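
The same REST call can be made from Python without the SDK. The sketch below mirrors the curl example using only the standard library: the URL, headers, and payload fields are taken from that example, while the function names `build_chat_request` and `send` are introduced here for illustration. The response schema is not documented in this card, so `send` returns the parsed JSON as-is instead of assuming particular keys.

```python
import json
import urllib.request

API_URL = "https://tajikgpt.com/api/tj/chat"  # endpoint from the curl example


def build_chat_request(api_key, user_message, max_tokens=1024, temperature=0.7):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": "tj-1.0",
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        # ensure_ascii=False keeps Cyrillic text readable in the request body.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON body as a dict.

    The response schema is an unknown here; inspect the result before
    relying on any specific field.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_chat_request("sk-tj-your-key", "Hello! What can you do?")
    print(req.get_method(), req.full_url)
    # result = send(req)  # uncomment once a real API key is set
```

Separating request construction from sending makes the payload easy to inspect or log before any network call is made.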

Limitations

Dialectal Tajik: The model performs best on standard literary Tajik (Cyrillic). Regional dialects and Latin-script Tajik may show reduced quality.
Hallucinations: Like all LLMs, TJ-1.0 may generate plausible-sounding but factually incorrect information. Always verify critical facts.
Knowledge cutoff: The model has no knowledge of events after Q3 2024.
Mathematical reasoning: Complex multi-step calculations may produce errors. Use dedicated tools for precise math.
Low-resource languages: While 50+ languages are supported, quality varies significantly for lower-resource languages.
Long context degradation: Performance on tasks requiring reasoning over very long documents (>64K tokens) may degrade.
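
One common way to work around long-context degradation is to split a long document into smaller chunks and process each separately, for example by summarizing chunk by chunk. The sketch below packs paragraphs greedily under a token budget. The TJ-1.0 tokenizer is not public, so `rough_token_count` is a crude stand-in based on whitespace-separated words, and the default budget of 32,000 is an arbitrary value chosen to sit well below the 64K mark mentioned above. Both are assumptions to adjust.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words.

    Real BPE token counts are usually higher; swap in a real counter
    if one is available.
    """
    return len(text.split())


def chunk_paragraphs(
    text: str,
    max_tokens: int = 32_000,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens each.

    A single paragraph larger than max_tokens becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty paragraphs
        size = count_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_size + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(para)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps each chunk readable on its own, at the cost of uneven chunk sizes.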

Responsible AI & Safety

RLHF: The model was fine-tuned using human preference data to align with helpful, harmless, and honest behavior.
Red Teaming: Internal adversarial testing was conducted to identify failure modes in Tajik, Russian, and English.
Content Filtering: The TajikGPT API includes a multi-layer content filtering system that operates independently of the model.
Bias: Training data reflects the diversity of web content and may contain societal biases. Users should apply critical judgment when using outputs for sensitive decisions.
Privacy: The training data was processed with PII (personally identifiable information) removal pipelines.
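
The card does not describe how the PII removal pipeline works. As a rough illustration of the general idea only, the sketch below redacts email addresses and phone-like digit sequences with regular expressions. The patterns and placeholder strings are assumptions made for this example. Production pipelines typically cover many more PII types and use more robust detection, and simple patterns like these can produce false positives on other long digit sequences.

```python
import re

# Illustrative patterns only; not the pipeline used for TJ-1.0.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact: user@example.com or +992 37 123-45-67."
print(redact_pii(sample))
```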

Model Family

Model	Context	Max Output	Specialty	Tier
TJ-1.0 Mini	128K	4,096	Fast & lightweight	Free
TJ-1.0	128K	8,192	Balanced — general purpose	Free
TJ-1.0 Pro	128K	16,384	Advanced + Vision	Plus
TJ-1.0 Ultra	128K	32,768	Top performance	Plus
TJ-Coder	131K	32,768	Code specialist	Free
TJ-Image 1.0	—	—	Text-to-Image	Free
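
As an illustration of how the table above might be used programmatically, the sketch below encodes the text models and picks the one with the smallest output limit that still covers a required length. The `ModelInfo` type and `smallest_fit` helper are hypothetical, the names are display names from the table and not confirmed API identifiers, and the selection ignores each model's specialty.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelInfo:
    name: str        # display name from the table, not necessarily an API id
    max_output: int  # maximum output tokens
    tier: str        # "Free" or "Plus"


# Text models from the Model Family table (TJ-Image 1.0 has no token limits).
TEXT_MODELS: List[ModelInfo] = [
    ModelInfo("TJ-1.0 Mini", 4_096, "Free"),
    ModelInfo("TJ-1.0", 8_192, "Free"),
    ModelInfo("TJ-1.0 Pro", 16_384, "Plus"),
    ModelInfo("TJ-1.0 Ultra", 32_768, "Plus"),
    ModelInfo("TJ-Coder", 32_768, "Free"),
]


def smallest_fit(required_output: int, free_only: bool = False) -> Optional[ModelInfo]:
    """Return the model with the smallest sufficient output limit, or None."""
    candidates = [
        m for m in TEXT_MODELS
        if m.max_output >= required_output and (not free_only or m.tier == "Free")
    ]
    # On ties, min() keeps the first candidate in table order.
    return min(candidates, key=lambda m: m.max_output, default=None)
```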

Citation

If you use TJ-1.0 in research or build products on top of it, please cite:

@misc{tajikgpt2024tj10,
  title        = {TJ-1.0: A Multilingual Large Language Model with Native Tajik Language Support},
  author       = {SoulLab},
  year         = {2024},
  howpublished = {\url{https://tajikgpt.com}},
  note         = {Proprietary model, available via API at https://tajikgpt.com}
}

Built with care for Tajikistan and Central Asia. Developed by SoulLab.
