TJ-1.0

TJ-1.0 is the flagship instruction-tuned language model of the TajikGPT platform, developed by SoulLab. It is the first commercially deployed large language model with native support for the Tajik language, offering a balanced combination of quality, speed, and multilingual capability.

Note: TJ-1.0 is available only through the TajikGPT API; it cannot be downloaded or deployed locally.


Model Details

| Property | Value |
| --- | --- |
| Developer | SoulLab |
| Model type | Instruction-tuned Causal Language Model |
| Architecture | Decoder-only Transformer with Grouped Query Attention (GQA) |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| Tokenizer | Byte-Pair Encoding (BPE), extended vocabulary for Tajik Cyrillic & Latin |
| Fine-tuning | Supervised Fine-Tuning (SFT) + RLHF (Reinforcement Learning from Human Feedback) |
| Context window | 128,000 tokens |
| Max output tokens | 8,192 tokens |
| Knowledge cutoff | Q3 2024 |
| Languages | Tajik (tg), Russian (ru), English (en), and 50+ languages |
| License | Proprietary (TajikGPT Terms) |
| Training hardware | NVIDIA A100 80GB, bf16 precision, PyTorch |
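
The context window and output cap listed above imply a rough budget for the prompt. The sketch below works through that arithmetic. It assumes that prompt and completion tokens share the same 128,000-token window, which this card does not state, and the function name `max_prompt_tokens` is purely illustrative.

```python
CONTEXT_WINDOW = 128_000   # "Context window" from Model Details
MAX_OUTPUT_TOKENS = 8_192  # "Max output tokens" from Model Details


def max_prompt_tokens(reserved_output: int = MAX_OUTPUT_TOKENS) -> int:
    """Tokens left for the prompt after reserving room for the reply.

    Assumption (not confirmed by the card): input and output tokens
    count against one shared context window.
    """
    if not 0 <= reserved_output <= MAX_OUTPUT_TOKENS:
        raise ValueError("reserved_output must be between 0 and 8,192")
    return CONTEXT_WINDOW - reserved_output


print(max_prompt_tokens())      # 119808
print(max_prompt_tokens(1024))  # 126976
```

If the API accounts for tokens differently, only the subtraction changes; the two constants come straight from the card.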

Training Data

TJ-1.0 was trained on a curated multilingual corpus with a strong emphasis on Tajik-language content — the first dataset of this scale built specifically for Tajik NLP.

| Source Category | Description | Approx. Share |
| --- | --- | --- |
| Tajik Web Corpus | News, blogs, forums, government portals in Tajik (Cyrillic & Latin) | 28% |
| Tajik Literature & Culture | Books, poetry, historical texts, folklore | 12% |
| Tajik Legislation | Laws, decrees, official government documents | 8% |
| Multilingual Web | High-quality filtered web data (Russian, English, and others) | 32% |
| Instruction & Dialogue | Human-written and synthetic instruction-following data | 14% |
| Code | Source code across major programming languages | 6% |

Total corpus size: ~2 trillion tokens
Data freshness: Content up to Q3 2024
Processing: Deduplication, quality filtering, language identification, and PII removal were applied to all sources.
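
To put the shares in absolute terms, the sketch below converts each percentage into an approximate token count. It treats "~2 trillion" as exactly 2,000,000,000,000, which is an assumption, so the results are order-of-magnitude estimates only. The Tajik-specific total covers just the three categories labeled "Tajik"; the card does not say how much Tajik text the other categories contain.

```python
TOTAL_TOKENS = 2_000_000_000_000  # "~2 trillion tokens", treated as exact here

# Approximate shares (percent) as listed in the Training Data table.
SHARES = {
    "Tajik Web Corpus": 28,
    "Tajik Literature & Culture": 12,
    "Tajik Legislation": 8,
    "Multilingual Web": 32,
    "Instruction & Dialogue": 14,
    "Code": 6,
}

# Sanity check: the listed shares cover the whole corpus.
assert sum(SHARES.values()) == 100


def approx_tokens(category: str) -> int:
    """Approximate token count for one source category."""
    return TOTAL_TOKENS * SHARES[category] // 100


tajik_percent = sum(v for k, v in SHARES.items() if k.startswith("Tajik"))
print(tajik_percent)  # 48

for name in SHARES:
    print(f"{name}: ~{approx_tokens(name) // 10**9}B tokens")
```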


Intended Use

Recommended Use Cases

  • Multilingual chat and Q&A in Tajik, Russian, English and 50+ languages
  • Document summarization and translation
  • Creative writing, content creation and copywriting
  • Education, tutoring and homework help
  • Business communication and professional correspondence
  • Data analysis, extraction and summarization
  • Code generation and debugging

Out-of-Scope Use Cases

  • Generation of illegal, harmful, or deceptive content
  • Medical diagnosis or legal advice without professional oversight
  • Surveillance or targeting of individuals
  • Automated high-stakes decision-making without human review
  • Any use violating the TajikGPT Terms of Service

Evaluation / Benchmarks

All benchmarks were evaluated using standard few-shot settings unless otherwise noted.

General Benchmarks

| Benchmark | Score | # Shots | Metric |
| --- | --- | --- | --- |
| MMLU (Massive Multitask Language Understanding) | 72.1% | 5-shot | Accuracy |
| MT-Bench (Multi-turn instruction following) | 7.1 / 10 | 0-shot | GPT-4 Judge |
| HumanEval (Code generation) | 58.3% | 0-shot | pass@1 |
| HellaSwag (Commonsense reasoning) | 81.4% | 10-shot | Accuracy |
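
The HumanEval score is reported as pass@1. For readers unfamiliar with the metric, the sketch below implements the standard unbiased pass@k estimator for a single problem. The card does not say how many samples per problem were drawn or how SoulLab computed the score, so the counts in the example are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the probability that a random size-k subset has no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only (not taken from the model card).
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass the unit tests; a benchmark-level score is the mean of this value over all problems.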

Tajik Language Benchmarks

These are the first published benchmarks for Tajik-language LLM evaluation.

| Benchmark | Score | Description |
| --- | --- | --- |
| TajikQA | 78.4% | Open-domain Q&A in Tajik language |
| TajikTranslate | 81.2% BLEU | Tajik ↔ Russian translation |
| TajikInstruct | 74.6% | Instruction following in Tajik |

How to Use

TJ-1.0 is available via the TajikGPT API. Install the SDK or use the REST API directly.

pip install tajikgpt

Python SDK

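from tajikgpt import TajikGPT

# Replace the placeholder with your own API key.
client = TajikGPT(api_key="sk-tj-your-key")

response = client.chat.completions.create(
    model="tj-1.0",
    messages=[
        # System prompt (Russian): "You are a helpful assistant."
        {"role": "system", "content": "Ты полезный помощник."},
        # User prompt (Tajik): "Explain in Tajik: what is a neural network?"
        {"role": "user", "content": "Ба забони тоҷикӣ шарҳ деҳ: шабакаи нейронӣ чист?"}
    ]
)
# Print the assistant's reply from the first returned choice.
print(response.choices[0].message.content)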

REST API

curl -X POST https://tajikgpt.com/api/tj/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tj-your-key" \
  -d '{
    "model": "tj-1.0",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
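
For callers who want the REST call without the SDK or curl, the sketch below builds the same request with Python's standard library. The URL, headers, and payload fields are copied from the curl example; the helper names `build_chat_request` and `send` are illustrative. The response schema is not documented in this card, so `send` only decodes the JSON body and makes no assumptions about its fields.

```python
import json
import urllib.request

API_URL = "https://tajikgpt.com/api/tj/chat"  # endpoint from the curl example


def build_chat_request(api_key: str, prompt: str,
                       max_tokens: int = 1024,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    payload = {
        "model": "tj-1.0",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body (schema not documented here)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_chat_request("sk-tj-your-key", "Hello! What can you do?")
print(req.get_method(), req.full_url)
```

Calling `send(req)` with a valid key performs the network request. Keeping request construction separate from sending makes the payload easy to inspect or unit-test offline.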

Limitations

  1. Dialectal Tajik: The model performs best on standard literary Tajik (Cyrillic). Regional dialects and Latin-script Tajik may show reduced quality.
  2. Hallucinations: Like all LLMs, TJ-1.0 may generate plausible-sounding but factually incorrect information. Always verify critical facts.
  3. Knowledge cutoff: The model has no knowledge of events after Q3 2024.
  4. Mathematical reasoning: Complex multi-step calculations may produce errors. Use dedicated tools for precise math.
  5. Low-resource languages: While 50+ languages are supported, quality varies significantly for lower-resource languages.
  6. Long context degradation: Performance on tasks requiring reasoning over very long documents (>64K tokens) may degrade.
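
One way to work around the long-context limitation is to split a long document into pieces that stay under the 64K-token mark and process them separately. The sketch below groups paragraphs greedily. The word-count function is only a crude stand-in for the real tokenizer, which is not publicly available, so in practice a conservative limit or a better counter would be needed; all names here are illustrative.

```python
from typing import Callable, List

SOFT_LIMIT = 64_000  # tokens; threshold named in limitation 6


def rough_token_count(text: str) -> int:
    """Crude stand-in for the real tokenizer: whitespace-separated words."""
    return len(text.split())


def chunk_paragraphs(paragraphs: List[str], limit: int = SOFT_LIMIT,
                     count: Callable[[str], int] = rough_token_count
                     ) -> List[List[str]]:
    """Greedily group paragraphs so each group stays at or under `limit`."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for para in paragraphs:
        size = count(para)
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and used + size > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(para)  # an oversized paragraph becomes its own chunk
        used += size
    if current:
        chunks.append(current)
    return chunks


demo = ["one two three", "four five", "six seven eight nine"]
print(chunk_paragraphs(demo, limit=5))
# [['one two three', 'four five'], ['six seven eight nine']]
```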

Responsible AI & Safety

  • RLHF: The model was fine-tuned using human preference data to align with helpful, harmless, and honest behavior.
  • Red Teaming: Internal adversarial testing was conducted to identify failure modes in Tajik, Russian, and English.
  • Content Filtering: The TajikGPT API includes a multi-layer content filtering system that operates independently of the model.
  • Bias: Training data reflects the diversity of web content and may contain societal biases. Users should apply critical judgment when using outputs for sensitive decisions.
  • Privacy: The training data was processed with PII (personally identifiable information) removal pipelines.

Model Family

| Model | Context | Max Output | Specialty | Tier |
| --- | --- | --- | --- | --- |
| TJ-1.0 Mini | 128K | 4,096 | Fast & lightweight | Free |
| TJ-1.0 | 128K | 8,192 | Balanced, general purpose | Free |
| TJ-1.0 Pro | 128K | 16,384 | Advanced + Vision | Plus |
| TJ-1.0 Ultra | 128K | 32,768 | Top performance | Plus |
| TJ-Coder | 131K | 32,768 | Code specialist | Free |
| TJ-Image 1.0 | | | Text-to-Image | Free |

Citation

If you use TJ-1.0 in research or build products on top of it, please cite:

@misc{tajikgpt2024tj10,
  title        = {TJ-1.0: A Multilingual Large Language Model with Native Tajik Language Support},
  author       = {SoulLab},
  year         = {2024},
  howpublished = {\url{https://tajikgpt.com}},
  note         = {Proprietary model, available via API at https://tajikgpt.com}
}

Built with care for Tajikistan and Central Asia. Developed by SoulLab.
