---
title: AI Sikh Librarian
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: true
hardware: cpu-upgrade
license: cc-by-nc-nd-4.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/69a29db42927b6e81519bb2d/8V-N8NjKd-rFtBwZh3dxb.jpeg
short_description: A scholarly research tool for querying the SikhLibrary
---

📚 AI Sikh Librarian

A scholarly research tool for querying the SikhLibrary — 758 million+ words of Sikh scriptures, history, literature, and research materials.

Built for academic researchers and parchariks who need fast, reliable access to primary sources with proper academic citations.

Built by Bhai Jasvant Singh ਪੰਛੀ in service to Sache Paatshaah.


🌗 Dark / Light Mode

Toggle between dark and light themes using the button in the top-right corner. The preference is saved in the browser and persists across sessions.

Modes

🔍 Research Mode — no LLM, no credits, always available

| Signal | Technology | What it catches |
|---|---|---|
| Semantic similarity | FAISS + all-MiniLM-L6-v2 | Conceptual queries, synonyms, related themes |
| Exact / lexical match | BM25 (bm25s) | Scripture titles, ਅੰਗ numbers, proper nouns |
| Score fusion | Reciprocal Rank Fusion | Best of both signals |
| Query enrichment | Romanisation → Gurmukhi | "waheguru" → ਵਾਹਿਗੁਰੂ |
| Source diversity | Per-source cap (15 max) | Prevents any single source dominating |
| Result diversity | MMR reranking (λ=0.7) | Diverse passages, not 5 from same ਅੰਗ |
| Gurbani extraction | Regex field parser | Gurmukhi · Pronunciation · Translation · Explanation |
| Language labelling | langdetect | "Translation (Romanian)", "Explanation (English)" |
| Relevance display | Percentile rank | 100% = most relevant in this query (intuitive, higher = better) |
| Score tiers | Percentile labels | 🟢 Highly Relevant → ⚪ Peripheral |
| Gurbani deep links | ਅੰਗ → SikhiToTheMax | One-click source verification |
| Citations | Chicago Manual of Style 17th ed. | Ready for academic papers |
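
The score fusion and relevance display signals listed above can be illustrated with a minimal sketch. It assumes each retriever returns a ranked list of document ids. The function names, the constant `k=60` (a common default for Reciprocal Rank Fusion, not confirmed by this README), and the linear percentile mapping are illustrative assumptions, not the app's actual code.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed default, not a value taken from the app.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def to_percentiles(fused):
    """Map fused results to a 0-100 scale where the top hit is 100.

    Raw RRF scores are small and hard to read, so this hypothetical
    helper converts rank position within the query to a percentile.
    """
    n = len(fused)
    if n == 0:
        return []
    if n == 1:
        return [(fused[0][0], 100.0)]
    return [
        (doc_id, 100.0 * (n - 1 - i) / (n - 1))
        for i, (doc_id, _score) in enumerate(fused)
    ]


# Hypothetical rankings from the two retrievers.
semantic = ["d3", "d1", "d7"]  # FAISS order
lexical = ["d1", "d9", "d3"]   # BM25 order

fused = rrf_fuse([semantic, lexical])
for doc_id, pct in to_percentiles(fused):
    print(f"{doc_id}: {pct:.0f}%")
```

Documents that rank well in both lists (here `d1` and `d3`) rise to the top, which is the "best of both signals" behaviour the table describes.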

📖 Learn Mode — Qwen2.5-72B scholarly Q&A

  • PhD-level theological and historical analysis in English
  • All key terms in Punjabi Unicode (ਨਾਮ ਸਿਮਰਨ, ਸੰਤ-ਸਿਪਾਹੀ — never romanised)
  • Automatic verification of ਅੰਗ citations to catch hallucinations; unverified citations are struck through
  • Multi-turn conversation with 6-turn history
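
As a rough illustration of the citation check described in the list above, the sketch below scans an answer for ਅੰਗ references and strikes through any that are not in a set of verified numbers. The regex, the function name, the source of the verified set, and the use of markdown `~~` for strikethrough are all assumptions; the README does not describe how the app implements this.

```python
import re

# Matches "ਅੰਗ" followed by a number (assumed citation shape).
ANG_PATTERN = re.compile(r"ਅੰਗ\s*(\d+)")


def strike_unverified(answer, verified_angs):
    """Wrap ਅੰਗ citations missing from `verified_angs` in ~~strikethrough~~.

    `verified_angs` is a hypothetical set of ਅੰਗ numbers that appear in
    the passages retrieved for this question.
    """
    def check(match):
        ang = int(match.group(1))
        if ang in verified_angs:
            return match.group(0)
        return f"~~{match.group(0)}~~"

    return ANG_PATTERN.sub(check, answer)


# Hypothetical model answer citing one verified and one unverified ਅੰਗ.
answer = "See ਅੰਗ 1 and ਅੰਗ 9999."
print(strike_unverified(answer, verified_angs={1}))
```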

Corpus Coverage

| Category | Contents |
|---|---|
| ਗੁਰਬਾਣੀ | ਸ੍ਰੀ ਗੁਰੂ ਗਰੰਥ ਸਾਹਿਬ ਜੀ, ਸ੍ਰੀ ਦਸਮ ਗ੍ਰੰਥ, ਸ੍ਰੀ ਸਰਬਲੋਹ ਗ੍ਰੰਥ, ਚਰਿਤ੍ਰੋਪਾਖਿਆਨ, ਭਾਈ ਨੰਦ ਲਾਲ ਗੋਯਾ |
| ਗ੍ਰੰਥ | ਸੂਰਜ ਪ੍ਰਕਾਸ਼, ਪੰਥ ਪ੍ਰਕਾਸ਼, ਮਹਾਨ ਕੋਸ਼, ਗੁਰਬਿਲਾਸ, ਤਨਖ਼ਾਹਨਾਮਾ |
| ਸਟੀਕ | ਫਰੀਦਕੋਟ ਵਾਲਾ ਟੀਕਾ, ਗੁਰੂ ਗਰੰਥ ਦਰਪਣ, ਦਸਮ ਗ੍ਰੰਥ ਟੀਕਾ (8 vols.) |
| ਸਾਹਿਤ | ਜਨਮ ਸਾਖੀਆਂ, ਜੰਗਨਾਮੇ, ਸ਼ਹੀਦ ਬਿਲਾਸ, ਪੁਰਾਤਨ ਜਨਮ ਸਾਖੀ |
| ਖੋਜ | Sikh Encyclopedia, SikhiWiki Archive, Ten Gurus, Gurdwaras Database |

Dataset

jsdosanj/SikhLibrary — 331,771 chunks · 758M+ words · 5 categories: ਗੁਰਬਾਣੀ, ਗ੍ਰੰਥ, ਸਟੀਕ, ਸਾਹਿਤ, ਖੋਜ

Storage bucket — jsdosanj/SikhLibrarian-storage/index_output_v2/

| File | Size | Builder |
|---|---|---|
| faiss.index | ~3.5 GB | build_index_local_v2.py |
| doc_store.sqlite | ~6.5 GB | build_index_local_v2.py |
| meta.json | ~180 MB | build_index_local_v2.py |
| bm25.pkl | ~200 MB | build_bm25_local.py (bm25s edition) |

Architecture

query
  └─ sanitise + expand (romanisation map)
       ├─ FAISS vector search  (semantic)  ─┐
       └─ BM25 lexical search  (exact)     ─┴─ RRF fusion
                                                  └─ per-source cap (diversity)
                                                       └─ MMR rerank (diversity)
                                                            └─ percentile calibration
                                                                 └─ structured extraction
                                                                      └─ lang detection + Chicago citations
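
The MMR rerank stage in the pipeline above can be sketched as follows. This is a minimal pure-Python version of Maximal Marginal Relevance using λ=0.7 from the Research Mode table; the cosine similarity measure, the function names, and the toy vectors are assumptions for illustration and may differ from what the app does.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query_vec, candidates, top_k=5, lam=0.7):
    """Greedy MMR over (doc_id, vector) pairs; returns doc ids in pick order.

    Each step picks the candidate maximising
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(item):
            _doc_id, vec = item
            relevance = cosine(query_vec, vec)
            redundancy = max(
                (cosine(vec, chosen_vec) for _id, chosen_vec in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _vec in selected]


# Toy example: "b" is a near-duplicate of "a", "c" is less similar to both.
query = [1.0, 0.0]
candidates = [
    ("a", [1.0, 0.2]),
    ("b", [1.0, 0.21]),
    ("c", [1.0, -0.3]),
]
print(mmr_rerank(query, candidates, top_k=3))
```

Ranking by relevance alone would return `a`, `b`, `c`. With the redundancy penalty, the near-duplicate `b` drops below `c`, which is the kind of diversity the rerank stage is meant to provide.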

Performance (CPU Upgrade — 8 vCPU / 32 GB RAM)

  • Research mode: < 500ms per query for 100–300 concurrent users
  • Embed cache: 10,000-entry LRU (~35–50% hit rate at scale)
  • Embed pool: 6 worker processes (GIL-bypassing true parallelism)
  • BM25 index: 200 MB RAM, built once at startup (60s)
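
The embed cache listed above could look roughly like the sketch below: a least-recently-used cache keyed on the query text, with hit and miss counters so a hit rate can be measured. The class name, the `embed_fn` callback, and the exact-string cache key are assumptions; only the 10,000-entry LRU policy comes from this README.

```python
from collections import OrderedDict


class EmbedCache:
    """LRU cache for query embeddings (illustrative sketch)."""

    def __init__(self, embed_fn, max_entries=10_000):
        self._embed_fn = embed_fn        # hypothetical: text -> vector
        self._max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text):
        if text in self._store:
            # Mark as most recently used.
            self._store.move_to_end(text)
            self.hits += 1
            return self._store[text]
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[text] = vector
        if len(self._store) > self._max_entries:
            # Evict the least recently used entry.
            self._store.popitem(last=False)
        return vector

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Stand-in embedder for demonstration; a real one would call the model.
cache = EmbedCache(embed_fn=lambda text: [float(len(text))], max_entries=2)
for query in ["naam", "seva", "naam"]:
    cache.get(query)
print(f"hit rate: {cache.hit_rate:.2f}")
```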

Hardware tiers

| Feature | Free tier (2 vCPU / 16 GB) | CPU Upgrade (8 vCPU / 32 GB) |
|---|---|---|
| BM25 + FAISS hybrid | | |
| BM25 pkl load time | ~2s | ~2s |
| RRF + multi-query | | |
| Query expansion | | |
| Context window + snippets | | |
| Score tiers + ਅੰਗ links | | |
| Embed cache (10K LRU) | | |
| MMR reranking | ❌ disabled | ✅ enabled |
| Gradio max threads | 8 | 40 |
| Queue max size | 20 | 100 |
| Comfortable concurrent users | ~50 | ~300 |

Set Space secret FREE_TIER=true on free tier. Leave unset on CPU Upgrade.
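
One way the `FREE_TIER` switch could drive these settings is sketched below. The values come from the hardware tiers table; the `TierConfig` dataclass, the `load_tier_config` function, and the rule that only the string `true` (case-insensitive) enables the free tier are assumptions, not the app's confirmed logic.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    mmr_enabled: bool
    max_threads: int
    queue_max_size: int


def load_tier_config(env=None):
    """Pick tier settings from the FREE_TIER variable (hypothetical helper)."""
    env = os.environ if env is None else env
    free_tier = env.get("FREE_TIER", "").strip().lower() == "true"
    if free_tier:
        # Free tier: MMR off, smaller thread pool and queue.
        return TierConfig(mmr_enabled=False, max_threads=8, queue_max_size=20)
    # CPU Upgrade: MMR on, larger thread pool and queue.
    return TierConfig(mmr_enabled=True, max_threads=40, queue_max_size=100)


config = load_tier_config()
print(config)
```

In a Gradio app, values like these would typically be passed to `queue(max_size=...)` and `launch(max_threads=...)`.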

License

CC BY-NC-ND 4.0: creativecommons.org/licenses/by-nc-nd/4.0

Attribution: SikhLibrary Dataset (jsdosanj/SikhLibrary) · ShabadOS · BaniDB · Sikhi.IO · Isher Micro Media · Maintainer: Bhai Jasvant Singh ਪੰਛੀ