title: AI Sikh Librarian
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: true
hardware: cpu-upgrade
license: cc-by-nc-nd-4.0
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/69a29db42927b6e81519bb2d/8V-N8NjKd-rFtBwZh3dxb.jpeg
short_description: A scholarly research tool for querying the SikhLibrary
📚 AI Sikh Librarian
A scholarly research tool for querying the SikhLibrary — 758 million+ words of Sikh scriptures, history, literature, and research materials.
Built for academic researchers and parchariks who need fast, reliable access to primary sources with proper academic citations.
Built by Bhai Jasvant Singh ਪੰਛੀ in service to Sache Paatshaah.
🌗 Dark / Light Mode
Toggle between dark and light themes using the button in the top-right corner. The preference is saved in the browser and persists across sessions.
Modes
🔍 Research Mode — no LLM, no credits, always available
| Signal | Technology | What it catches |
|---|---|---|
| Semantic similarity | FAISS + all-MiniLM-L6-v2 | Conceptual queries, synonyms, related themes |
| Exact / lexical match | BM25 (bm25s) | Scripture titles, ਅੰਗ numbers, proper nouns |
| Score fusion | Reciprocal Rank Fusion | Best of both signals |
| Query enrichment | Romanisation → Gurmukhi | "waheguru" → ਵਾਹਿਗੁਰੂ |
| Source diversity | Per-source cap (15 max) | Prevents any single source dominating |
| Result diversity | MMR reranking (λ=0.7) | Diverse passages, not 5 from same ਅੰਗ |
| Gurbani extraction | Regex field parser | Gurmukhi · Pronunciation · Translation · Explanation |
| Language labelling | langdetect | "Translation (Romanian)", "Explanation (English)" |
| Relevance display | Percentile rank | 100% = most relevant in this query (intuitive, higher = better) |
| Score tiers | Percentile labels | 🟢 Highly Relevant → ⚪ Peripheral |
| Gurbani deep links | ਅੰਗ → SikhiToTheMax | One-click source verification |
| Citations | Chicago Manual of Style 17th ed. | Ready for academic papers |
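The score-fusion and relevance-display rows above can be illustrated with a minimal sketch. It is not the Space's actual code: the function names `rrf_fuse` and `percentile_ranks`, the constant `k=60` (a commonly used RRF default; the value used here is not stated), the percentile formula, and the document ids are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def percentile_ranks(fused):
    """Map a best-first result list to percentages where 100 = top result.

    Assumed formula: the result at position i (0-based) of n gets
    100 * (n - i) / n, so higher always means more relevant.
    """
    n = len(fused)
    return [(doc_id, round(100 * (n - i) / n))
            for i, (doc_id, _score) in enumerate(fused)]


# Hypothetical hit lists from the semantic (FAISS) and lexical (BM25) signals.
semantic_hits = ["d3", "d1", "d2"]
lexical_hits = ["d1", "d4", "d3"]

fused = rrf_fuse([semantic_hits, lexical_hits])
ranked = percentile_ranks(fused)
# ranked == [("d1", 100), ("d3", 75), ("d4", 50), ("d2", 25)]
```

Documents found by both signals (`d1`, `d3`) rise above those found by only one, which is the "best of both signals" behaviour the table describes. RRF uses ranks only, so the FAISS and BM25 scores never need to be put on a common scale.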
📖 Learn Mode — Qwen2.5-72B scholarly Q&A
- PhD-level theological and historical analysis in English
- All key terms in Punjabi Unicode (ਨਾਮ ਸਿਮਰਨ, ਸੰਤ-ਸਿਪਾਹੀ — never romanised)
- Automatic ਅੰਗ hallucination verification — unverified citations struck through
- Multi-turn conversation with 6-turn history
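The ਅੰਗ verification step could look roughly like the sketch below. How the Space actually verifies citations is not documented here, so this assumes that citations appear in the answer as "ਅੰਗ" followed by a number, that a citation counts as verified when its number occurs among the retrieved passages, and that strike-through is rendered with Markdown `~~…~~`. The names `ANG_PATTERN` and `strike_unverified` are hypothetical.

```python
import re

# Assumed citation shape: the word ਅੰਗ, optional whitespace, then digits.
ANG_PATTERN = re.compile(r"ਅੰਗ\s*(\d+)")


def strike_unverified(answer, verified_angs):
    """Wrap ਅੰਗ citations that are not in verified_angs in ~~strike-through~~."""
    def replace(match):
        ang = int(match.group(1))
        if ang in verified_angs:
            return match.group(0)          # citation is backed by a passage
        return f"~~{match.group(0)}~~"     # unverified: strike it through
    return ANG_PATTERN.sub(replace, answer)


# Hypothetical model answer; only ਅੰਗ 1 appeared in the retrieved passages.
checked = strike_unverified("See ਅੰਗ 1 and ਅੰਗ 999.", verified_angs={1})
# checked == "See ਅੰਗ 1 and ~~ਅੰਗ 999~~."
```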
Corpus Coverage
| Category | Contents |
|---|---|
| ਗੁਰਬਾਣੀ | ਸ੍ਰੀ ਗੁਰੂ ਗਰੰਥ ਸਾਹਿਬ ਜੀ, ਸ੍ਰੀ ਦਸਮ ਗ੍ਰੰਥ, ਸ੍ਰੀ ਸਰਬਲੋਹ ਗ੍ਰੰਥ, ਚਰਿਤ੍ਰੋਪਾਖਿਆਨ, ਭਾਈ ਨੰਦ ਲਾਲ ਗੋਯਾ |
| ਗ੍ਰੰਥ | ਸੂਰਜ ਪ੍ਰਕਾਸ਼, ਪੰਥ ਪ੍ਰਕਾਸ਼, ਮਹਾਨ ਕੋਸ਼, ਗੁਰਬਿਲਾਸ, ਤਨਖ਼ਾਹਨਾਮਾ |
| ਸਟੀਕ | ਫਰੀਦਕੋਟ ਵਾਲਾ ਟੀਕਾ, ਗੁਰੂ ਗਰੰਥ ਦਰਪਣ, ਦਸਮ ਗ੍ਰੰਥ ਟੀਕਾ (8 vols.) |
| ਸਾਹਿਤ | ਜਨਮ ਸਾਖੀਆਂ, ਜੰਗਨਾਮੇ, ਸ਼ਹੀਦ ਬਿਲਾਸ, ਪੁਰਾਤਨ ਜਨਮ ਸਾਖੀ |
| ਖੋਜ | Sikh Encyclopedia, SikhiWiki Archive, Ten Gurus, Gurdwaras Database |
Dataset
jsdosanj/SikhLibrary: 331,771 chunks · 758M+ words · 5 categories (ਗੁਰਬਾਣੀ, ਗ੍ਰੰਥ, ਸਟੀਕ, ਸਾਹਿਤ, ਖੋਜ)
Storage bucket — jsdosanj/SikhLibrarian-storage/index_output_v2/
| File | Size | Builder |
|---|---|---|
| faiss.index | ~3.5 GB | build_index_local_v2.py |
| doc_store.sqlite | ~6.5 GB | build_index_local_v2.py |
| meta.json | ~180 MB | build_index_local_v2.py |
| bm25.pkl | ~200 MB | build_bm25_local.py (bm25s edition) |
Architecture
query
└─ sanitise + expand (romanisation map)
├─ FAISS vector search (semantic) ─┐
└─ BM25 lexical search (exact) ─┴─ RRF fusion
└─ per-source cap (diversity)
└─ MMR rerank (diversity)
└─ percentile calibration
└─ structured extraction
└─ lang detection + Chicago citations
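The two diversity stages in the pipeline above can be sketched in plain Python. This is an illustration under assumptions, not the Space's implementation: the hit dictionaries with `source`, `score`, and `vec` keys and the function names are hypothetical, while the cap of 15 and λ=0.7 come from the Research Mode table.

```python
import math


def cap_per_source(hits, max_per_source=15):
    """Keep at most max_per_source hits from each source, preserving order."""
    counts = {}
    kept = []
    for hit in hits:
        seen = counts.get(hit["source"], 0)
        if seen < max_per_source:
            kept.append(hit)
            counts[hit["source"]] = seen + 1
    return kept


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(hits, top_k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    Repeatedly picks the hit maximising
    lam * relevance - (1 - lam) * (max similarity to already-picked hits).
    """
    selected = []
    remaining = list(hits)
    while remaining and len(selected) < top_k:
        def mmr_score(hit):
            redundancy = max(
                (cosine(hit["vec"], s["vec"]) for s in selected), default=0.0
            )
            return lam * hit["score"] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical fused hits: "a" and "b" are near-duplicates from one source.
hits = [
    {"id": "a", "source": "S1", "score": 1.0, "vec": [1.0, 0.0]},
    {"id": "b", "source": "S1", "score": 0.9, "vec": [1.0, 0.0]},
    {"id": "c", "source": "S2", "score": 0.8, "vec": [0.0, 1.0]},
]
reranked_ids = [h["id"] for h in mmr_rerank(hits, top_k=2)]
# reranked_ids == ["a", "c"]: "b" is skipped as redundant with "a".
```

With λ=0.7 relevance still dominates, but an exact duplicate of an already-selected passage loses 0.3 from its score, which is enough here for the less relevant but distinct passage to win the second slot.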
Performance (CPU Upgrade — 8 vCPU / 32 GB RAM)
- Research mode: < 500ms per query for 100–300 concurrent users
- Embed cache: 10,000-entry LRU (~35–50% hit rate at scale)
- Embed pool: 6 worker processes (GIL-bypassing true parallelism)
- BM25 index: 200 MB RAM, built once at startup (60s)
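The embed cache listed above can be approximated with a small LRU wrapper around an embedding function. This is a sketch under assumptions: the class name `EmbedCache`, the `embed_fn` callable, and the hit/miss counters are hypothetical, and the real cache may be implemented differently (for example with `functools.lru_cache`). Only the 10,000-entry capacity comes from the list above.

```python
from collections import OrderedDict


class EmbedCache:
    """Least-recently-used cache mapping query text to its embedding."""

    def __init__(self, embed_fn, max_entries=10_000):
        self._embed_fn = embed_fn          # hypothetical: text -> vector
        self._max_entries = max_entries
        self._entries = OrderedDict()      # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, query):
        if query in self._entries:
            self._entries.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self._entries[query]
        self.misses += 1
        vector = self._embed_fn(query)
        self._entries[query] = vector
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return vector


# Toy embedding function and a tiny capacity to show eviction.
cache = EmbedCache(lambda text: [float(len(text))], max_entries=2)
for query in ["a", "b", "a", "c", "b"]:
    cache.get(query)
# "b" was evicted when "c" arrived, so the final lookup of "b" is a miss.
```

The sketch is not thread-safe; a server handling concurrent requests would need a lock around `get` or a cache per worker.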
Hardware tiers
| Feature | Free tier (2 vCPU / 16 GB) | CPU Upgrade (8 vCPU / 32 GB) |
|---|---|---|
| BM25 + FAISS hybrid | ✅ | ✅ |
| BM25 pkl load time | ~2s | ~2s |
| RRF + multi-query | ✅ | ✅ |
| Query expansion | ✅ | ✅ |
| Context window + snippets | ✅ | ✅ |
| Score tiers + ਅੰਗ links | ✅ | ✅ |
| Embed cache (10K LRU) | ✅ | ✅ |
| MMR reranking | ❌ disabled | ✅ enabled |
| Gradio max threads | 8 | 40 |
| Queue max size | 20 | 100 |
| Comfortable concurrent users | ~50 | ~300 |
Set the Space secret FREE_TIER=true on the free tier. Leave it unset on CPU Upgrade.
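One way the `FREE_TIER` secret could drive the differences in the table is sketched below. The function name `load_settings` and the dictionary keys are hypothetical, and accepting only the literal value `true` (case-insensitive) is an assumption. The thread, queue, and MMR values are taken from the table above.

```python
import os


def load_settings(env=None):
    """Derive tier-dependent settings from the FREE_TIER variable.

    Space secrets are exposed to the app as environment variables. An
    unset or non-"true" value is treated as CPU Upgrade in this sketch.
    """
    env = os.environ if env is None else env
    free_tier = env.get("FREE_TIER", "").strip().lower() == "true"
    return {
        "free_tier": free_tier,
        "mmr_enabled": not free_tier,            # MMR is disabled on free tier
        "max_threads": 8 if free_tier else 40,   # Gradio max threads
        "queue_max_size": 20 if free_tier else 100,
    }


free_settings = load_settings({"FREE_TIER": "true"})
upgrade_settings = load_settings({})
```

In a Gradio app these values would typically be passed on as `demo.queue(max_size=...)` and `demo.launch(max_threads=...)`; check the parameter names against the Gradio version pinned in `sdk_version`.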
License
CC BY-NC-ND 4.0 — creativecommons.org/licenses/by-nc-nd/4.0
Attribution: SikhLibrary Dataset (jsdosanj/SikhLibrary) · ShabadOS · BaniDB · Sikhi.IO · Isher Micro Media
Maintainer: Bhai Jasvant Singh ਪੰਛੀ