---
title: AI Sikh Librarian
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: true
hardware: cpu-upgrade
license: cc-by-nc-nd-4.0
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/69a29db42927b6e81519bb2d/8V-N8NjKd-rFtBwZh3dxb.jpeg
short_description: A scholarly research tool for querying the SikhLibrary
---
# 📚 AI Sikh Librarian
A scholarly research tool for querying the
[SikhLibrary](https://huggingface.co/datasets/jsdosanj/SikhLibrary) —
758 million+ words of Sikh scriptures, history, literature, and research materials.
Built for **academic researchers and parchariks** who need fast, reliable access to
primary sources with proper academic citations.
Built by Bhai Jasvant Singh ਪੰਛੀ in service to Sache Paatshaah.
---
## 🌗 Dark / Light Mode
Toggle between dark and light themes using the button in the top-right corner.
The preference is stored in the browser and persists across sessions.
## Modes
### 🔍 Research Mode — no LLM, no credits, always available
| Signal | Technology | What it catches |
|---|---|---|
| Semantic similarity | FAISS + all-MiniLM-L6-v2 | Conceptual queries, synonyms, related themes |
| Exact / lexical match | BM25 (bm25s) | Scripture titles, ਅੰਗ numbers, proper nouns |
| Score fusion | Reciprocal Rank Fusion | Best of both signals |
| Query enrichment | Romanisation → Gurmukhi | "waheguru" → ਵਾਹਿਗੁਰੂ |
| Source diversity | Per-source cap (15 max) | Prevents any single source dominating |
| Result diversity | MMR reranking (λ=0.7) | Diverse passages rather than five from the same ਅੰਗ |
| Gurbani extraction | Regex field parser | Gurmukhi · Pronunciation · Translation · Explanation |
| Language labelling | langdetect | "Translation (Romanian)", "Explanation (English)" |
| Relevance display | Percentile rank | 100% = most relevant result for this query; higher is better |
| Score tiers | Percentile labels | 🟢 Highly Relevant → ⚪ Peripheral |
| Gurbani deep links | ਅੰਗ → SikhiToTheMax | One-click source verification |
| Citations | Chicago Manual of Style 17th ed. | Ready for academic papers |
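The score-fusion row can be illustrated with a minimal Reciprocal Rank Fusion sketch. It is not the app's actual code: the function name `rrf_fuse`, the chunk ids, and the constant `k=60` (a commonly used RRF default) are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    k=60 is an assumed default; this README does not state the value used.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical ranked outputs from the semantic and lexical retrievers.
semantic_hits = ["c12", "c7", "c3"]
lexical_hits = ["c7", "c99", "c12"]
fused = rrf_fuse([semantic_hits, lexical_hits])
print(fused)
```

Chunks that rank well in both lists (`c7`, `c12`) rise above chunks that appear in only one, which is the "best of both signals" behaviour the table describes.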
### 📖 Learn Mode — Qwen2.5-72B scholarly Q&A
- PhD-level theological and historical analysis in English
- All key terms in Punjabi Unicode (ਨਾਮ ਸਿਮਰਨ, ਸੰਤ-ਸਿਪਾਹੀ — never romanised)
- Automatic verification of ਅੰਗ citations to catch hallucinations: unverified citations are struck through
- Multi-turn conversation with 6-turn history
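A minimal sketch of how the ਅੰਗ verification could work, assuming citations appear in the answer as `ਅੰਗ <number>` and that the set of ਅੰਗ numbers found in the retrieved passages is available. The function name `strike_unverified`, the citation pattern, and the use of Markdown `~~strikethrough~~` are assumptions, not the app's confirmed implementation.

```python
import re

# Assumed citation shape: the word ਅੰਗ followed by a number.
ANG_PATTERN = re.compile(r"ਅੰਗ\s*(\d+)")


def strike_unverified(answer: str, retrieved_angs: set[int]) -> str:
    """Strike through any ਅੰਗ citation that no retrieved passage supports."""

    def check(match: re.Match) -> str:
        ang = int(match.group(1))
        if ang in retrieved_angs:
            return match.group(0)  # verified: leave the citation untouched
        return f"~~{match.group(0)}~~"  # unverified: Markdown strikethrough

    return ANG_PATTERN.sub(check, answer)


# Hypothetical model answer; only ਅੰਗ 462 was present in the retrieved passages.
checked = strike_unverified("See ਅੰਗ 462 and ਅੰਗ 9999.", {462})
print(checked)
```

Checking against the retrieved passages, rather than against a fixed page range, keeps the check tied to sources the reader can actually open.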
## Corpus Coverage
| Category | Contents |
|---|---|
| **ਗੁਰਬਾਣੀ** | ਸ੍ਰੀ ਗੁਰੂ ਗਰੰਥ ਸਾਹਿਬ ਜੀ, ਸ੍ਰੀ ਦਸਮ ਗ੍ਰੰਥ, ਸ੍ਰੀ ਸਰਬਲੋਹ ਗ੍ਰੰਥ, ਚਰਿਤ੍ਰੋਪਾਖਿਆਨ, ਭਾਈ ਨੰਦ ਲਾਲ ਗੋਯਾ |
| **ਗ੍ਰੰਥ** | ਸੂਰਜ ਪ੍ਰਕਾਸ਼, ਪੰਥ ਪ੍ਰਕਾਸ਼, ਮਹਾਨ ਕੋਸ਼, ਗੁਰਬਿਲਾਸ, ਤਨਖ਼ਾਹਨਾਮਾ |
| **ਸਟੀਕ** | ਫਰੀਦਕੋਟ ਵਾਲਾ ਟੀਕਾ, ਗੁਰੂ ਗਰੰਥ ਦਰਪਣ, ਦਸਮ ਗ੍ਰੰਥ ਟੀਕਾ (8 vols.) |
| **ਸਾਹਿਤ** | ਜਨਮ ਸਾਖੀਆਂ, ਜੰਗਨਾਮੇ, ਸ਼ਹੀਦ ਬਿਲਾਸ, ਪੁਰਾਤਨ ਜਨਮ ਸਾਖੀ |
| **ਖੋਜ** | Sikh Encyclopedia, SikhiWiki Archive, Ten Gurus, Gurdwaras Database |
## Dataset
[`jsdosanj/SikhLibrary`](https://huggingface.co/datasets/jsdosanj/SikhLibrary)
— 331,771 chunks · 758M+ words · 5 categories: ਗੁਰਬਾਣੀ, ਗ੍ਰੰਥ, ਸਟੀਕ, ਸਾਹਿਤ, ਖੋਜ
## Storage bucket — `jsdosanj/SikhLibrarian-storage/index_output_v2/`
| File | Size | Builder |
|---|---|---|
| `faiss.index` | ~3.5 GB | `build_index_local_v2.py` |
| `doc_store.sqlite` | ~6.5 GB | `build_index_local_v2.py` |
| `meta.json` | ~180 MB | `build_index_local_v2.py` |
| `bm25.pkl` | ~200 MB | `build_bm25_local.py` (bm25s edition) |
---
## Architecture
```
query
└─ sanitise + expand (romanisation map)
├─ FAISS vector search (semantic) ─┐
└─ BM25 lexical search (exact) ─┴─ RRF fusion
└─ per-source cap (diversity)
└─ MMR rerank (diversity)
└─ percentile calibration
└─ structured extraction
└─ lang detection + Chicago citations
```
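The MMR rerank step in the pipeline above can be sketched as follows, using the λ = 0.7 listed in the Research Mode table. This is an illustrative pure-Python version under assumed inputs (plain lists of floats as embeddings); the names `cosine` and `mmr_rerank` are hypothetical, and the real app may well use vectorised code instead.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query_vec, doc_vecs, top_k=5, lam=0.7):
    """Return indices of doc_vecs chosen by Maximal Marginal Relevance.

    Each step picks the candidate maximising
    lam * relevance - (1 - lam) * (max similarity to already selected docs).
    """
    relevance = [cosine(query_vec, vec) for vec in doc_vecs]
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < top_k:

        def mmr_score(i):
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: passages 0 and 1 are near-duplicates, passage 2 is distinct.
passages = [[1.0, 0.2], [1.0, 0.25], [1.0, -0.4]]
order = mmr_rerank([1.0, 0.0], passages, top_k=2)
print(order)
```

In this toy case the second slot goes to the distinct passage rather than the near-duplicate, even though the near-duplicate is slightly more relevant to the query.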
## Performance (CPU Upgrade — 8 vCPU / 32 GB RAM)
- Research mode: under 500 ms per query with 100–300 concurrent users
- Embed cache: 10,000-entry LRU (~35–50% hit rate at scale)
- Embed pool: 6 worker processes (separate processes sidestep the GIL, so embedding runs in true parallel)
- BM25 index: ~200 MB RAM, built once at startup (~60s)
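The embed cache listed above could look roughly like this LRU sketch. The class name `EmbedCache`, the whitespace and case normalisation of the key, and the stand-in embedding function are assumptions for illustration; only the 10,000-entry capacity comes from this README.

```python
from collections import OrderedDict


class EmbedCache:
    """Least-recently-used cache mapping a normalised query to its embedding."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query, embed_fn):
        # Assumed normalisation: collapse whitespace and lowercase the query.
        key = " ".join(query.split()).lower()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = embed_fn(key)
        self._store[key] = vector
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return vector


def fake_embed(text):
    """Stand-in for the real embedding model, used only in this example."""
    return [float(len(text))]


cache = EmbedCache(max_entries=2)
cache.get_or_compute("naam", fake_embed)
cache.get_or_compute("  Naam ", fake_embed)  # same key after normalisation
print(cache.hits, cache.misses)
```

The hit and miss counters make it straightforward to measure a hit rate like the one quoted above.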
## Hardware tiers
| Feature | Free tier (2 vCPU / 16 GB) | CPU Upgrade (8 vCPU / 32 GB) |
|---|---|---|
| BM25 + FAISS hybrid | ✅ | ✅ |
| BM25 pkl load time | ~2s | ~2s |
| RRF + multi-query | ✅ | ✅ |
| Query expansion | ✅ | ✅ |
| Context window + snippets | ✅ | ✅ |
| Score tiers + ਅੰਗ links | ✅ | ✅ |
| Embed cache (10K LRU) | ✅ | ✅ |
| MMR reranking | ❌ disabled | ✅ enabled |
| Gradio max threads | 8 | 40 |
| Queue max size | 20 | 100 |
| Comfortable concurrent users | ~50 | ~300 |
Set the Space secret `FREE_TIER=true` when running on the free tier; leave it unset on CPU Upgrade.
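Space secrets are exposed to the app as environment variables, so the tier switch can be read at startup. The sketch below shows one way to map `FREE_TIER` to the limits in the table above; the function name `load_tier_settings` and the dictionary keys are hypothetical and are not taken from `app.py`.

```python
import os


def load_tier_settings(env=None):
    """Pick per-tier limits based on the FREE_TIER secret.

    The numbers mirror the hardware tiers table; the key names are illustrative.
    """
    env = os.environ if env is None else env
    free_tier = env.get("FREE_TIER", "").strip().lower() == "true"
    if free_tier:
        return {"mmr_enabled": False, "max_threads": 8, "queue_max_size": 20}
    return {"mmr_enabled": True, "max_threads": 40, "queue_max_size": 100}


settings = load_tier_settings()
print(settings)
```

These values would then feed the Gradio queue and launch configuration and decide whether the MMR rerank step runs.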
## License
**CC BY-NC-ND 4.0** — [creativecommons.org/licenses/by-nc-nd/4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)
Attribution: **SikhLibrary Dataset** (`jsdosanj/SikhLibrary`) · ShabadOS · BaniDB · Sikhi.IO · Isher Micro Media
· Maintainer: **Bhai Jasvant Singh ਪੰਛੀ**