---
title: AI Sikh Librarian
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: true
hardware: cpu-upgrade
license: cc-by-nc-nd-4.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/69a29db42927b6e81519bb2d/8V-N8NjKd-rFtBwZh3dxb.jpeg
short_description: A scholarly research tool for querying the SikhLibrary
---

# 📚 AI Sikh Librarian

A scholarly research tool for querying the
[SikhLibrary](https://huggingface.co/datasets/jsdosanj/SikhLibrary) —
758 million+ words of Sikh scriptures, history, literature, and research materials.

Built for **academic researchers and parchariks** who need fast, reliable access to
primary sources with proper academic citations.

Built by Bhai Jasvant Singh ਪੰਛੀ in service to Sache Paatshaah.

---

### 🌗 Dark / Light Mode
Toggle between dark and light themes using the button in the top-right corner.
Preference is saved in the browser across sessions.

## Modes

### 🔍 Research Mode — no LLM, no credits, always available

| Signal | Technology | What it catches |
|---|---|---|
| Semantic similarity | FAISS + all-MiniLM-L6-v2 | Conceptual queries, synonyms, related themes |
| Exact / lexical match | BM25 (bm25s) | Scripture titles, ਅੰਗ numbers, proper nouns |
| Score fusion | Reciprocal Rank Fusion | Best of both signals |
| Query enrichment | Romanisation → Gurmukhi | "waheguru" → ਵਾਹਿਗੁਰੂ |
| Source diversity | Per-source cap (15 max) | Prevents any single source dominating |
| Result diversity | MMR reranking (λ=0.7) | Diverse passages, not 5 from same ਅੰਗ |
| Gurbani extraction | Regex field parser | Gurmukhi · Pronunciation · Translation · Explanation |
| Language labelling | langdetect | "Translation (Romanian)", "Explanation (English)" |
| Relevance display | Percentile rank | 100% = most relevant in this query (intuitive, higher = better) |
| Score tiers | Percentile labels | 🟢 Highly Relevant → ⚪ Peripheral |
| Gurbani deep links | ਅੰਗ → SikhiToTheMax | One-click source verification |
| Citations | Chicago Manual of Style 17th ed. | Ready for academic papers |
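The score fusion row can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The function name `rrf_fuse`, the chunk ids, and the constant `k=60` (a commonly used default) are assumptions for illustration; the README does not state the value the Space actually uses.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every doc it contains,
    where rank starts at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical chunk ids returned by the two retrievers.
faiss_hits = ["c12", "c7", "c3"]   # semantic ranking
bm25_hits = ["c7", "c44", "c12"]   # lexical ranking

print(rrf_fuse([faiss_hits, bm25_hits]))
# ['c7', 'c12', 'c44', 'c3']
```

Because RRF works on ranks rather than raw scores, the FAISS distances and BM25 scores never need to be normalised onto a common scale.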

### 📖 Learn Mode — Qwen2.5-72B scholarly Q&A
- PhD-level theological and historical analysis in English
- All key terms in Punjabi Unicode (ਨਾਮ ਸਿਮਰਨ, ਸੰਤ-ਸਿਪਾਹੀ — never romanised)
- Automatic verification of cited ਅੰਗ numbers to catch hallucinations; unverified citations are struck through
- Multi-turn conversation with 6-turn history
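One way the ਅੰਗ verification step could work is sketched below. This is an assumption, not the Space's documented method: it supposes that the set of verified ਅੰਗ numbers comes from the retrieved passages and that citations in the answer look like `ਅੰਗ 123`. The names `ANG_PATTERN` and `strike_unverified` are hypothetical.

```python
import re

# Assumed citation shape: the word ਅੰਗ followed by a number.
ANG_PATTERN = re.compile(r"ਅੰਗ\s+(\d+)")


def strike_unverified(answer, verified_angs):
    """Wrap any cited ਅੰਗ not present in verified_angs in Markdown strikethrough."""
    def replace(match):
        ang = int(match.group(1))
        if ang in verified_angs:
            return match.group(0)
        return f"~~{match.group(0)}~~"

    return ANG_PATTERN.sub(replace, answer)


# verified_angs would be collected from the retrieved source passages.
print(strike_unverified("See ਅੰਗ 1 and ਅੰਗ 9999.", verified_angs={1, 462}))
# See ਅੰਗ 1 and ~~ਅੰਗ 9999~~.
```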

## Corpus Coverage

| Category | Contents |
|---|---|
| **ਗੁਰਬਾਣੀ** | ਸ੍ਰੀ ਗੁਰੂ ਗਰੰਥ ਸਾਹਿਬ ਜੀ, ਸ੍ਰੀ ਦਸਮ ਗ੍ਰੰਥ, ਸ੍ਰੀ ਸਰਬਲੋਹ ਗ੍ਰੰਥ, ਚਰਿਤ੍ਰੋਪਾਖਿਆਨ, ਭਾਈ ਨੰਦ ਲਾਲ ਗੋਯਾ |
| **ਗ੍ਰੰਥ** | ਸੂਰਜ ਪ੍ਰਕਾਸ਼, ਪੰਥ ਪ੍ਰਕਾਸ਼, ਮਹਾਨ ਕੋਸ਼, ਗੁਰਬਿਲਾਸ, ਤਨਖ਼ਾਹਨਾਮਾ |
| **ਸਟੀਕ** | ਫਰੀਦਕੋਟ ਵਾਲਾ ਟੀਕਾ, ਗੁਰੂ ਗਰੰਥ ਦਰਪਣ, ਦਸਮ ਗ੍ਰੰਥ ਟੀਕਾ (8 vols.) |
| **ਸਾਹਿਤ** | ਜਨਮ ਸਾਖੀਆਂ, ਜੰਗਨਾਮੇ, ਸ਼ਹੀਦ ਬਿਲਾਸ, ਪੁਰਾਤਨ ਜਨਮ ਸਾਖੀ |
| **ਖੋਜ** | Sikh Encyclopedia, SikhiWiki Archive, Ten Gurus, Gurdwaras Database |

## Dataset

[`jsdosanj/SikhLibrary`](https://huggingface.co/datasets/jsdosanj/SikhLibrary)
— 331,771 chunks · 758M+ words · 5 categories: ਗੁਰਬਾਣੀ, ਗ੍ਰੰਥ, ਸਟੀਕ, ਸਾਹਿਤ, ਖੋਜ

## Storage bucket — `jsdosanj/SikhLibrarian-storage/index_output_v2/`

| File | Size | Builder |
|---|---|---|
| `faiss.index` | ~3.5 GB | `build_index_local_v2.py` |
| `doc_store.sqlite` | ~6.5 GB | `build_index_local_v2.py` |
| `meta.json` | ~180 MB | `build_index_local_v2.py` |
| `bm25.pkl` | ~200 MB | `build_bm25_local.py` (bm25s edition) |

---


## Architecture

```
query
  └─ sanitise + expand (romanisation map)
       ├─ FAISS vector search  (semantic)  ─┐
       └─ BM25 lexical search  (exact)     ─┴─ RRF fusion
                                                  └─ per-source cap (diversity)
                                                       └─ MMR rerank (diversity)
                                                            └─ percentile calibration
                                                                 └─ structured extraction
                                                                      └─ lang detection + Chicago citations
```
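The MMR rerank and percentile calibration stages of the pipeline can be sketched in plain Python. Everything here is illustrative: the function names, the toy 2-dimensional vectors, and the percentile formula (top result = 100%) are assumptions; only λ=0.7 comes from the table above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(query_vec, candidates, lam=0.7, top_k=3):
    """Maximal Marginal Relevance over (doc_id, vector) pairs.

    Each step picks the candidate maximising
    lam * relevance - (1 - lam) * similarity to anything already selected.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            redundancy = max(
                (cosine(vec, chosen) for _, chosen in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]


def percentile_ranks(ordered_ids):
    """Map rank position to a percentage so the top result shows 100%."""
    n = len(ordered_ids)
    return {d: round(100 * (n - i) / n) for i, d in enumerate(ordered_ids)}


query = [1.0, 0.0]
# "b" is a near-duplicate of "a"; "c" is slightly less relevant but different.
candidates = [("a", [1.0, 0.3]), ("b", [1.0, 0.31]), ("c", [1.0, -0.35])]

order = mmr_rerank(query, candidates, lam=0.7, top_k=3)
print(order)                    # ['a', 'c', 'b']
print(percentile_ranks(order))  # {'a': 100, 'c': 67, 'b': 33}
```

With `lam=1.0` the redundancy penalty vanishes and the order falls back to pure relevance (`a`, `b`, `c`), which shows what the reranker adds.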

## Performance (CPU Upgrade — 8 vCPU / 32 GB RAM)

- Research mode: < 500ms per query for 100–300 concurrent users
- Embed cache: 10,000-entry LRU (~35–50% hit rate at scale)
- Embed pool: 6 worker processes (GIL-bypassing true parallelism)
- BM25 index: ~200 MB RAM, built once at startup (~60s)
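The embed cache can be pictured as a bounded LRU map from query text to embedding vector. The sketch below is an assumption about its shape, not the Space's actual code: the class name `EmbedCache`, the whitespace and case normalisation of keys, and the stand-in embedding function are all hypothetical.

```python
from collections import OrderedDict


class EmbedCache:
    """Least-recently-used cache in front of an embedding function."""

    def __init__(self, embed_fn, max_entries=10_000):
        self._embed_fn = embed_fn
        self._max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query):
        # Assumed normalisation: lowercase and collapse whitespace.
        key = " ".join(query.lower().split())
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(key)
        self._store[key] = vector
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return vector


# Stand-in embedder; the real app would call the sentence-embedding model.
cache = EmbedCache(lambda text: [float(len(text))], max_entries=10_000)
cache.get("Naam Simran")
cache.get("  naam   simran ")
print(cache.hits, cache.misses)  # 1 1
```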


## Hardware tiers

| Feature | Free tier (2 vCPU / 16 GB) | CPU Upgrade (8 vCPU / 32 GB) |
|---|---|---|
| BM25 + FAISS hybrid | ✅ | ✅ |
| BM25 pkl load time | ~2s | ~2s |
| RRF + multi-query | ✅ | ✅ |
| Query expansion | ✅ | ✅ |
| Context window + snippets | ✅ | ✅ |
| Score tiers + ਅੰਗ links | ✅ | ✅ |
| Embed cache (10K LRU) | ✅ | ✅ |
| MMR reranking | ❌ disabled | ✅ enabled |
| Gradio max threads | 8 | 40 |
| Queue max size | 20 | 100 |
| Comfortable concurrent users | ~50 | ~300 |

Set the Space secret `FREE_TIER=true` on the free tier; leave it unset on CPU Upgrade.
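
A minimal sketch of how `FREE_TIER` might select the settings in the table above. The function name `load_runtime_settings` and the dictionary keys are hypothetical; the numeric values are taken from the table, and how `app.py` actually reads the secret is not documented here.

```python
import os


def load_runtime_settings(env=None):
    """Pick tier-dependent settings from the FREE_TIER variable.

    Space secrets are exposed to the app as environment variables.
    """
    env = os.environ if env is None else env
    free_tier = env.get("FREE_TIER", "").strip().lower() == "true"
    if free_tier:
        # Free tier: MMR disabled, smaller thread pool and queue.
        return {"mmr_enabled": False, "max_threads": 8, "queue_max_size": 20}
    # CPU Upgrade: FREE_TIER unset, full feature set.
    return {"mmr_enabled": True, "max_threads": 40, "queue_max_size": 100}


print(load_runtime_settings({"FREE_TIER": "true"}))
# {'mmr_enabled': False, 'max_threads': 8, 'queue_max_size': 20}
```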

## License

**CC BY-NC-ND 4.0** — [creativecommons.org/licenses/by-nc-nd/4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

Attribution: **SikhLibrary Dataset** (`jsdosanj/SikhLibrary`) · ShabadOS · BaniDB · Sikhi.IO · Isher Micro Media
· Maintainer: **Bhai Jasvant Singh ਪੰਛੀ**