---
title: AI Sikh Librarian
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: true
hardware: cpu-upgrade
license: cc-by-nc-nd-4.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/69a29db42927b6e81519bb2d/8V-N8NjKd-rFtBwZh3dxb.jpeg
short_description: A scholarly research tool for querying the SikhLibrary
---

# 📚 AI Sikh Librarian

A scholarly research tool for querying the [SikhLibrary](https://huggingface.co/datasets/jsdosanj/SikhLibrary): 758 million+ words of Sikh scriptures, history, literature, and research materials.

Built for **academic researchers and parchariks** who need fast, reliable access to primary sources with proper academic citations.

Built by Bhai Jasvant Singh ਪੰਛੀ in service to Sache Paatshaah.

---

### 🌗 Dark / Light Mode

Toggle between dark and light themes using the button in the top-right corner. The preference is saved in the browser across sessions.

## Modes

### 🔍 Research Mode: no LLM, no credits, always available

| Signal | Technology | What it catches |
|---|---|---|
| Semantic similarity | FAISS + all-MiniLM-L6-v2 | Conceptual queries, synonyms, related themes |
| Exact / lexical match | BM25 (bm25s) | Scripture titles, ਅੰਗ numbers, proper nouns |
| Score fusion | Reciprocal Rank Fusion | Best of both signals |
| Query enrichment | Romanisation → Gurmukhi | "waheguru" → ਵਾਹਿਗੁਰੂ |
| Source diversity | Per-source cap (15 max) | Prevents any single source dominating |
| Result diversity | MMR reranking (λ=0.7) | Diverse passages, not 5 from the same ਅੰਗ |
| Gurbani extraction | Regex field parser | Gurmukhi · Pronunciation · Translation · Explanation |
| Language labelling | langdetect | "Translation (Romanian)", "Explanation (English)" |
| Relevance display | Percentile rank | 100% = most relevant in this query (intuitive, higher = better) |
| Score tiers | Percentile labels | 🟢 Highly Relevant → ⚪ Peripheral |
| Gurbani deep links | ਅੰਗ → SikhiToTheMax | One-click source verification |
| Citations | Chicago Manual of Style 17th ed. | Ready for academic papers |

### 📖 Learn Mode: Qwen2.5-72B scholarly Q&A

- PhD-level theological and historical analysis in English
- All key terms in Punjabi Unicode (ਨਾਮ ਸਿਮਰਨ, ਸੰਤ-ਸਿਪਾਹੀ), never romanised
- Automatic ਅੰਗ hallucination verification: unverified citations are struck through
- Multi-turn conversation with 6-turn history

## Corpus Coverage

| Category | Contents |
|---|---|
| **ਗੁਰਬਾਣੀ** | ਸ੍ਰੀ ਗੁਰੂ ਗਰੰਥ ਸਾਹਿਬ ਜੀ, ਸ੍ਰੀ ਦਸਮ ਗ੍ਰੰਥ, ਸ੍ਰੀ ਸਰਬਲੋਹ ਗ੍ਰੰਥ, ਚਰਿਤ੍ਰੋਪਾਖਿਆਨ, ਭਾਈ ਨੰਦ ਲਾਲ ਗੋਯਾ |
| **ਗ੍ਰੰਥ** | ਸੂਰਜ ਪ੍ਰਕਾਸ਼, ਪੰਥ ਪ੍ਰਕਾਸ਼, ਮਹਾਨ ਕੋਸ਼, ਗੁਰਬਿਲਾਸ, ਤਨਖ਼ਾਹਨਾਮਾ |
| **ਸਟੀਕ** | ਫਰੀਦਕੋਟ ਵਾਲਾ ਟੀਕਾ, ਗੁਰੂ ਗਰੰਥ ਦਰਪਣ, ਦਸਮ ਗ੍ਰੰਥ ਟੀਕਾ (8 vols.) |
| **ਸਾਹਿਤ** | ਜਨਮ ਸਾਖੀਆਂ, ਜੰਗਨਾਮੇ, ਸ਼ਹੀਦ ਬਿਲਾਸ, ਪੁਰਾਤਨ ਜਨਮ ਸਾਖੀ |
| **ਖੋਜ** | Sikh Encyclopedia, SikhiWiki Archive, Ten Gurus, Gurdwaras Database |

## Dataset

[`jsdosanj/SikhLibrary`](https://huggingface.co/datasets/jsdosanj/SikhLibrary) · 331,771 chunks · 758M+ words · 5 categories: ਗੁਰਬਾਣੀ, ਗ੍ਰੰਥ, ਸਟੀਕ, ਸਾਹਿਤ, ਖੋਜ

## Storage bucket: `jsdosanj/SikhLibrarian-storage/index_output_v2/`

| File | Size | Builder |
|---|---|---|
| `faiss.index` | ~3.5 GB | `build_index_local_v2.py` |
| `doc_store.sqlite` | ~6.5 GB | `build_index_local_v2.py` |
| `meta.json` | ~180 MB | `build_index_local_v2.py` |
| `bm25.pkl` | ~200 MB | `build_bm25_local.py` (bm25s edition) |

---

## Architecture

```
query
 └─ sanitise + expand (romanisation map)
     ├─ FAISS vector search (semantic) ─┐
     └─ BM25 lexical search (exact)    ─┴─ RRF fusion
                                            └─ per-source cap (diversity)
                                                └─ MMR rerank (diversity)
                                                    └─ percentile calibration
                                                        └─ structured extraction
                                                            └─ lang detection + Chicago citations
```

## Performance (CPU Upgrade: 8 vCPU / 32 GB RAM)

- Research mode: < 500 ms per query for 100–300 concurrent users
- Embed cache: 10,000-entry LRU (~35–50% hit rate at scale)
- Embed pool: 6 worker processes (GIL-bypassing true parallelism)
- BM25 index: ~200 MB RAM, built once at startup (~60 s)

## Hardware tiers

| Feature | Free tier (2 vCPU / 16 GB) | CPU Upgrade (8 vCPU / 32 GB) |
|---|---|---|
| BM25 + FAISS hybrid | ✅ | ✅ |
| BM25 pkl load time | ~2 s | ~2 s |
| RRF + multi-query | ✅ | ✅ |
| Query expansion | ✅ | ✅ |
| Context window + snippets | ✅ | ✅ |
| Score tiers + ਅੰਗ links | ✅ | ✅ |
| Embed cache (10K LRU) | ✅ | ✅ |
| MMR reranking | ❌ disabled | ✅ enabled |
| Gradio max threads | 8 | 40 |
| Queue max size | 20 | 100 |
| Comfortable concurrent users | ~50 | ~300 |

Set the Space secret `FREE_TIER=true` on the free tier. Leave it unset on CPU Upgrade.

## License

**CC BY-NC-ND 4.0**: [creativecommons.org/licenses/by-nc-nd/4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

Attribution: **SikhLibrary Dataset** (`jsdosanj/SikhLibrary`) · ShabadOS · BaniDB · Sikhi.IO · Isher Micro Media · Maintainer: **Bhai Jasvant Singh ਪੰਛੀ**
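
## Sketch: RRF score fusion

The score-fusion step in Research Mode merges the FAISS and BM25 rankings with Reciprocal Rank Fusion. The snippet below is a minimal sketch of that idea, not code taken from `app.py`. The function name `rrf_fuse`, the chunk IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over every list it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value
    documented for this Space.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical chunk IDs, best hit first in each list.
faiss_hits = ["c12", "c7", "c3"]
bm25_hits = ["c7", "c99", "c12"]

fused = rrf_fuse([faiss_hits, bm25_hits])
for chunk_id, score in fused:
    print(f"{chunk_id}\t{score:.5f}")
```

Because RRF only looks at ranks, it needs no score normalisation between the cosine similarities from FAISS and the BM25 scores, which is one reason it suits hybrid retrieval.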
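
## Sketch: MMR reranking

Research Mode lists MMR reranking with λ=0.7 as its result-diversity step. The following is a rough, dependency-free sketch of Maximal Marginal Relevance under that setting. The names `cosine` and `mmr_rerank` and the toy 3-dimensional vectors are hypothetical; the real app presumably works on all-MiniLM-L6-v2 embeddings and may be implemented differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query_vec, candidates, top_k=5, lam=0.7):
    """Return chunk IDs in MMR order.

    candidates: list of (chunk_id, vector) pairs.
    Score = lam * relevance - (1 - lam) * max similarity to anything
    already selected. lam=0.7 mirrors the value quoted in this README.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_k:

        def mmr_score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            redundancy = max(
                (cosine(vec, chosen_vec) for _, chosen_vec in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _ in selected]


# Toy data: "a" and "b" are near-duplicates, "c" covers a different aspect.
query = [1.0, 1.0, 0.0]
candidates = [
    ("a", [1.0, 0.1, 0.0]),
    ("b", [1.0, 0.0, 0.0]),
    ("c", [0.0, 1.0, 0.3]),
]
order = mmr_rerank(query, candidates, top_k=3)
print(order)
```

In this toy case plain relevance ordering would put "b" second, but MMR promotes "c" because "b" is almost identical to the already selected "a".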
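
## Sketch: reading `FREE_TIER`

The Hardware tiers table implies that a single `FREE_TIER` secret switches MMR reranking, the Gradio thread count, and the queue size. One way such a switch could be read at startup is sketched below. `TierConfig` and `load_tier_config` are hypothetical names, and the exact parsing of the secret in `app.py` is an assumption; only the numeric values come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    mmr_enabled: bool
    max_threads: int
    queue_max_size: int


def load_tier_config(env=None):
    """Pick settings from the FREE_TIER secret (values from the README table).

    Treating any value other than "true" (case-insensitive) as unset is an
    assumption of this sketch.
    """
    env = os.environ if env is None else env
    free_tier = env.get("FREE_TIER", "").strip().lower() == "true"
    if free_tier:
        # Free tier: 2 vCPU / 16 GB, MMR reranking disabled.
        return TierConfig(mmr_enabled=False, max_threads=8, queue_max_size=20)
    # CPU Upgrade: 8 vCPU / 32 GB, MMR reranking enabled.
    return TierConfig(mmr_enabled=True, max_threads=40, queue_max_size=100)


config = load_tier_config()
print(config)
```

The resulting values would then plausibly be passed to Gradio's queue and launch settings, although how the app wires them up is not documented here.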