PipeOwl
A transformer-free semantic retrieval engine for multilingual wiki retrieval and passage expansion.
It combines:
For faster queries, copy the files from the `fast_query` folder into the repository root, overwriting the files there.
Initial loading takes approximately 4–5 seconds longer, but query time drops from around 25 seconds to under 1 second.
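The overwrite step can also be scripted. The following is a minimal sketch, assuming `fast_query` is a flat folder whose file names match the ones in the repository root; the function name `apply_fast_query` and the `normal_backup` folder are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def apply_fast_query(repo_root: Path, backup_dir_name: str = "normal_backup") -> list:
    """Copy every file in repo_root/fast_query over its counterpart in repo_root.

    Files that would be overwritten are first saved under
    repo_root/backup_dir_name (a hypothetical folder name) so that
    normal mode can be restored later.
    Returns the names of the files that were copied.
    """
    source_dir = repo_root / "fast_query"
    if not source_dir.is_dir():
        raise FileNotFoundError(f"{source_dir} does not exist")

    backup_dir = repo_root / backup_dir_name
    copied = []
    for src in sorted(source_dir.iterdir()):
        if not src.is_file():
            continue  # assumption: flat folder, so subdirectories are skipped
        dst = repo_root / src.name
        if dst.exists():
            # Keep the normal-mode file before it is overwritten.
            backup_dir.mkdir(exist_ok=True)
            shutil.copy2(dst, backup_dir / src.name)
        shutil.copy2(src, dst)
        copied.append(src.name)
    return copied
```

Calling `apply_fast_query(Path("PipeOwl-1.10.2-tw-wiki-rag"))` from the parent directory of the clone would switch to fast_query mode. Copying the files back from the backup folder restores normal mode.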
| metric | normal mode | fast_query mode |
|---|---|---|
| startup time (default) | ~7.3 s | ~13 s |
| query latency | 0.74–22 s | < 1 s ⚡🚀 |

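To reproduce startup and query timings like these on your own hardware, a small wall-clock harness is enough. This is a sketch only: `load_engine` and `query` below are hypothetical stand-ins, since the actual entry points of `engine.py` are not documented here.

```python
import time


def time_call(fn, *args, repeats: int = 1):
    """Call fn(*args) `repeats` times.

    Returns (result of the last call, list of wall-clock durations in seconds).
    """
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        durations.append(time.perf_counter() - start)
    return result, durations


if __name__ == "__main__":
    # Hypothetical stand-ins for the real loader and query function.
    def load_engine():
        return {"bang dream": "BanG_Dream!"}

    def query(engine, text):
        return engine.get(text)

    engine, load_times = time_call(load_engine)
    hit, query_times = time_call(query, engine, "bang dream", repeats=5)
    print(f"startup: {load_times[0]:.3f} s")
    print(f"query: min {min(query_times):.6f} s, max {max(query_times):.6f} s")
```

Reporting both the minimum and the maximum matters here, because normal-mode latency is given as a range rather than a single number.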
| item | value |
|---|---|
| token size | 734803 |
| embedding dim | 512 |
| storage format | safetensors (FP16) |
| all data size | ~5.44 GB |
| model data size | ~728 MB |
| wiki data size | ~4.73 GB |
| languages | multilingual |
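As a rough sanity check on these figures, the size of the embedding matrix can be estimated from the "token size" and "embedding dim" rows. The sketch below assumes the model stores a single 734803 × 512 matrix in FP16 (2 bytes per value); the actual tensor layout inside `pipeowl.safetensors` is not stated here.

```python
VOCAB_SIZE = 734_803   # "token size" row
EMBEDDING_DIM = 512    # "embedding dim" row
BYTES_PER_FP16 = 2     # FP16 stores each value in 2 bytes


def embedding_matrix_bytes(rows: int, dim: int, bytes_per_value: int) -> int:
    """Raw size of a dense rows x dim matrix, ignoring file headers."""
    return rows * dim * bytes_per_value


size_bytes = embedding_matrix_bytes(VOCAB_SIZE, EMBEDDING_DIM, BYTES_PER_FP16)
print(f"{size_bytes} bytes")
print(f"{size_bytes / 1e6:.1f} MB (decimal), {size_bytes / 2**20:.1f} MiB (binary)")
```

The raw matrix comes to 752,438,272 bytes, about 752 MB in decimal units or about 718 MiB. That is the same order of magnitude as the ~728 MB listed for model data. The table does not say which unit convention or which files that figure covers, so an exact match should not be expected.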
git clone https://huggingface.co/WangKaiLin/PipeOwl-1.10.2-tw-wiki-rag/
cd PipeOwl-1.10.2-tw-wiki-rag/
pip install numpy safetensors
python quickstart_wiki_merge.py
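If the quickstart script fails on import, it can help to confirm that the two packages from the `pip install` step are visible to the interpreter you are running. A minimal sketch using only the standard library (the helper name `missing_packages` is hypothetical):

```python
import importlib.util

REQUIRED = ("numpy", "safetensors")  # the packages installed in the step above


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("missing packages:", ", ".join(missing))
    else:
        print("all required packages found")
```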
Example:
Input:
bang dream
Output:
BanG_Dream!
See example.md for full examples.
PipeOwl-1.10.2-tw-wiki-rag/
├─ quickstart_wiki_merge.py   # entry point
├─ wiki_retriever_merge.py    # wiki data search
├─ entity_layer.py            # synonym layer logic
├─ tokenizer_priority.py      # tokenizer logic
├─ engine.py                  # PipeOwl core module entry point
├─ entity_alias.json          # synonym dataset (incomplete)
├─ phrase_lexicon.txt         # whole-phrase protection
├─ tokenizer.json             # core tokenizer
├─ pipeowl.safetensors        # core vector matrix storage
├─ zhwiki_clean.jsonl         # wiki dataset
└─ wiki_index_merge/
   ├─ wiki_merge_meta.json
   ├─ wiki_titles.txt
   ├─ wiki_title_tokens.jsonl
   ├─ wiki_title_tokens_offsets.json
   ├─ wiki_token_to_title_ids.json
   └─ wiki_title_offsets.json
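File names such as `wiki_token_to_title_ids.json` and `wiki_titles.txt` suggest a token-to-title inverted index. The schema of these files is not documented here, so the following is an illustrative sketch on toy data and not the repository's implementation. It shows only the lexical lookup idea and leaves out the embedding side of the engine. The function names and the toy titles are assumptions.

```python
from collections import Counter, defaultdict


def build_token_index(title_tokens):
    """Map each token to the sorted list of title ids whose title contains it."""
    index = defaultdict(set)
    for title_id, tokens in enumerate(title_tokens):
        for token in tokens:
            index[token].add(title_id)
    return {token: sorted(ids) for token, ids in index.items()}


def lookup(query_tokens, token_index, titles, top_k=3):
    """Rank titles by how many distinct query tokens they share."""
    hits = Counter()
    for token in set(query_tokens):
        for title_id in token_index.get(token, []):
            hits[title_id] += 1
    # Most shared tokens first; ties are broken by the lower title id.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [titles[title_id] for title_id, _ in ranked[:top_k]]


# Toy data; the real index files are far larger and may be structured differently.
titles = ["BanG_Dream!", "Dream_Theater", "Big_Bang"]
title_tokens = [["bang", "dream"], ["dream", "theater"], ["big", "bang"]]

token_index = build_token_index(title_tokens)
print(lookup(["bang", "dream"], token_index, titles))
```

With this toy data, the query tokens `bang` and `dream` both match `BanG_Dream!`, so it ranks first. The other two titles match one token each.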
Code and system components in this repository are released under the MIT License.
Included Wikipedia-derived retrieval data is subject to CC BY-SA 4.0. Please ensure attribution and share-alike compliance if redistributing derived wiki data.
Base model: WangKaiLin/PipeOwl-1.10-multilingual