PipeOwl-1.10.2-tw-wiki-rag

A transformer-free semantic retrieval engine for multilingual wiki retrieval and passage expansion.

It combines:

a static embedding field
entity normalization
phrase-aware tokenization
merge-based title retrieval
local passage spread from matched wiki entries
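
To make the first stages concrete, here is a minimal sketch of alias-based entity normalization followed by phrase-aware tokenization. It is an illustration only: the `ALIASES` and `PHRASES` contents, the function names, and the greedy longest-match strategy are assumptions, and the actual formats of `entity_alias.json` and `phrase_lexicon.txt` may differ.

```python
# Hypothetical stand-ins for entity_alias.json and phrase_lexicon.txt.
ALIASES = {"bang dream": "BanG_Dream!"}
PHRASES = {"new york", "machine learning"}
MAX_PHRASE_WORDS = 3  # longest phrase (in words) that the sketch tries to match


def normalize_entity(query: str) -> str:
    """Map a known alias to its canonical title; otherwise return the query unchanged."""
    key = " ".join(query.lower().split())  # lowercase and collapse whitespace
    return ALIASES.get(key, query)


def phrase_aware_tokenize(text: str) -> list[str]:
    """Greedy longest match: keep lexicon phrases whole, split the rest on whitespace."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, down to two words.
        for span in range(min(MAX_PHRASE_WORDS, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + span])
            if candidate in PHRASES:
                tokens.append(candidate)
                i += span
                break
        else:
            # No phrase starts here, so emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens
```

Whitespace splitting is used here only to keep the sketch short; it would not segment Chinese text, which is presumably what the bundled tokenizer handles.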

Performance Mode Switch

For faster queries, copy the files from the `fast_query` folder into the root directory, overwriting the existing files.

Initial loading takes longer (approximately 4–5 seconds more), but query time drops from around 25 seconds to under 1 second.

| Metric | Normal mode | fast_query mode |
| --- | --- | --- |
| Startup time (default) | ~7.3 s | ~13 s |
| Query latency | 0.74–22 s | < 1 s ⚡🚀 |

Current Limitations

Query latency is still dominated by Python-side retrieval logic.
entity_alias.json is incomplete and still growing.
phrase_lexicon.txt is heuristic and hand-curated.
Passage spread is a lightweight local retrieval step, not a full global chunk index.
Wiki freshness depends on the included zhwiki dump.
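
As a rough illustration of what a local passage spread could look like, the sketch below ranks the passages of one already-matched article by token overlap with the query, without consulting any global chunk index. The paragraph splitting rule, the overlap score, and the function names are all assumptions made for this example and are not taken from the repository code.

```python
def split_passages(article_text: str) -> list[str]:
    """Split an article into passages on blank lines (an assumed convention)."""
    return [p.strip() for p in article_text.split("\n\n") if p.strip()]


def spread_passages(article_text: str, query_tokens: list[str], top_n: int = 1) -> list[str]:
    """Return up to top_n passages of one article, ranked by token overlap with the query."""
    query = {t.lower() for t in query_tokens}
    scored = []
    for position, passage in enumerate(split_passages(article_text)):
        overlap = len(query & set(passage.lower().split()))
        # Negative overlap sorts the best passage first; position breaks ties.
        scored.append((-overlap, position, passage))
    scored.sort()
    # Drop passages that share no token with the query.
    return [passage for neg_overlap, _, passage in scored[:top_n] if neg_overlap < 0]
```

Because only the matched article is scanned, the cost stays proportional to that article's length, which is the trade-off the limitation above describes.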

Architecture

Static embedding table (V × D)
Aligned vocabulary index
Linear scoring
Pluggable decoder stage
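
The components above can be sketched in a few lines. The example uses a toy table (V = 4, D = 3) with made-up values, mean pooling for the query vector, and cosine similarity as the linear score; all of these choices, and every name in the code, are assumptions for illustration rather than a description of `engine.py`. It is written in plain Python so it runs without dependencies; with NumPy the scoring step would be a single matrix-vector product.

```python
import math

# Toy vocabulary and static embedding table (V x D); values are invented.
VOCAB = ["cat", "dog", "car", "truck"]
TABLE = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.4],
    [0.1, 0.8, 0.5],
]
# Aligned vocabulary index: token -> row number in TABLE.
TOKEN_TO_ROW = {token: row for row, token in enumerate(VOCAB)}


def unit(vec):
    """L2-normalize a vector; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def embed_query(tokens):
    """Average the rows of known tokens, then normalize (assumed pooling)."""
    rows = [TABLE[TOKEN_TO_ROW[t]] for t in tokens if t in TOKEN_TO_ROW]
    if not rows:
        return [0.0] * len(TABLE[0])
    mean = [sum(column) / len(rows) for column in zip(*rows)]
    return unit(mean)


def score_all(query_vec):
    """Linear scoring: one dot product per table row."""
    return [sum(q * x for q, x in zip(query_vec, unit(row))) for row in TABLE]


def top_k(tokens, k=2):
    """Return the k best (token, score) pairs for the query tokens."""
    scores = score_all(embed_query(tokens))
    ranked = sorted(range(len(VOCAB)), key=lambda i: scores[i], reverse=True)
    return [(VOCAB[i], scores[i]) for i in ranked[:k]]
```

Since no model is run at query time, the cost of scoring is one pass over the table, which is why the table size dominates memory rather than compute.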

Model Specs

| Item | Value |
| --- | --- |
| Token size | 734803 |
| Embedding dim | 512 |
| Storage format | safetensors (FP16) |
| All data size | ~5.44 GB |
| Model data size | ~728 MB |
| Wiki data size | ~4.73 GB |
| Languages | multilingual |
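
As a sanity check on these numbers, the size of the embedding matrix alone can be estimated from the token count, the embedding dimension, and 2 bytes per FP16 value. This is a back-of-the-envelope sketch: it ignores the safetensors header and any other model files, and the constant and function names are chosen for this example.

```python
VOCAB_SIZE = 734_803   # "token size" from the specs
EMBED_DIM = 512        # "embedding dim" from the specs
BYTES_PER_FP16 = 2     # one half-precision float


def embedding_bytes(vocab_size: int, dim: int, bytes_per_value: int = BYTES_PER_FP16) -> int:
    """Raw size of a dense vocab_size x dim matrix, excluding file headers."""
    return vocab_size * dim * bytes_per_value


size = embedding_bytes(VOCAB_SIZE, EMBED_DIM)
print(f"{size:,} bytes = {size / 2**20:.1f} MiB = {size / 10**9:.3f} GB")
```

The matrix works out to 752,438,272 bytes, or about 717.6 MiB, which is in the neighbourhood of the ~728 MB listed for model data. The remaining difference may come from the tokenizer and metadata, though that is a guess.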

Quickstart

Tested on Python 3.10+

Unzip pipeowl1.10.2.zip

git clone https://huggingface.co/WangKaiLin/PipeOwl-1.10.2-tw-wiki-rag/
cd PipeOwl-1.10.2-tw-wiki-rag/

pip install numpy safetensors

python quickstart_wiki_merge.py

Example:

Input: bang dream

Output:

normalized query: BanG_Dream!
top title: BanG_Dream!
knowledge spread: a matched passage from the wiki article

See example.md for full examples.

Repository Structure

PipeOwl-1.10.2-tw-wiki-rag/
├─ quickstart_wiki_merge.py # entry point
├─ wiki_retriever_merge.py # wiki data search
├─ entity_layer.py # alias (synonym) layer logic
├─ tokenizer_priority.py # tokenizer (word segmentation) logic
├─ engine.py # PipeOwl core module entry point
├─ entity_alias.json # alias dataset (incomplete)
├─ phrase_lexicon.txt # whole-phrase protection
├─ tokenizer.json # core tokenizer
├─ pipeowl.safetensors # core embedding matrix storage
├─ zhwiki_clean.jsonl # wiki dataset
└─ wiki_index_merge/ 
   ├─ wiki_merge_meta.json 
   ├─ wiki_titles.txt 
   ├─ wiki_title_tokens.jsonl
   ├─ wiki_title_tokens_offsets.json
   ├─ wiki_token_to_title_ids.json
   └─ wiki_title_offsets.json
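
The file names under `wiki_index_merge/` suggest a token-to-title inverted index. Under that assumption, merge-based title retrieval could work roughly as sketched below: look up the posting list of each query token, merge the lists, and rank titles by how many query tokens they match. The in-memory data, the ranking rule, and the function name are hypothetical, and the real JSON layouts are not documented here.

```python
from collections import Counter

# Hypothetical stand-ins for wiki_titles.txt and wiki_token_to_title_ids.json.
TITLES = ["BanG_Dream!", "Dream_Theater", "Big_Bang"]
TOKEN_TO_TITLE_IDS = {
    "bang": [0, 2],
    "dream": [0, 1],
    "theater": [1],
    "big": [2],
}


def merge_title_candidates(query_tokens, top_n=3):
    """Merge the posting lists of all query tokens and rank titles by matched-token count."""
    hits = Counter()
    for token in set(query_tokens):  # count each distinct token once
        for title_id in TOKEN_TO_TITLE_IDS.get(token, []):
            hits[title_id] += 1
    # Most matched tokens first; ties fall back to the lower title id.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [(TITLES[title_id], count) for title_id, count in ranked[:top_n]]
```

With the toy data, the query tokens `["bang", "dream"]` rank `BanG_Dream!` first because it is the only title that appears in both posting lists.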

License

Code and system components in this repository are released under the MIT License.

Included Wikipedia-derived retrieval data is subject to CC BY-SA 4.0. Please ensure attribution and share-alike compliance if redistributing derived wiki data.

