---
license: apache-2.0
tags:
- medical
- neurology
- neurosurgery
- search
- rag
- query-rewriting
datasets:
- miriad/miriad-4.4M
base_model:
- google/flan-t5-small
language: en
pipeline_tag: summarization
---

# NeuroRewriter: Neurology & Neurosurgery Query Optimizer

## 🩺 Model Description
**NeuroRewriter** is a fine-tuned version of `google/flan-t5-small` specialized for the medical domains of **Neurology** and **Neurosurgery**. 

Its primary function is to act as a **Query Rewriter** in RAG (Retrieval-Augmented Generation) pipelines. It transforms verbose, natural language user questions into concise, keyword-rich search strings. This "denoising" process strips away conversational fluff to focus on high-value medical entities (symptoms, anatomy, drug names, procedures).

## 🚀 Intended Use & Best Practices

### 1. RAG Pipeline Integration
This model is designed to sit between the User and your Vector Database/Search Engine.
* **Input:** "What are the common complications after a craniotomy?"
* **Output:** "craniotomy complications post-op"

### 2. Retrieval Strategy (Important)
This model is optimized for **Keyword-Based Retrieval (Sparse Retrieval)** methods such as:
* **BM25**
* **TF-IDF**
* **SPLADE**
* **Elasticsearch / OpenSearch**

> **Note:** Because this model removes grammatical connectors ("stop words") to boost keyword density, it is **less effective** for pure dense-vector retrieval (e.g., OpenAI embeddings), which often relies on full-sentence context. For best results, use a hybrid approach or pure BM25.
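
Since the rewriter's output is meant to feed a sparse retriever, the hand-off can be sketched with a minimal pure-Python Okapi BM25 scorer. This is an illustrative sketch only: the mini-corpus and the rewritten query below are made-up placeholders, not model outputs, and a production system would use a real engine such as Elasticsearch or `rank_bm25` instead.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    # Document frequency: in how many documents each term appears
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores

# Illustrative mini-corpus of medical passages (placeholder data)
corpus = [
    "craniotomy complications include infection seizure and csf leak",
    "glioblastoma multiforme treatment surgery radiotherapy temozolomide",
    "deep brain stimulation for parkinson disease tremor",
]
docs = [doc.split() for doc in corpus]

# Keyword string of the kind the rewriter produces (example from above)
query = "craniotomy complications post-op".split()
scores = bm25_scores(query, docs)
best = corpus[max(range(len(scores)), key=scores.__getitem__)]
print(best)
```

Because the query is already a dense keyword string, term-matching scorers like this reward it directly; a dense embedder would instead have to cope with the missing sentence structure.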

## ⚠️ Limitations & Medical Disclaimer
**NOT FOR CLINICAL DIAGNOSIS.**
This model is intended for **informational retrieval purposes only**.
* It is not a doctor and should not be used to make medical decisions.
* While it improves search relevance, it may occasionally generate keywords that slightly alter the medical intent (e.g., confusing "acute" vs. "chronic" contexts).
* Always verify results against trusted medical sources.

## 📚 Training Data
This model was fine-tuned on a curated subset of the **MIRIAD dataset** (MIRIAD: A Large-Scale Dataset for Medical Information Retrieval and Answer Discovery).
* **License:** ODC-By 1.0
* **Attribution:** Zheng et al. (2025)

## 💻 How to Use
```python
# pip install transformers torch

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model
model_name = "HugSena13/neroRewriter" 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 2. Prepare the input (Include the prefix used in training!)
input_text = "extract search keywords: What are the treatment options for glioblastoma multiforme?"
inputs = tokenizer(input_text, return_tensors="pt")

# 3. Generate (Adjust max_new_tokens if output is cut off)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True
)

# 4. Decode
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```