---
license: apache-2.0
tags:
  - medical
  - neurology
  - neurosurgery
  - search
  - rag
  - query-rewriting
datasets:
  - miriad/miriad-4.4M
base_model:
  - google/flan-t5-small
language: en
pipeline_tag: text2text-generation
---

# NeuroRewriter: Neurology & Neurosurgery Query Optimizer

## 🩺 Model Description

**NeuroRewriter** is a fine-tuned version of `google/flan-t5-small` specialized for the medical domains of **Neurology** and **Neurosurgery**.

Its primary function is to act as a **query rewriter** in Retrieval-Augmented Generation (RAG) pipelines. It transforms verbose, natural-language user questions into concise, keyword-rich search strings. This "denoising" step strips away conversational filler to focus on high-value medical entities (symptoms, anatomy, drug names, procedures).

## 🚀 Intended Use & Best Practices

### 1. RAG Pipeline Integration

This model is designed to sit between the user and your vector database or search engine, as in the sketch below.

* **Input:** "What are the common complications after a craniotomy?"
* **Output:** "craniotomy complications post-op"
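
A minimal sketch of that wiring, assuming hypothetical `rewrite` and `search` callables: the former wraps the generation code shown under "How to Use" below, the latter is whatever sparse retriever you deploy.

```python
from typing import Callable, List

def rag_retrieve(
    question: str,
    rewrite: Callable[[str], str],       # NeuroRewriter wrapped as a function
    search: Callable[[str], List[str]],  # your sparse retriever (BM25, Elasticsearch, ...)
) -> List[str]:
    """Rewrite the verbose question into keywords, then search with the keywords."""
    keyword_query = rewrite(question)  # e.g. -> "craniotomy complications post-op"
    return search(keyword_query)
```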

### 2. Retrieval Strategy (Important)

This model is optimized for **keyword-based (sparse) retrieval** methods such as:

* **BM25**
* **TF-IDF**
* **SPLADE**
* **Elasticsearch / OpenSearch**

> **Note:** Because this model removes grammatical connectors ("stop words") to boost keyword density, it is **less effective** for pure dense vector retrieval (e.g., OpenAI embeddings), which often relies on full-sentence context. For best results, use a hybrid approach or pure BM25.
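
To illustrate the sparse pairing, here is a minimal, self-contained sketch using the third-party `rank_bm25` package (a convenience assumption, not a dependency of this model; the corpus is toy data):

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "Postoperative complications of craniotomy include infection and CSF leak.",
    "Glioblastoma multiforme treatment combines resection, radiotherapy, and temozolomide.",
    "Deep brain stimulation is an option for Parkinson disease motor symptoms.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Search with the rewritten keyword query, not the original verbose question.
query = "craniotomy complications post-op"
print(bm25.get_top_n(query.lower().split(), corpus, n=1)[0])
```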

## ⚠️ Limitations & Medical Disclaimer

**NOT FOR CLINICAL DIAGNOSIS.**

This model is intended for **informational retrieval purposes only**.

* It is not a doctor and should not be used to make medical decisions.
* While it improves search relevance, it may occasionally generate keywords that slightly alter the medical intent (e.g., confusing "acute" vs. "chronic" contexts).
* Always verify results against trusted medical sources.

## 📊 Training Data

This model was fine-tuned on a curated subset of the **MIRIAD** dataset (MIRIAD: A Large-Scale Dataset for Medical Information Retrieval and Answer Discovery).

* **License:** ODC-By 1.0
* **Attribution:** Zheng et al. (2025)

## 💻 How to Use

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model
model_name = "HugSena13/neroRewriter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 2. Prepare the input (include the prefix used in training!)
input_text = "extract search keywords: What are the treatment options for glioblastoma multiforme?"
inputs = tokenizer(input_text, return_tensors="pt")

# 3. Generate (adjust max_new_tokens if the output is cut off)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
)

# 4. Decode
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
print(result) |