---
license: apache-2.0
tags:
  - medical
  - neurology
  - neurosurgery
  - search
  - rag
  - query-rewriting
datasets:
  - miriad/miriad-4.4M
base_model:
  - google/flan-t5-small
language: en
pipeline_tag: text2text-generation
---

# NeuroRewriter: Neurology & Neurosurgery Query Optimizer

## 🩺 Model Description

**NeuroRewriter** is a fine-tuned version of `google/flan-t5-small` specialized for the medical domains of **Neurology** and **Neurosurgery**.

Its primary function is to act as a **query rewriter** in Retrieval-Augmented Generation (RAG) pipelines. It transforms verbose, natural-language user questions into concise, keyword-rich search strings. This "denoising" step strips away conversational filler to focus on high-value medical entities (symptoms, anatomy, drug names, procedures).

## 🚀 Intended Use & Best Practices

### 1. RAG Pipeline Integration

This model is designed to sit between the user and your vector database or search engine, as in the sketch below.

* **Input:** "What are the common complications after a craniotomy?"
* **Output:** "craniotomy complications post-op"
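
A minimal sketch of that wiring, assuming hypothetical `rewrite` and `search` callables: the former wraps the generation code shown under "How to Use" below, the latter is whatever sparse retriever you deploy.

```python
from typing import Callable, List

def rag_retrieve(
    question: str,
    rewrite: Callable[[str], str],       # NeuroRewriter wrapped as a function
    search: Callable[[str], List[str]],  # your sparse retriever (BM25, Elasticsearch, ...)
) -> List[str]:
    """Rewrite the verbose question into keywords, then search with the keywords."""
    keyword_query = rewrite(question)  # e.g. -> "craniotomy complications post-op"
    return search(keyword_query)
```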

### 2. Retrieval Strategy (Important)

This model is optimized for **keyword-based (sparse) retrieval** methods such as:

* **BM25**
* **TF-IDF**
* **SPLADE**
* **Elasticsearch / OpenSearch**

> **Note:** Because this model removes grammatical connectors ("stop words") to boost keyword density, it is **less effective** for pure dense vector retrieval (e.g., OpenAI embeddings), which often relies on full-sentence context. For best results, use a hybrid approach or pure BM25.
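
To illustrate the sparse pairing, here is a minimal, self-contained sketch using the third-party `rank_bm25` package (a convenience assumption, not a dependency of this model; the corpus is toy data):

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "Postoperative complications of craniotomy include infection and CSF leak.",
    "Glioblastoma multiforme treatment combines resection, radiotherapy, and temozolomide.",
    "Deep brain stimulation is an option for Parkinson disease motor symptoms.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Search with the rewritten keyword query, not the original verbose question.
query = "craniotomy complications post-op"
print(bm25.get_top_n(query.lower().split(), corpus, n=1)[0])
```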

## ⚠️ Limitations & Medical Disclaimer

**NOT FOR CLINICAL DIAGNOSIS.**

This model is intended for **informational retrieval purposes only**.

* It is not a doctor and should not be used to make medical decisions.
* While it improves search relevance, it may occasionally generate keywords that slightly alter the medical intent (e.g., confusing "acute" vs. "chronic" contexts).
* Always verify results against trusted medical sources.

## 📊 Training Data

This model was fine-tuned on a curated subset of the **MIRIAD** dataset (MIRIAD: A Large-Scale Dataset for Medical Information Retrieval and Answer Discovery).

* **License:** ODC-By 1.0
* **Attribution:** Zheng et al. (2025)

## 💻 How to Use

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model
model_name = "HugSena13/neroRewriter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 2. Prepare the input (include the prefix used in training!)
input_text = "extract search keywords: What are the treatment options for glioblastoma multiforme?"
inputs = tokenizer(input_text, return_tensors="pt")

# 3. Generate (adjust max_new_tokens if the output is cut off)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
)

# 4. Decode
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
print(result) |