---
license: apache-2.0
tags:
- medical
- neurology
- neurosurgery
- search
- rag
- query-rewriting
datasets:
- miriad/miriad-4.4M
base_model:
- google/flan-t5-small
language: en
pipeline_tag: summarization
---

# NeuroRewriter: Neurology & Neurosurgery Query Optimizer

## 🩺 Model Description

**NeuroRewriter** is a fine-tuned version of `google/flan-t5-small` specialized for the medical domains of **Neurology** and **Neurosurgery**.

Its primary function is to act as a **Query Rewriter** in RAG (Retrieval-Augmented Generation) pipelines. It transforms verbose, natural-language user questions into concise, keyword-rich search strings. This "denoising" step strips away conversational filler so that the query focuses on high-value medical entities (symptoms, anatomy, drug names, procedures).

## 🚀 Intended Use & Best Practices

### 1. RAG Pipeline Integration

This model is designed to sit between the user and your vector database or search engine.

* **Input:** "What are the common complications after a craniotomy?"
* **Output:** "craniotomy complications post-op"

### 2. Retrieval Strategy (Important)

This model is optimized for **Keyword-Based Retrieval (Sparse Retrieval)** methods such as:

* **BM25**
* **TF-IDF**
* **SPLADE**
* **Elasticsearch / OpenSearch**

> **Note:** Because this model removes grammatical connectors ("stop words") to boost keyword density, it is **less effective** for pure dense vector retrieval (such as OpenAI embeddings), which often relies on full-sentence context. For best results, use a hybrid approach or pure BM25.

## ⚠️ Limitations & Medical Disclaimer

**NOT FOR CLINICAL DIAGNOSIS.** This model is intended for **information retrieval purposes only**.

* It is not a doctor and should not be used to make medical decisions.
* While it improves search relevance, it may occasionally generate keywords that slightly alter the medical intent (e.g., confusing "acute" vs. "chronic" contexts).
* Always verify results against trusted medical sources.

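
## 🔎 Sketch: Sparse Retrieval with a Rewritten Query

To make the retrieval strategy described under "Intended Use & Best Practices" more concrete, here is a minimal, self-contained sketch of how a rewritten query might be scored against documents with BM25. It is an illustration under stated assumptions, not part of the model or its training setup: the `tokenize` and `bm25_scores` helpers, the three-document corpus, and the `k1`/`b` values are all hypothetical, and a production system would more likely rely on Elasticsearch, OpenSearch, or an established BM25 library. The query string is the example output shown earlier in this card.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (inner hyphens preserved)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "Complications after craniotomy include infection, hemorrhage, and seizures.",
    "Levetiracetam dosing for focal epilepsy in adults.",
    "Lumbar puncture technique and post-procedure headache.",
]

# Rewritten query taken from the input/output example in this card.
rewritten_query = "craniotomy complications post-op"

scores = bm25_scores(rewritten_query, corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

Because BM25 only rewards exact term overlap, a compact keyword string spends its entire score budget on medical entities rather than on words like "what" or "are", which is the motivation for pairing this model with sparse retrieval.
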
## 📚 Training Data

This model was fine-tuned on a curated subset of the **MIRIAD dataset** (MIRIAD: A Large-Scale Dataset for Medical Information Retrieval and Answer Discovery).

* **License:** ODC-By 1.0
* **Attribution:** Zheng et al. (2025)

## 💻 How to Use

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model
model_name = "HugSena13/neroRewriter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 2. Prepare the input (include the prefix used in training!)
input_text = "extract search keywords: What are the treatment options for glioblastoma multiforme?"
inputs = tokenizer(input_text, return_tensors="pt")

# 3. Generate (adjust max_new_tokens if the output is cut off)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
)

# 4. Decode
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```