Model Card for hamsinimk/doctor_note_summarization_llm

This model takes doctor's notes as input and summarizes them into patient-friendly notes.

Model Details

1. Introduction

A large gap in healthcare communication and health literacy exists between physicians and patients. As a patient, I often run into the issue of long doctor's notes uploaded to my patient portal that I cannot understand. The experience is frustrating because the notes are not only long but also full of technical jargon surrounding the diagnosis, which is overwhelming. I often turn to doctors in my family to translate the notes and explain whether there are next steps and what they are. Instead of having a middleman translate the notes, I'm hoping that an LLM can take doctor's notes as input and summarize them into short, simple notes.

I think current LLMs do need training for this task, as it is a niche topic, and I want to ensure that accuracy and key details are preserved when providing simple summaries to users/patients. LLMs that have not been trained specifically on clinical/doctor's notes, or on a large enough dataset to learn their context and style of writing, may gloss over key details. In my own experience, LLMs are great at summarizing but can lack specificity or leave out information at times. As data scientist Sahin Ahmed notes in a Medium post, LLMs in general, including those that implement RAG systems, are not without disadvantages. One failure point Ahmed describes is "context limitation," which happens when many documents are passed to the model, forcing the system to "consolidate them to fit the LLM's input limits, which may lead to truncation or selective prioritization, potentially leaving out crucial information" (Ahmed, 2024). In this medical use case, it is extremely important to maintain accuracy so that key details are not glossed over and the model's summarized output can be relied on for next steps. To ensure this accuracy, I think an LLM dedicated to this use case and trained specifically on doctor's notes and summaries is key, as it also avoids noise from unrelated training data.

2. Data

After looking into training data generation, I found that pairs of long doctor's notes and patient summaries are hard to come by. Because of this, I generated the summaries synthetically from existing doctor's notes. I found a Hugging Face dataset of 30,000 doctor's notes (PMC-Patients), which I subset to 1,000 rows; no random seed was used for this subset. As a note, I used google/gemma-3-4b-it as the model for both data generation and training. For data generation, I prompted the model with the system and user prompts included in the "Prompt Format" section below, looping through each doctor's note and generating a summary according to the prompt instructions. After generating the summaries, I saved the doctor's-note/summary pairs to a .csv file for later use in training. I have included the full dataset in this repo for users to explore other splits, along with the train, validation, and test splits used for this model. I employed an 80/10/10 train/validation/test split to ensure adequate training and evaluation: the model was trained on 800 note/summary pairs and validated/tested on the remaining 200.
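For illustration, below is a minimal sketch of this generation loop, assuming the note text lives in a `note` column of a pandas DataFrame and that the gemma-3 chat template accepts a system role; the file and column names are hypothetical.

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the generation model (also the base model used later for finetuning).
model_id = "google/gemma-3-4b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)

system_prompt = "Imagine you are a useful medical assistant..."  # full prompt in the "Prompt Format" section
user_prompt = "Now provide a 3-5 sentence summary for the doctor's note written for a patient's understanding."

notes = pd.read_csv("pmc_patients_subset.csv")  # hypothetical file: the 1,000-row PMC-Patients subset
summaries = []
for note in notes["note"]:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{user_prompt}\n\n{note}"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens (the summary itself).
    summaries.append(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))

# Save the note/summary pairs, then split 80/10/10 (800/100/100 rows).
pairs = pd.DataFrame({"note": notes["note"], "summary": summaries})
pairs.to_csv("note_summary_pairs.csv", index=False)
train, validation, test = pairs.iloc[:800], pairs.iloc[800:900], pairs.iloc[900:]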

3. Methodology

Based on previous experimentation with LoRA, the method does reasonably well at increasing a model's ability to perform medical reasoning through finetuning, as shown by increasing accuracy on a medical training task. With LoRA, the model can actually change how it reasons, rather than just restating the context of the training task (which is what prompt tuning does). Because it updates a subset of weight matrices using low-rank adaptation, modifying the attention heads, the model can better learn the reasoning patterns needed to answer more complex questions. To prevent catastrophic forgetting and possibly increase overall accuracy, the number of training epochs was increased to 3. Based on these factors, for a complex medical dataset/reasoning task, I chose LoRA as the appropriate finetuning method.

After trying 3 hyperparameter combinations (low-, medium-, and high-capacity LoRA), the medium- and high-capacity configurations performed very similarly in terms of validation and training loss. It did not make sense to add more parameters with high-capacity LoRA, so I went forward with medium-capacity LoRA, which provided essentially the same performance. Medium-capacity LoRA used r = 32, alpha = 64, and dropout of 15%. During training, the number of epochs was set to 3, the learning rate to 0.00001 (1e-5), and the number of evaluation steps to 200. Furthermore, the auto_find_batch_size parameter was not used; instead, per_device_train_batch_size and per_device_eval_batch_size were both set to 1. As a note, evaluating every 200 steps means that validation and training loss are calculated every 200 steps out of 800 steps per epoch (the training data size at batch size 1), or 2,400 steps total across all 3 epochs.
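For reference, here is a minimal sketch of the medium-capacity LoRA configuration and training arguments described above, using the peft and transformers libraries; the target modules and output directory are assumptions, since they are not listed in this card.

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# Medium-capacity LoRA: r=32, alpha=64, 15% dropout (values from this card).
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.15,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # wraps the base gemma-3 model

# Training settings from this card: 3 epochs, lr 1e-5, batch size 1, eval every 200 steps.
training_args = TrainingArguments(
    output_dir="doctor-note-summarization",  # hypothetical path
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    eval_strategy="steps",
    eval_steps=200,
)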

4. Evaluation

| Model | MMLU Philosophy (accuracy) | medqa_4options (accuracy) | XSum (BERTScore) | Test split (BERTScore) |
|---|---|---|---|---|
| google/gemma-3-4b-it | 0.63 | 0.33 | 0.79 | 0.87 |
| doctor-note-summarization | 0.60 | 0.40 | 0.79 | 0.90 |
| Qwen/Qwen2.5-3B-Instruct | 0.67 | 0.53 | 0.77 | 0.84 |
| mistralai/Mistral-7B-Instruct-v0.2 | 0.77 | 0.37 | 0.77 | 0.85 |

To benchmark the model, general, medical-reasoning, and summarization-specific benchmarks were utilized. For the general benchmark, Massive Multitask Language Understanding (MMLU) Philosophy was chosen (Caballar & Stryker, 2025), and for the summarization-specific benchmark, Extreme Summarization (XSum) was used to evaluate the model's ability to generate effective summaries/abstracts when given long inputs that may be unstructured or contain lots of technical language (Narayan et al., 2018). My general benchmark plan was as follows:

medqa_4options (Domain/task specific): Assess the model's ability to perform medical reasoning on complex multiple-choice questions and to understand medical information

MMLU Philosophy: Assess the model's reasoning capability and breadth of knowledge in philosophy (Chugani, 2025) with multiple-choice questions

XSum: Assess the model's ability to generate concise and accurate one-sentence summaries of long inputs without simply extracting text from the input (Narayan et al., 2018). This benchmark focuses on BBC News articles. The full XSum dataset is large, so a subset of 1,000 rows was used for benchmarking; the subset has been uploaded to the files section of this repository.
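For the multiple-choice benchmarks, a minimal sketch using the lm-evaluation-harness Python API is shown below; the task names follow recent harness releases and should be treated as assumptions.

import lm_eval

# Accuracy benchmarks: MMLU Philosophy and MedQA with 4 answer options.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=hamsinimk/doctor_note_summarization_llm,dtype=bfloat16",
    tasks=["mmlu_philosophy", "medqa_4options"],
    batch_size=1,
)
print(results["results"])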

I chose Qwen/Qwen2.5-3B-Instruct and mistralai/Mistral-7B-Instruct-v0.2 as the comparison models since they are similar in size to the google/gemma-3-4b-it baseline model I used. These models can also perform summarization and work with long-context inputs, which relates to my training task. I initially considered these models alongside the baseline when deciding which performed best with few-shot prompting, and I went with gemma-3 since it provided the most succinct outputs. Looking at the table above, the doctor-note-summarization model performed best on the test split and, as expected, matched the baseline while beating both comparison models on XSum. However, it had slightly lower accuracy on medqa_4options than the Qwen model and slightly lower accuracy on MMLU Philosophy than both comparison models. The lower accuracy on these benchmarks could be because the model is specialized in summarizing doctor's notes and understanding clinical data and summarization patterns: it does not focus on diagnosing a patient, which would exercise reasoning more directly, and its medical-note-centered specialization makes it perform a little worse on MMLU Philosophy.

As a note, I used BERTScore as the evaluation metric for XSum and the test split, while accuracy was used for MMLU Philosophy and medqa_4options (lm-eval benchmarks).
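A minimal sketch of the BERTScore computation for the summarization evaluations, using the bert-score package; the lists shown are placeholders for the model outputs and reference summaries.

from bert_score import score

candidates = ["..."]  # summaries generated by the model
references = ["..."]  # reference summaries (XSum references or the synthetic test-split summaries)

# BERTScore returns precision, recall, and F1 per pair; F1 is the value reported above.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())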

5. Usage and Intended Uses

This model is meant to be used to summarize doctor's notes. More specifically, it was trained to condense long doctor's notes into 3-5 sentence, patient-friendly summaries. It could be used for other long medical text similar in length to a doctor's note; however, the model is most familiar with the language of a note describing the patient's diagnosis, health concerns, next steps, and health progression. Overall, the model is meant to be used when a patient receives a long visit note from the doctor that is overwhelming and filled with medical jargon. Passing that note as input returns a quick synopsis that omits most medical jargon and keeps just the key details. This can easily be used to understand the condition and next steps for yourself or a family member.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the finetuned summarization model and its tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained("hamsinimk/doctor_note_summarization_llm")
model = AutoModelForCausalLM.from_pretrained(
    "hamsinimk/doctor_note_summarization_llm",
    device_map="auto",            # place weights on available GPU(s)/CPU automatically
    torch_dtype=torch.bfloat16,   # half precision to reduce memory use
)
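Once loaded, the model can be prompted with the system and user prompts from the "Prompt Format" section below. A minimal sketch, assuming the chat template accepts a system role; the generation settings are illustrative, not the exact values used:

doctor_note = "..."  # paste the full doctor's note here

messages = [
    {"role": "system", "content": "Imagine you are a useful medical assistant..."},  # full system prompt below
    {"role": "user", "content": "Now provide a 3-5 sentence summary for the doctor's note "
                                "written for a patient's understanding.\n\n" + doctor_note},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens (the patient-friendly summary).
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))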

6. Prompt Format

The prompt format is two-fold, as both a system and a user prompt are used for this model. This way, the model is given clear instructions on how to act and how to word the output (system prompt), and is also guided on the length and purpose of the output (user prompt). The format used to train the model and to generate summaries for the test data is shown below as an example.

System Prompt: Imagine you are a useful medical assistant that is trying to summarize doctor notes that were taken during patient visits into patient friendly summaries that are 3-5 sentences long.\nThe goal is just to summarize the given doctor's note and output a 3-5 sentence summary that captures key details of the note without too much medical jargon.

User Prompt: Now provide a 3–5 sentence summary for the doctor's note written for a patient's understanding.

Example Doctor's Note for Prompt:

A 70-year-old woman presented in November 2017 to the Emergency Department at Skåne University Hospital, Sweden, due to the rapid onset of fever, shivers, and a suspected skin infection. She had a previous medical history of left-sided ductal breast cancer with lymph node involvement in 1999, which was treated chronologically with neoadjuvant chemotherapy, partial mastectomy, axillary lymph node dissection, and radiation therapy. In addition, in 2001, a right-sided localised ductal breast cancer in situ was identified and was treated surgically with a partial mastectomy. Secondary to her lymph node dissection, she developed lymphoedema of her left arm, which had been continuously treated with compression stockings. The patient was on treatment with an ACE inhibitor and a beta-blocker due to hypertension, and in addition, she had a known systolic murmur, characterized as physiological, as transthoracic echocardiographs in 2011 and 2017 were normal. Since her surgery in 1999, on a total of six occasions prior to her last and seventh visit, of which the first episode occurred in 2008, she had been treated for erysipelas in her left upper arm. The presentation had always been sudden with spiking fever and erythema spreading in approximately the same localisation. Interestingly, on all three out of the three occasions where a blood culture has been drawn on presentation with erysipelas, the cultures have shown growth of a bacterium belonging to the S. mitis group. These first two isolates also had similar MIC values for penicillin of 0.064 and 0.125 mg/L, for vancomycin of 0.25 and 0.5 mg/L, and for gentamicin of 2 and 2 mg/L. In addition, they were both sensitive to clindamycin.

On the present visit, she once again had a sharply demarcated, warm, swollen, and painful erythema measuring approximately 7 × 15 cm in the lymphoedematous area on her left upper arm. No local portal of bacterial entry was found. Vital parameters showed a temperature of 38.0°C, respiratory rate of 16 breaths/min, O2 saturation of 96% on room air, heart rate of 80 beats/min, and blood pressure of 120/70 mmHg. On physical examination, a grade II systolic murmur was heard with punctum maximum I2 dexter. She had no signs of septic emboli, oral examination showed no signs of infection, and examination of lymph nodes was normal. Possibly due to her quick presentation, that is, less than 6 hours from the onset of symptoms, her laboratory results were normal with a white blood cell count of 8.4 × 10^9/L, platelets of 263 × 10^9/L, and hemoglobin of 147 g/L. Her CRP was 12 mg/L. She was clinically diagnosed with erysipelas, and due to previous bacteraemia with the S. mitis group in relation to erysipelas and the presence of a systolic murmur, blood cultures were drawn and she was treated with one dose of intravenous penicillin (3 g ≈ 5 million IU) followed by an oral penicillin (1 g ≈ 1.6 million IU) three times daily, for seven days. Once again, now for the third time, the two blood cultures showed growth of a bacterium belonging to the S. mitis group. The MIC value for penicillin was 0.125 mg/L, for vancomycin 1 mg/L, and for gentamicin 16 mg/L. Similar to the two previous isolates, it was also sensitive to clindamycin. Her treatment was prolonged for 10 days, and a follow-up visit was arranged. Repeat blood cultures were drawn 14 days after discontinuation of antibiotics and they were negative. To prevent further infections, she has once again been referred to the lymphoedema outpatient clinic as well as to the dentist office. On follow-up, thereafter, the patient had no sequelae to her infection, and she gave informed consent for this case report to be published.

The three blood isolates, one analysed in 2015 and two in 2017 (15 and 8 months apart), were initially subgrouped to S. mitis/S. oralis/S. pseudopneumoniae of the S. mitis group by combining the MALDI-TOF MS results (MALDI Biotyper, Bruker) with the information that the three strains were resistant to optochin. To allow a more detailed comparison, the three stored isolates were reanalysed and now ethanol/formic acid extractions were performed on the strains, and the updated and improved Bruker MALDI Biotyper database (DB-7311 MSP Library) was used for the MALDI Biotyper analysis. In addition to the standard log (score), weighted list (scores) was also calculated. S. mitis was the best match for both the first and second isolates when both log (score) and list (score) were calculated. For the third isolate, the best match was S. oralis for both types of scores. Next, the mass spectra of the three isolates were inspected manually. All three strains showed the specific peak 6839.1 m/z which is associated with S. mitis and S. oralis strains, but only the third isolate showed the specific peak 5822.5 m/z which is associated with S. oralis. In addition, no peak profiles typical for S. pneumoniae and S. pseudopneumoniae could be detected in the three isolates. These results further support that the first two isolates are S. mitis and the third isolate is S. oralis. Many differences were seen in the mass spectra of the third isolate (S. oralis) compared to the first two (S. mitis). On the other hand, no clear differences in the spectra between the first and second isolate could be seen, and one can therefore not exclude that they belong to the same clone.

7. Expected Output Format

The expected output format is a 3-5 sentence summary that uses patient-friendly, layman language for ease of understanding. The output keeps key information from the doctor's note so that critical details about the patient's health are still disclosed, but in an interpretable way.

Below is an example of the expected output format for a summary:

A 70-year-old woman was admitted to the hospital because she had a sudden fever, chills, and a skin infection on her arm. She had a history of breast cancer and had experienced similar skin infections several times before. Blood tests showed that the infection was caused by a bacteria from the S. mitis group, which had caused problems in the past. She was treated with antibiotics, and after a few weeks, the infection cleared up. To prevent future infections, she was referred to specialists for her lymphoedema and to the dentist.

8. Limitations

Some limitations of this model include:

  1. This model does not provide medical advice. It only summarizes given doctor's notes into more readable summaries for patients, so only doctor's notes should be given as input to get accurate and properly formatted output.

  2. The model was trained to keep key details but remove overwhelming medical jargon. In this process, not all "key" details may be retained in the output summary. The output summary should be used to better understand the doctor's note, not to make medical decisions. In short, the summary does not replace the doctor's note.

  3. For any niche conditions/diagnoses that weren't covered in the training data, there is a risk of hallucination as the model may not be specialized enough to accurately output a summary.

  4. Since the model is specifically trained on doctor's-note/summary pairs, it may not perform as well on reasoning tasks or general non-STEM tasks, as it is niche to medical note summarization.

Citations

APA:

Caballar, R., & Stryker, C. (2025, July 22). What are LLM benchmarks?. IBM. https://www.ibm.com/think/topics/llm-benchmarks

Chugani, V. (2025, July 21). How to understand MMLU scores: The ‘SAT test’ for AI models. Statology. https://www.statology.org/how-to-understand-mmlu-scores-the-sat-test-for-ai-models/

Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. ACL Anthology. https://aclanthology.org/D18-1206/

Ahmed, S. (2024, October 29). The common failure points of LLM RAG systems and how to overcome them. Medium. https://medium.com/@sahin.samia/the-common-failure-points-of-llm-rag-systems-and-how-to-overcome-them-926d9090a88f
