Model description
NeuroTrialNER_BioLinkBERT is a fine-tuned BERT model designed for Named Entity Recognition (NER) of intervention and disease entities in clinical trial registries. It has been trained to recognize multiple entity types, including drugs (DRUG), conditions/diseases (COND), behavioural interventions (BEH), surgical interventions (SURG), physical interventions (PHYS), radiotherapy (RADIO), other interventions (OTHER), and control/comparator groups (CTRL). Specifically, this model is a BioLinkBERT-base model that was fine-tuned on the NeuroTrialNER dataset.
Intended uses & limitations
How to use
You can use this model with the Transformers pipeline for NER.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("simonada/NeuroTrialNER_BioLinkBERT")
model = AutoModelForTokenClassification.from_pretrained("simonada/NeuroTrialNER_BioLinkBERT")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example_drug = "This trial examines atypical antipsychotic aripiprazole as an augmenting agent to antidepressant therapy in treatment-resistant depressed patients."
example_phys = "This study evaluates a home-based resistance exercise program in post-treatment breast cancer survivors."

ner_results_drug = nlp(example_drug)
print(ner_results_drug)

ner_results_phys = nlp(example_phys)
print(ner_results_phys)
```
Limitations and bias
This model is limited by its training data: entity-annotated clinical trial registry records from a specific span of time, focused on the field of neuroscience. It may therefore not generalize well to other domains. Furthermore, the model occasionally tags subword tokens as entities, and post-processing of the results may be necessary to handle those cases.
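For the subword issue, a minimal post-processing sketch can merge WordPiece continuation pieces (`##...`) back into whole-word spans. The helper name is hypothetical, and the sample input below only mimics the shape of the pipeline's output (scores omitted):

```python
# Hypothetical helper, not part of the model or the transformers library:
# merge WordPiece continuation tokens ("##...") into the preceding entity span.
def merge_subword_entities(ner_results):
    """Fold entries whose word starts with '##' into the previous entry,
    concatenating the text and extending the character span."""
    merged = []
    for tok in ner_results:
        if merged and tok["word"].startswith("##"):
            merged[-1]["word"] += tok["word"][2:]
            merged[-1]["end"] = tok["end"]
        else:
            merged.append(dict(tok))
    return merged

# Input shaped like pipeline("ner", ...) output for "aripiprazole"
raw = [
    {"entity": "B-DRUG", "word": "ari", "start": 43, "end": 46},
    {"entity": "I-DRUG", "word": "##pip", "start": 46, "end": 49},
    {"entity": "I-DRUG", "word": "##razole", "start": 49, "end": 55},
]
clean = merge_subword_entities(raw)
```

Alternatively, the pipeline's built-in `aggregation_strategy` argument (e.g. `aggregation_strategy="simple"`) performs a similar grouping at inference time.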
⚠️ Please consider your use case: this model performed best on drug and disease entities, while BioBERT recognized the other intervention types better.
Training data
This model was fine-tuned on the NeuroTrialNER dataset.
The training dataset distinguishes between the beginning and continuation of an entity, so that when two entities of the same type appear back to back, the model can mark where the second one begins. Following the dataset's annotation scheme, each token is classified as one of the following classes:
| Abbreviation | Description |
|---|---|
| O | Outside of a named entity |
| B-DRUG | Beginning of a drug entity |
| I-DRUG | Inside of a drug entity |
| B-COND | Beginning of a condition (disease) entity |
| I-COND | Inside of a condition |
| B-BEH | Beginning of a behavioural intervention |
| I-BEH | Inside of a behavioural intervention |
| B-SURG | Beginning of a surgical intervention |
| I-SURG | Inside of a surgical intervention |
| B-PHYS | Beginning of a physical intervention |
| I-PHYS | Inside of a physical intervention |
| B-RADIO | Beginning of a radiotherapy intervention |
| I-RADIO | Inside of a radiotherapy intervention |
| B-OTHER | Beginning of other intervention |
| I-OTHER | Inside of other intervention |
| B-CTRL | Beginning of a control/comparator |
| I-CTRL | Inside of a control/comparator |
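As an illustration of how these tags decode back into entity spans, here is a minimal sketch (the helper name is hypothetical; this function is not shipped with the model):

```python
# Hypothetical decoder: group BIO tags into (entity_type, text) spans.
def bio_to_spans(tokens, tags):
    """A B- tag always opens a new span, so back-to-back entities of the
    same type stay separate; O and stray I- tags are ignored."""
    spans = []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            spans.append((tag[2:], [tok]))
        elif tag.startswith("I-") and spans and spans[-1][0] == tag[2:]:
            spans[-1][1].append(tok)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["aripiprazole", "for", "treatment", "resistant", "depression"]
tags = ["B-DRUG", "O", "B-COND", "I-COND", "I-COND"]
spans = bio_to_spans(tokens, tags)
```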
Evaluation results
A strict match requires the predicted entity to match the gold standard exactly in both boundaries and entity type. A partial match requires the correct entity type and substantial character overlap between the predicted and gold entities, assessed through a similarity ratio.
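As a rough sketch of the two criteria (the exact similarity measure and threshold are assumptions here; `difflib.SequenceMatcher` stands in for whatever ratio the evaluation actually used):

```python
from difflib import SequenceMatcher

# Hypothetical matcher illustrating strict vs. partial matching.
# The 0.75 threshold is an assumption, not the paper's setting.
def match_kind(pred_text, gold_text, pred_type, gold_type, threshold=0.75):
    if pred_type != gold_type:
        return "none"  # wrong entity type never counts
    if pred_text == gold_text:
        return "strict"  # exact boundary and type match
    if SequenceMatcher(None, pred_text, gold_text).ratio() >= threshold:
        return "partial"  # same type, substantial character overlap
    return "none"
```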
BioLinkBERT-base Performance
| Entity Type | Exact (95% CI) | Partial (95% CI) |
|---|---|---|
| DRUG | 0.83 (0.77, 0.89) | 0.90 (0.85, 0.95) |
| CONDITION | 0.77 (0.73, 0.81) | 0.85 (0.82, 0.89) |
| CONTROL | 0.69 (0.59, 0.78) | 0.85 (0.78, 0.92) |
| PHYSICAL | 0.41 (0.31, 0.50) | 0.71 (0.64, 0.79) |
| BEHAVIOURAL | 0.32 (0.21, 0.42) | 0.68 (0.60, 0.77) |
| OTHER | 0.39 (0.33, 0.46) | 0.62 (0.56, 0.67) |
| SURGICAL | 0.09 (0.00, 0.22) | 0.29 (0.12, 0.46) |
| RADIOTHERAPY | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) |
NeuroTrialNER Dataset Statistics
You can read more about how this dataset was created in the NeuroTrialNER paper.
Number of articles and entities per dataset split (totals, with unique counts in parentheses)
| Dataset | Articles | CONDITION | DRUG | OTHER | PHYSICAL | BEHAVIOURAL | SURGICAL | RADIOTHERAPY | CONTROL |
|---|---|---|---|---|---|---|---|---|---|
| Train | 787 | 3524 (1068) | 1205 (415) | 1361 (749) | 326 (191) | 156 (105) | 83 (58) | 30 (13) | 396 (138) |
| Dev | 153 | 729 (191) | 218 (62) | 278 (164) | 138 (63) | 70 (48) | 36 (24) | 25 (7) | 74 (37) |
| Test | 153 | 683 (171) | 213 (77) | 167 (103) | 130 (60) | 91 (55) | 54 (37) | 22 (5) | 84 (31) |
BibTeX entry
```bibtex
@inproceedings{doneva-etal-2024-neurotrialner,
    title = "{N}euro{T}rial{NER}: An Annotated Corpus for Neurological Diseases and Therapies in Clinical Trial Registries",
    author = "Doneva, Simona Emilova and
      Ellendorff, Tilia and
      Sick, Beate and
      Goldman, Jean-Philippe and
      Cannon, Amelia Elaine and
      Schneider, Gerold and
      Ineichen, Benjamin Victor",
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1050/",
    doi = "10.18653/v1/2024.emnlp-main.1050",
    pages = "18868--18890",
    abstract = "Extracting and aggregating information from clinical trial registries could provide invaluable insights into the drug development landscape and advance the treatment of neurologic diseases. However, achieving this at scale is hampered by the volume of available data and the lack of an annotated corpus to assist in the development of automation tools. Thus, we introduce NeuroTrialNER, a new and fully open corpus for named entity recognition (NER). It comprises 1093 clinical trial summaries sourced from ClinicalTrials.gov, annotated for neurological diseases, therapeutic interventions, and control treatments. We describe our data collection process and the corpus in detail. We demonstrate its utility for NER using large language models and achieve a close-to-human performance. By bridging the gap in data resources, we hope to foster the development of text processing tools that help researchers navigate clinical trials data more easily."
}
```
Base model: michiyasunaga/BioLinkBERT-base