---
language:
- en
license: apache-2.0
base_model: distilbert/distilroberta-base
tags:
- token-classification
- ner
- pii
- pii-detection
- de-identification
- privacy
- healthcare
- medical
- clinical
- phi
- hipaa
- pytorch
- transformers
- openmed
datasets:
- nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
- precision
- recall
model-index:
- name: OpenMed-PII-FastClinical-Base-82M-v1
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: nvidia/Nemotron-PII (test_strat)
      type: nvidia/Nemotron-PII
      split: test
    metrics:
    - type: f1
      value: 0.9511
      name: F1 (micro)
    - type: precision
      value: 0.9538
      name: Precision
    - type: recall
      value: 0.9484
      name: Recall
widget:
- text: >-
    Dr. Sarah Johnson (SSN: 123-45-6789) can be reached at
    sarah.johnson@hospital.org or 555-123-4567. She lives at 123 Oak Street,
    Boston, MA 02108.
  example_title: Clinical Note with PII
---
# OpenMed-PII-FastClinical-Base-82M-v1

**PII Detection Model | 82M Parameters | Open Source**

## Model Description
OpenMed-PII-FastClinical-Base-82M-v1 is a transformer-based token classification model fine-tuned for Personally Identifiable Information (PII) detection in text. This model identifies and classifies 54 types of sensitive information including names, addresses, SSNs, medical record numbers, and more.
## Key Features

- **High Accuracy**: Achieves strong F1 scores across diverse PII categories
- **Comprehensive Coverage**: Detects 50+ entity types spanning personal, financial, medical, and contact information
- **Privacy-Focused**: Designed for de-identification and compliance with HIPAA, GDPR, and other privacy regulations
- **Production-Ready**: Optimized for real-world text processing pipelines
## Performance

Evaluated on a stratified 2,000-sample test set from NVIDIA Nemotron-PII:

| Metric | Score |
|---|---|
| F1 (micro) | 0.9511 |
| Precision | 0.9538 |
| Recall | 0.9484 |

The following entity types have lower performance and may benefit from additional post-processing (see the confidence-thresholding sketch after the table):
| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| `pin` | 0.881 | 0.894 | 0.868 | 136 |
| `time` | 0.855 | 0.867 | 0.843 | 471 |
| `sexuality` | 0.822 | 0.763 | 0.892 | 83 |
| `gender` | 0.797 | 0.743 | 0.859 | 192 |
| `occupation` | 0.652 | 0.695 | 0.613 | 726 |
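
As a minimal example of such post-processing, the sketch below drops low-confidence predictions for these weaker labels. The threshold values and the `filter_entities` helper are illustrative assumptions, not part of the model; thresholds should be tuned on held-out data for your domain.

```python
# Illustrative post-processing: drop low-confidence predictions for the
# weaker entity types listed above. Threshold values are examples, not tuned.
LOW_CONFIDENCE_THRESHOLDS = {
    "pin": 0.90,
    "time": 0.90,
    "sexuality": 0.92,
    "gender": 0.92,
    "occupation": 0.95,
}

def filter_entities(entities, thresholds=LOW_CONFIDENCE_THRESHOLDS, default=0.50):
    """Keep an entity only if its score clears the threshold for its label."""
    return [
        ent for ent in entities
        if ent["score"] >= thresholds.get(ent["entity_group"].lower(), default)
    ]

# Usage: filtered = filter_entities(ner(text)), where `ner` is the
# aggregation pipeline shown under "Usage" below.
```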
## Supported Entity Types

This model detects 54 PII entity types organized into the categories below; a snippet for listing the full label set programmatically follows the tables.
### Identifiers (16 types)

| Entity | Description |
|---|---|
| `account_number` | Account Number |
| `api_key` | API Key |
| `bank_routing_number` | Bank Routing Number |
| `certificate_license_number` | Certificate License Number |
| `credit_debit_card` | Credit/Debit Card |
| `cvv` | CVV |
| `employee_id` | Employee ID |
| `health_plan_beneficiary_number` | Health Plan Beneficiary Number |
| `mac_address` | MAC Address |
| `medical_record_number` | Medical Record Number |

*...and 6 more*
### Personal Info (14 types)

| Entity | Description |
|---|---|
| `age` | Age |
| `biometric_identifier` | Biometric Identifier |
| `blood_type` | Blood Type |
| `date_of_birth` | Date of Birth |
| `education_level` | Education Level |
| `first_name` | First Name |
| `last_name` | Last Name |
| `gender` | Gender |
| `language` | Language |
| `occupation` | Occupation |

*...and 4 more*
### Contact Info (4 types)

| Entity | Description |
|---|---|
| `email` | Email |
| `phone_number` | Phone Number |
| `fax_number` | Fax Number |
| `url` | URL |
### Location (6 types)

| Entity | Description |
|---|---|
| `city` | City |
| `coordinate` | Coordinate |
| `country` | Country |
| `county` | County |
| `state` | State |
| `street_address` | Street Address |
### Network Info (3 types)

| Entity | Description |
|---|---|
| `device_identifier` | Device Identifier |
| `ipv4` | IPv4 |
| `ipv6` | IPv6 |
### Temporal (3 types)

| Entity | Description |
|---|---|
| `date` | Date |
| `date_time` | Date Time |
| `time` | Time |
### Organization (1 type)

| Entity | Description |
|---|---|
| `company_name` | Company Name |
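
The complete label inventory, including the types elided above, can be read directly from the model configuration. A minimal sketch, assuming the model uses a standard BIO labeling scheme (B-/I- prefixes plus an O tag):

```python
from transformers import AutoConfig

# Read the full label set from the model's config and collapse it to entity
# types by stripping BIO prefixes (assumes B-/I-/O labeling).
config = AutoConfig.from_pretrained("openmed/OpenMed-PII-FastClinical-Base-82M-v1")
labels = config.id2label.values()
entity_types = sorted({lbl.split("-", 1)[-1] for lbl in labels if lbl != "O"})

print(f"{len(entity_types)} entity types:")
print(entity_types)
```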
## Usage

### Quick Start
```python
from transformers import pipeline

# Load the PII detection pipeline
ner = pipeline(
    "ner",
    model="openmed/OpenMed-PII-FastClinical-Base-82M-v1",
    aggregation_strategy="simple",
)

text = """Patient John Smith (DOB: 03/15/1985, SSN: 123-45-6789) was seen today.
Contact: john.smith@email.com, Phone: (555) 123-4567.
Address: 456 Oak Street, Boston, MA 02108."""

entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
```
### De-identification Example
```python
def redact_pii(text, entities, placeholder=None):
    """Replace detected PII with entity-type tags, or a fixed placeholder if given."""
    # Entities come from the aggregated pipeline and carry 'start'/'end' character offsets.
    # Sort entities by start position (descending) so earlier offsets stay valid.
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        replacement = placeholder if placeholder is not None else f"[{ent['entity_group']}]"
        redacted = redacted[:ent['start']] + replacement + redacted[ent['end']:]
    return redacted

# Apply de-identification (each span is replaced by its entity label)
redacted_text = redact_pii(text, entities)
print(redacted_text)

# Or mask every span with a single generic placeholder
print(redact_pii(text, entities, placeholder='[REDACTED]'))
```
### Batch Processing
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "openmed/OpenMed-PII-FastClinical-Base-82M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Contact Dr. Jane Doe at jane.doe@hospital.org",
    "Patient SSN: 987-65-4321, MRN: 12345678",
]

inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map predicted label ids back to tag names for each sub-token
for i in range(len(texts)):
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][i])
    tags = [model.config.id2label[p.item()] for p in predictions[i]]
    print(list(zip(tokens, tags)))
```
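
If aggregated, span-level output is preferred over per-token logits, the pipeline from Quick Start also accepts a list of texts and batches them internally. A brief sketch; the `batch_size` value is an arbitrary example:

```python
from transformers import pipeline

# Batched inference with the aggregation pipeline: passing a list of texts
# returns one list of entities per input text.
ner = pipeline(
    "ner",
    model="openmed/OpenMed-PII-FastClinical-Base-82M-v1",
    aggregation_strategy="simple",
)

docs = [
    "Contact Dr. Jane Doe at jane.doe@hospital.org",
    "Patient SSN: 987-65-4321, MRN: 12345678",
]

for doc, ents in zip(docs, ner(docs, batch_size=8)):
    print(doc, "->", [(e["entity_group"], e["word"]) for e in ents])
```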