🛡️ Shield 82M

Welcome to Shield 82M, a model designed to filter personally identifiable information (PII) out of text across many languages.

Classes

This model has the following PII classes:

['O', 'ACCOUNTNAME', 'ACCOUNTNUMBER', 'AGE', 'AMOUNT', 'BIC', 'BITCOINADDRESS', 'BUILDINGNUMBER', 'CITY', 'COMPANYNAME', 'COUNTY', 'CREDITCARDCVV', 'CREDITCARDISSUER', 'CREDITCARDNUMBER', 'CURRENCY', 'CURRENCYCODE', 'CURRENCYNAME', 'CURRENCYSYMBOL', 'DATE', 'DOB', 'EMAIL', 'ETHEREUMADDRESS', 'EYECOLOR', 'FIRSTNAME', 'GENDER', 'HEIGHT', 'IBAN', 'IP', 'IPV4', 'IPV6', 'JOBAREA', 'JOBTITLE', 'JOBTYPE', 'LASTNAME', 'LITECOINADDRESS', 'MAC', 'MASKEDNUMBER', 'MIDDLENAME', 'NEARBYGPSCOORDINATE', 'ORDINALDIRECTION', 'PASSWORD', 'PHONEIMEI', 'PHONENUMBER', 'PIN', 'PREFIX', 'SECONDARYADDRESS', 'SEX', 'SSN', 'STATE', 'STREET', 'TIME', 'URL', 'USERAGENT', 'USERNAME', 'VEHICLEVIN', 'VEHICLEVRM', 'ZIPCODE']
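The model assigns one of these classes to each token ('O' marks non-PII tokens); a masking step then collapses consecutive tokens sharing a label into a single placeholder. A minimal sketch of that step (the token/label pairs below are illustrative, not actual model output):

```python
def mask_tokens(tagged):
    """Collapse runs of identically-labelled tokens into one [LABEL] placeholder."""
    out, prev = [], None
    for token, label in tagged:
        if label == "O":
            out.append(token)
            prev = None
        elif label == prev:
            continue  # same entity continues; placeholder already emitted
        else:
            out.append(f"[{label}]")
            prev = label
    return " ".join(out)

# Illustrative tagger output, not a real model prediction:
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"),
    ("John", "FIRSTNAME"), ("Doe", "LASTNAME"), (".", "O"),
]
print(mask_tokens(tagged))  # My name is [FIRSTNAME] [LASTNAME] .
```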

Base model

This model is based on distilroberta-base.

Examples

The model reaches a validation accuracy of ~96% (0.961206).
Here are a few examples:

Test with name, email and phone

Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678.
Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE].

Basic test

Original: I live in Cambridge
Protected: I live in [ADDRESS]

French test (multilingual)

Original: Mon e-mail est jean.dupont@example.fr et mon téléphone est +33 6 12 34 56 78.
Protected: Mon e-mail est [EMAIL] et mon téléphone est [PHONE].

Quickstart

To use this model, download use.py from this repo and run it:

mkdir Shield-82M
cd Shield-82M
wget https://huggingface.co/LH-Tech-AI/Shield-82M/resolve/main/use.py
python3 use.py

This outputs something like:

Loading Shield-82M from LH-Tech-AI/Shield-82M...

Loading weights: 100%
 103/103 [00:00<00:00, 773.65it/s, Materializing param=roberta.encoder.layer.5.output.dense.weight]

Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678.
Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE].

To run it on your own text, adjust this line in use.py:

sample = "My name is John Doe. Email: john@example.com. Phone: +49 123 45678."
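If you would rather call the model directly than edit use.py, the replacement step can be sketched as below. This assumes the usual Hugging Face token-classification pipeline output format (dicts with entity_group, start, end character offsets); the hand-written entity list stands in for real model output:

```python
def protect(text, entities):
    """Replace detected spans with [LABEL] placeholders, working right to
    left so earlier character offsets stay valid after each substitution."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

sample = "Email: john@example.com"
# In practice these spans would come from the model, e.g. via
# transformers.pipeline("token-classification",
#     model="LH-Tech-AI/Shield-82M", aggregation_strategy="simple")(sample)
entities = [{"entity_group": "EMAIL", "start": 7, "end": 23}]
print(protect(sample, entities))  # Email: [EMAIL]
```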

Training data

This model was trained on the first 20,000 samples of ai4privacy/pii-masking-200k for 3 epochs.

Training details

The following table shows per-epoch training metrics:

Epoch Training Loss Validation Loss Precision Recall F1 Accuracy
1 1.048266 0.250184 0.904065 0.932844 0.918229 0.949456
2 0.257664 0.193614 0.939548 0.949651 0.944572 0.959521
3 0.199425 0.181754 0.939833 0.952215 0.945983 0.961206
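As a quick sanity check, the F1 column is the harmonic mean of precision and recall; recomputing it for epoch 3 reproduces the tabulated value to rounding:

```python
p, r = 0.939833, 0.952215  # epoch 3 precision / recall from the table above
f1 = 2 * p * r / (p + r)   # harmonic mean
print(round(f1, 6))        # 0.945983
```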

You can find the full training code in train.ipynb. Training runs on 2x Kaggle T4 GPUs in about 7 minutes.
