ssheroz committed on
Commit 4968218 · verified · 1 Parent(s): 511841a

Update README.md

Files changed (1)
  1. README.md +198 -3
README.md CHANGED
@@ -1,3 +1,198 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - spam-classification
+ - email-classification
+ - lora
+ - peft
+ - text-classification
+ - transformers
+ datasets:
+ - purusinghvi/email-spam-classification-dataset
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ - roc-auc
+ base_model: FacebookAI/roberta-base
+ library_name: peft
+ pipeline_tag: text-classification
+ ---
+
+ # Spam Email Classifier - RoBERTa-base with LoRA (r=8)
+
+ This model is a LoRA adapter for `FacebookAI/roberta-base` that classifies emails as spam or ham. It was fine-tuned on the [Email Spam Classification Dataset](https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset), which contains 83,448 emails.
+
+ ## Model Description
+
+ - **Base Model**: FacebookAI/roberta-base
+ - **LoRA Rank**: 8
+ - **LoRA Alpha**: 16
+ - **Task**: Binary Text Classification (Spam/Ham)
+ - **Training Dataset**: 83,448 emails (66,758 training samples)
+ - **Trainable Parameters**: 1,919,234 (1.52% of total)
+ - **Total Parameters**: 126,566,404
+
+ ## Performance
+
+ | Metric | Score |
+ |--------|-------|
+ | **Accuracy** | 99.45% |
+ | **Precision** | 99.52% |
+ | **Recall** | 99.43% |
+ | **F1 Score** | 99.48% |
+ | **ROC-AUC** | 0.9989 |
+ | **PR-AUC** | 0.9987 |
+
+ **Training Time**: 544.92 minutes (~9.1 hours)
+
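The F1 score in the table is the harmonic mean of precision and recall. A quick sanity check, using only the rounded precision and recall values reported above, shows that the three figures are mutually consistent:

```python
# Rounded precision and recall as reported in the Performance table.
precision = 0.9952
recall = 0.9943

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed from rounded inputs: {f1:.4%}")
```

Because the inputs are already rounded to two decimals, the recomputed value can differ from the reported 99.48% in the last digit; it lands within 0.01 percentage points.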
+ ## Usage
+
+ ### Method 1: Using the Inference Script (Recommended)
+
+ Download the inference script and config from the [GitHub repository](https://github.com/sherozshaikh/spam-email-classification-lora/tree/main/inference):
+
+ ```bash
+ # Download inference files
+ wget https://raw.githubusercontent.com/sherozshaikh/spam-email-classification-lora/main/inference/inference.py
+ wget https://raw.githubusercontent.com/sherozshaikh/spam-email-classification-lora/main/inference/inference_config.yaml
+
+ # Update inference_config.yaml with this model:
+ # base_model_name: "FacebookAI/roberta-base"
+ # adapter_path: "ssheroz/spam-email-classifier-roberta-r8"
+ ```
+
+ **Python API:**
+ ```python
+ from inference import SpamClassifier
+
+ # Initialize classifier
+ classifier = SpamClassifier(config_path="inference_config.yaml")
+
+ # Classify single email
+ email = "Subject: URGENT! You've won $1,000,000! Click here to claim now!"
+ result = classifier.predict_single(email)
+
+ print(f"Prediction: {result['label']}")
+ print(f"Confidence: {result['confidence']:.2%}")
+ print(f"Probabilities: {result['probabilities']}")
+ ```
+
+ **Command Line:**
+ ```bash
+ # Single email prediction
+ python inference.py --text "Subject: Meeting tomorrow at 2pm"
+
+ # Batch prediction from CSV
+ python inference.py --input_file emails.csv --output_file predictions.csv
+ ```
+
+ ### Method 2: Direct Usage with Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ from peft import PeftModel
+ import torch
+
+ # Load base model and tokenizer
+ base_model_name = "FacebookAI/roberta-base"
+ tokenizer = AutoTokenizer.from_pretrained(base_model_name)
+ base_model = AutoModelForSequenceClassification.from_pretrained(
+     base_model_name,
+     num_labels=2,
+     problem_type="single_label_classification",
+ )
+
+ # Load LoRA adapter on top of the base model and switch to eval mode
+ model = PeftModel.from_pretrained(base_model, "ssheroz/spam-email-classifier-roberta-r8")
+ model.eval()
+
+ # Inference: tokenize, truncating to the 512-token limit
+ text = "Subject: URGENT! You've won $1,000,000! Click here now!"
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     probabilities = torch.softmax(outputs.logits, dim=1)
+     prediction = torch.argmax(probabilities, dim=1).item()
+
+ # Class index 1 is SPAM, index 0 is HAM
+ label = "SPAM" if prediction == 1 else "HAM"
+ confidence = probabilities[0][prediction].item()
+
+ print(f"Prediction: {label} (Confidence: {confidence:.2%})")
+ ```
+
+ ## Training Details
+
+ ### Hyperparameters
+
+ - **Epochs**: 2
+ - **Learning Rate**: 2e-4
+ - **Batch Size**: 16
+ - **Optimizer**: AdamW with weight decay (0.01)
+ - **Scheduler**: Cosine with warmup (10% warmup ratio)
+ - **Gradient Clipping**: 1.0
+ - **Mixed Precision**: FP16
+ - **Early Stopping**: Patience=2
+
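To make the scheduler setting concrete, here is a minimal sketch of a linear-warmup, cosine-decay learning rate built from the values listed above. It assumes one optimizer step per batch of 16 (no gradient accumulation, a single device) and a decay to zero over half a cosine cycle. The helper `lr_at` and the derived step counts are illustrative and are not taken from the training code.

```python
import math

TRAIN_SAMPLES = 66_758   # training split size from the model card
BATCH_SIZE = 16
EPOCHS = 2
BASE_LR = 2e-4
WARMUP_RATIO = 0.10

# Assumption: one optimizer step per batch, with the last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to BASE_LR.
        return BASE_LR * step / max(1, warmup_steps)
    # Cosine decay from BASE_LR down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the run has 4,173 steps per epoch and 8,346 steps in total, of which the first 834 are warmup.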
+ ### LoRA Configuration
+
+ - **Rank (r)**: 8
+ - **Alpha**: 16
+ - **Dropout**: 0.1
+ - **Target Modules**: query, key, value, dense (all attention layers)
+
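The reported trainable-parameter count can be reproduced by hand. The sketch below assumes standard RoBERTa-base dimensions (12 layers, hidden size 768, feed-forward size 3072), that the `dense` target matches the attention output projection and both feed-forward layers in every encoder layer, and that the classification head is trained in full rather than through LoRA. These assumptions are inferred from the numbers in this card, not confirmed against the training code.

```python
HIDDEN = 768         # RoBERTa-base hidden size
INTERMEDIATE = 3072  # RoBERTa-base feed-forward size
LAYERS = 12
NUM_LABELS = 2
R = 8                # LoRA rank from the configuration above


def lora_params(d_in: int, d_out: int, r: int = R) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to a d_in -> d_out linear layer."""
    return r * (d_in + d_out)


per_layer = (
    3 * lora_params(HIDDEN, HIDDEN)        # query, key, value
    + lora_params(HIDDEN, HIDDEN)          # attention output dense
    + lora_params(HIDDEN, INTERMEDIATE)    # feed-forward up-projection
    + lora_params(INTERMEDIATE, HIDDEN)    # feed-forward down-projection
)
lora_total = LAYERS * per_layer

# Assumption: head trained in full, dense (768 -> 768) plus out_proj (768 -> 2), with biases.
head = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_LABELS + NUM_LABELS)

trainable = lora_total + head
print(f"LoRA: {lora_total:,}  head: {head:,}  trainable: {trainable:,}")
print(f"Share of 126,566,404 total: {trainable / 126_566_404:.2%}")
```

With these assumptions the sum is 1,327,104 LoRA parameters plus 592,130 head parameters, which gives 1,919,234 and matches the figure in the Model Description (about 1.52% of the total).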
+ ### Data Split
+
+ - **Train**: 66,758 samples (80%)
+ - **Validation**: 8,345 samples (10%)
+ - **Test**: 8,345 samples (10%)
+
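The split sizes follow from applying an 80/10/10 split to the 83,448 emails. The short check below reproduces them; the rounding order is an assumption, and the actual splitting code may differ.

```python
TOTAL = 83_448

train = int(TOTAL * 0.8)        # 80% of the data, rounded down
remaining = TOTAL - train       # what is left for validation and test
validation = remaining // 2     # half of the remainder
test = remaining - validation   # the other half

print(f"train={train:,} validation={validation:,} test={test:,}")
```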
+ ## Limitations
+
+ - Trained primarily on English emails
+ - Performance may degrade on spam from other domains (e.g., social media, SMS)
+ - Requires periodic retraining to keep up with evolving spam patterns
+ - False positives (legitimate emails marked as spam) can occur with unusual email patterns
+
+ ## Ethical Considerations
+
+ - False positives may cause users to miss important emails
+ - Should be used as part of a larger system with human oversight for critical applications
+ - Regular monitoring and updates are recommended to maintain effectiveness
+
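Because a false positive can hide a legitimate email, one common mitigation is to flag a message as spam only when the spam probability clears a threshold above 0.5. The sketch below shows one way to do that on raw logits. The function `classify_with_threshold`, the 0.9 default, and the assumption that index 1 is the spam class (as in the Method 2 example) are illustrative; any real threshold should be tuned on validation data.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_threshold(logits: Sequence[float], spam_threshold: float = 0.9) -> str:
    """Return "SPAM" only when P(spam) >= spam_threshold, else "HAM".

    Assumes logits are ordered [ham, spam] (hypothetical helper).
    """
    p_spam = softmax(logits)[1]
    return "SPAM" if p_spam >= spam_threshold else "HAM"


# A borderline message, P(spam) of roughly 0.62, under two thresholds.
print(classify_with_threshold([0.0, 0.5]))       # stricter default threshold
print(classify_with_threshold([0.0, 0.5], 0.5))  # plain argmax-style cut-off
# A confident spam prediction.
print(classify_with_threshold([-3.0, 3.0]))
```

Raising the threshold trades recall for precision: borderline messages stay in the inbox, at the cost of letting some spam through.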
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{shaikh2025spamclassifier,
+   author = {Sheroz Shaikh},
+   title = {Spam Email Classification using LoRA Fine-tuned Transformers},
+   year = {2025},
+   publisher = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/ssheroz/spam-email-classifier-roberta-r8}}
+ }
+ ```
+
+ ## Related Models
+
+ - [ELECTRA r=4](https://huggingface.co/ssheroz/spam-email-classifier-electra-r4)
+ - [ELECTRA r=8](https://huggingface.co/ssheroz/spam-email-classifier-electra-r8)
+ - [RoBERTa r=4](https://huggingface.co/ssheroz/spam-email-classifier-roberta-r4)
+
+ ## GitHub Repository
+
+ **Full training code, analysis, and inference scripts**: [spam-email-classification-lora](https://github.com/sherozshaikh/spam-email-classification-lora)
+
+ ## License
+
+ MIT License - See [LICENSE](https://github.com/sherozshaikh/spam-email-classification-lora/blob/main/LICENSE) for details.
+
+ ## Contact
+
+ - **GitHub**: [@sherozshaikh](https://github.com/sherozshaikh)
+ - **HuggingFace**: [@ssheroz](https://huggingface.co/ssheroz)