+---
+language:
+- en
+license: mit
+tags:
+- spam-classification
+- email-classification
+- lora
+- peft
+- text-classification
+- transformers
+datasets:
+- purusinghvi/email-spam-classification-dataset
+metrics:
+- accuracy
+- f1
+- precision
+- recall
+- roc-auc
+base_model: FacebookAI/roberta-base
+library_name: peft
+pipeline_tag: text-classification
+---
+# Spam Email Classifier - RoBERTa-base with LoRA (r=8)
+This model is a LoRA adapter for `FacebookAI/roberta-base` that classifies emails as spam or ham. It was fine-tuned on the [Email Spam Classification Dataset](https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset), which contains 83,448 emails.
+## Model Description
+- **Base Model**: FacebookAI/roberta-base
+- **LoRA Rank**: 8
+- **LoRA Alpha**: 16
+- **Task**: Binary Text Classification (Spam/Ham)
+- **Training Dataset**: 83,448 emails in total, 66,758 of them used for training
+- **Trainable Parameters**: 1,919,234 (1.52% of total)
+- **Total Parameters**: 126,566,404
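The trainable-parameter figure can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not output from the training code. It assumes the standard `roberta-base` dimensions, that the `dense` target listed under LoRA Configuration matches the attention output layer and both feed-forward layers in each encoder layer, and that the classification head is trained in full. The helper name `lora_params` is illustrative.

```python
# Back-of-the-envelope count of trainable parameters for this adapter.
# Assumptions (not stated in the card): roberta-base dimensions, bias-free
# LoRA matrices, and a classification head that is trained in full.
HIDDEN = 768      # roberta-base hidden size
FFN = 3072        # roberta-base feed-forward size
LAYERS = 12       # roberta-base encoder layers
RANK = 8          # LoRA rank from this card
NUM_LABELS = 2    # spam / ham


def lora_params(in_features: int, out_features: int, r: int = RANK) -> int:
    """LoRA adds A (r x in) and B (out x r) next to a frozen linear layer."""
    return r * in_features + out_features * r


per_layer = (
    3 * lora_params(HIDDEN, HIDDEN)   # query, key, value
    + lora_params(HIDDEN, HIDDEN)     # attention output dense
    + lora_params(HIDDEN, FFN)        # feed-forward intermediate dense
    + lora_params(FFN, HIDDEN)        # feed-forward output dense
)
lora_total = LAYERS * per_layer

# Classification head: dense (768 -> 768) and out_proj (768 -> 2), with biases.
head = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_LABELS + NUM_LABELS)

trainable = lora_total + head
total = 126_566_404  # total parameter count reported in this card
print(f"{trainable:,} trainable ({trainable / total:.2%} of total)")
# prints: 1,919,234 trainable (1.52% of total)
```

Under these assumptions the count matches the reported figure exactly, which suggests the adapter covers the feed-forward layers as well as the attention projections.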
+## Performance
+| Metric | Score |
+|--------|-------|
+| **Accuracy** | 99.45% |
+| **Precision** | 99.52% |
+| **Recall** | 99.43% |
+| **F1 Score** | 99.48% |
+| **ROC-AUC** | 0.9989 |
+| **PR-AUC** | 0.9987 |
+**Training Time**: 544.92 minutes (~9.1 hours)
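As a consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded table values. Because the inputs are rounded to two decimals, the result can differ from the reported F1 in the last digit.

```python
# Values copied from the Performance table (already rounded to two decimals).
precision = 0.9952
recall = 0.9943

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed from rounded inputs: {f1:.2%}")
```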
+## Usage
+### Method 1: Using the Inference Script (Recommended)
+Download the inference script and config from the [GitHub repository](https://github.com/sherozshaikh/spam-email-classification-lora/tree/main/inference):
+```bash
+# Download inference files
+wget https://raw.githubusercontent.com/sherozshaikh/spam-email-classification-lora/main/inference/inference.py
+wget https://raw.githubusercontent.com/sherozshaikh/spam-email-classification-lora/main/inference/inference_config.yaml
+# Update inference_config.yaml with this model:
+# base_model_name: "FacebookAI/roberta-base"
+# adapter_path: "ssheroz/spam-email-classifier-roberta-r8"
+```
+**Python API:**
+```python
+from inference import SpamClassifier
+# Initialize classifier
+classifier = SpamClassifier(config_path="inference_config.yaml")
+# Classify single email
+email = "Subject: URGENT! You've won $1,000,000! Click here to claim now!"
+result = classifier.predict_single(email)
+print(f"Prediction: {result['label']}")
+print(f"Confidence: {result['confidence']:.2%}")
+print(f"Probabilities: {result['probabilities']}")
+```
+**Command Line:**
+```bash
+# Single email prediction
+python inference.py --text "Subject: Meeting tomorrow at 2pm"
+# Batch prediction from CSV
+python inference.py --input_file emails.csv --output_file predictions.csv
+```
+### Method 2: Direct Usage with Transformers
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+from peft import PeftModel
+import torch
+# Load base model and tokenizer
+base_model_name = "FacebookAI/roberta-base"
+tokenizer = AutoTokenizer.from_pretrained(base_model_name)
+base_model = AutoModelForSequenceClassification.from_pretrained(
+    base_model_name,
+    num_labels=2,
+    problem_type="single_label_classification"
+)
+# Load LoRA adapter
+model = PeftModel.from_pretrained(base_model, "ssheroz/spam-email-classifier-roberta-r8")
+model.eval()
+# Inference
+text = "Subject: URGENT! You've won $1,000,000! Click here now!"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+with torch.no_grad():
+    outputs = model(**inputs)
+    probabilities = torch.softmax(outputs.logits, dim=1)
+    prediction = torch.argmax(probabilities, dim=1).item()
+label = "SPAM" if prediction == 1 else "HAM"
+confidence = probabilities[0][prediction].item()
+print(f"Prediction: {label} (Confidence: {confidence:.2%})")
+```
+## Training Details
+### Hyperparameters
+- **Epochs**: 2
+- **Learning Rate**: 2e-4
+- **Batch Size**: 16
+- **Optimizer**: AdamW with weight decay (0.01)
+- **Scheduler**: Cosine with warmup (10% warmup ratio)
+- **Gradient Clipping**: 1.0
+- **Mixed Precision**: FP16
+- **Early Stopping**: Patience=2
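The scheduler settings above imply a learning-rate curve that rises linearly during warmup and then follows a cosine decay. The sketch below is plain Python under stated assumptions: one optimizer step per batch of 16 (no gradient accumulation, single device) and the common linear-warmup-then-cosine shape. The step counts are derived here, not taken from the training logs, and the function name `lr_at` is illustrative.

```python
import math

PEAK_LR = 2e-4          # learning rate from this card
BATCH_SIZE = 16         # batch size from this card
EPOCHS = 2              # epochs from this card
TRAIN_SAMPLES = 66_758  # training split size from this card
WARMUP_RATIO = 0.10     # warmup ratio from this card

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"steps per epoch: {steps_per_epoch}, total: {total_steps}, warmup: {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

With these assumptions the peak rate of 2e-4 is reached after warmup and decays to zero by the final step.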
+### LoRA Configuration
+- **Rank (r)**: 8
+- **Alpha**: 16
+- **Dropout**: 0.1
+- **Target Modules**: query, key, value, dense (the attention projections plus the feed-forward dense layers in every encoder layer)
+### Data Split
+- **Train**: 66,758 samples (80%)
+- **Validation**: 8,345 samples (10%)
+- **Test**: 8,345 samples (10%)
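The split sizes above follow from an 80/10/10 cut of 83,448 emails. The sketch below shows one way to produce such a split with the standard library. The seed, the shuffle, and the absence of stratification are assumptions for illustration; the actual procedure is defined in the training code. The function name `split_indices` is hypothetical.

```python
import random


def split_indices(n, train_frac=0.8, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    n_train = round(n * train_frac)
    n_val = (n - n_train) // 2            # remainder is shared by val and test
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(83_448)
print(len(train), len(val), len(test))
# prints: 66758 8345 8345
```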
+## Limitations
+- Trained primarily on English emails
+- Performance may degrade on text from other domains (e.g., social media posts, SMS)
+- Requires periodic retraining for evolving spam patterns
+- False positives (legitimate emails marked as spam) can occur with unusual email patterns
+## Ethical Considerations
+- False positives may cause users to miss important emails
+- Should be used as part of a larger system with human oversight for critical applications
+- Regular monitoring and updates recommended to maintain effectiveness
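One way to act on the points above is to move only high-confidence spam out of the inbox and send borderline scores to human review. This is a minimal sketch, assuming the spam probability comes from the softmax output shown in Method 2. The function name `route_email` and both thresholds are hypothetical and would need tuning on validation data.

```python
def route_email(p_spam: float,
                spam_threshold: float = 0.90,
                review_threshold: float = 0.50) -> str:
    """Map a spam probability to an action; thresholds are illustrative only."""
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("p_spam must be a probability in [0, 1]")
    if p_spam >= spam_threshold:
        return "spam"      # confident enough to filter automatically
    if p_spam >= review_threshold:
        return "review"    # uncertain: leave for a human to decide
    return "ham"           # deliver normally


for p in (0.99, 0.70, 0.10):
    print(f"p_spam={p:.2f} -> {route_email(p)}")
```

Raising `spam_threshold` trades some missed spam for fewer legitimate emails filtered by mistake.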
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{shaikh2025spamclassifier,
+  author = {Sheroz Shaikh},
+  title = {Spam Email Classification using LoRA Fine-tuned Transformers},
+  year = {2025},
+  publisher = {HuggingFace},
+  howpublished = {\url{https://huggingface.co/ssheroz/spam-email-classifier-roberta-r8}}
+}
+```
+## Related Models
+- [ELECTRA r=4](https://huggingface.co/ssheroz/spam-email-classifier-electra-r4)
+- [ELECTRA r=8](https://huggingface.co/ssheroz/spam-email-classifier-electra-r8)
+- [RoBERTa r=4](https://huggingface.co/ssheroz/spam-email-classifier-roberta-r4)
+## GitHub Repository
+**Full training code, analysis, and inference scripts**: [spam-email-classification-lora](https://github.com/sherozshaikh/spam-email-classification-lora)
+## License
+MIT License - See [LICENSE](https://github.com/sherozshaikh/spam-email-classification-lora/blob/main/LICENSE) for details.
+## Contact
+- **GitHub**: [@sherozshaikh](https://github.com/sherozshaikh)
+- **Hugging Face**: [@ssheroz](https://huggingface.co/ssheroz)