---
language: en
license: mit
tags:
- spam-detection
- email-classification
- text-classification
- sklearn
- naive-bayes
datasets:
- meruvulikith/190k-spam-ham-email-dataset-for-classification
metrics:
- accuracy
model-index:
- name: spam-email-classifier
  results:
  - task:
      type: text-classification
    metrics:
    - type: accuracy
      value: 0.96
---

# Spam Email Classifier

This model classifies emails as spam or ham (legitimate) with **96%+ accuracy**.

## Model Details

- **Model Type:** Ensemble of MultinomialNB and Logistic Regression, or whichever single model performed best in evaluation
- **Training Data:** 190,000 spam/ham emails from Kaggle
- **Features:** TF-IDF vectorization with 10,000 features over unigrams, bigrams, and trigrams
- **Accuracy:** 96%+ on the test set
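
This card does not state how the two models are combined in the ensemble. One common approach is soft voting, which averages each model's predicted spam probability. The sketch below illustrates that idea in plain Python; `soft_vote` and the sample probabilities are hypothetical and are not taken from the released model.

```python
def soft_vote(spam_probabilities, weights=None):
    """Weighted average of per-model spam probabilities (soft voting)."""
    if not spam_probabilities:
        raise ValueError("need at least one probability")
    if weights is None:
        # Equal weight for every model by default
        weights = [1.0] * len(spam_probabilities)
    if len(weights) != len(spam_probabilities):
        raise ValueError("weights and probabilities must have the same length")
    total = sum(weights)
    return sum(p * w for p, w in zip(spam_probabilities, weights)) / total

# Hypothetical outputs from a Naive Bayes and a Logistic Regression model
nb_prob, lr_prob = 0.75, 0.5
combined = soft_vote([nb_prob, lr_prob])
print(combined)  # 0.625
```

Passing `weights` lets one model count for more than the other, for example when one of them is better calibrated.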

## Files Included

- `spam_classifier_model.pkl` - Trained classification model
- `tfidf_vectorizer.pkl` - TF-IDF vectorizer (required for inference)

## Usage

```python
from huggingface_hub import hf_hub_download
import pickle
import re

# Download model and vectorizer
model_path = hf_hub_download(
    repo_id="satyam2025/spam-email-classifier",
    filename="spam_classifier_model.pkl"
)
vectorizer_path = hf_hub_download(
    repo_id="satyam2025/spam-email-classifier",
    filename="tfidf_vectorizer.pkl"
)

# Load files
with open(model_path, 'rb') as f:
    model = pickle.load(f)
with open(vectorizer_path, 'rb') as f:
    vectorizer = pickle.load(f)

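# Preprocessing function
def clean_email_text(text):
    """Normalize raw email text before vectorization."""
    text = text.lower()
    text = re.sub(r'\S*@\S*\s?', '', text)        # drop email addresses
    text = re.sub(r'http\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>', '', text)             # drop HTML tags
    # Replace everything except letters, whitespace, and ! ? . with a space
    text = re.sub(r'[^a-zA-Z\s!?.]', ' ', text)
    text = ' '.join(text.split())                 # collapse repeated whitespace
    return text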

# Predict function
def predict_spam(email_text, threshold=0.8):
    """Return the spam probability and label for a single email."""
    cleaned = clean_email_text(email_text)
    features = vectorizer.transform([cleaned])
    # Column 1 of predict_proba is assumed to be the spam class
    spam_probability = float(model.predict_proba(features)[0][1])
    is_spam = spam_probability >= threshold
    return {
        'spam_probability': spam_probability,
        'is_spam': is_spam,
        'classification': 'SPAM' if is_spam else 'HAM'
    }

# Example
email = "Congratulations! You won $1000. Click here now!"
result = predict_spam(email)
print(result)
# Example output (the exact probability may differ):
# {'spam_probability': 0.966, 'is_spam': True, 'classification': 'SPAM'}
```
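
To see what the preprocessing step does before vectorization, the snippet below repeats `clean_email_text` from the usage example and applies it to a made-up message. The sample text is an assumption chosen only to exercise each rule.

```python
import re

def clean_email_text(text):
    text = text.lower()
    text = re.sub(r'\S*@\S*\s?', '', text)        # drop email addresses
    text = re.sub(r'http\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>', '', text)             # drop HTML tags
    text = re.sub(r'[^a-zA-Z\s!?.]', ' ', text)   # keep letters, whitespace, ! ? .
    text = ' '.join(text.split())                 # collapse whitespace
    return text

# Hypothetical input containing HTML, digits, an address, and a URL
sample = "<b>WIN</b> $500 now! Email promo@example.com or visit http://example.com/deal"
cleaned = clean_email_text(sample)
print(cleaned)  # win now! email or visit
```

Digits and currency symbols are removed along with addresses and links, so the model only sees the remaining words and the `!`, `?`, and `.` characters.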

## Performance Metrics

- **Accuracy:** 96%+
- **Precision (Spam):** 95%+
- **Recall (Spam):** 91%+
- **F1-Score:** 93%+
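
As a consistency check on the figures above, F1 is the harmonic mean of precision and recall. The sketch below plugs in the lower bounds stated in this card (0.95 and 0.91); the function name is illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Lower bounds reported for the spam class
f1 = f1_from_precision_recall(0.95, 0.91)
print(round(f1, 4))  # 0.9296
```

The result rounds to 0.93, which agrees with the reported F1-score.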

## Training Details

- **Dataset Size:** 190,000 emails
- **Train/Test Split:** 80/20
- **Preprocessing:** Lowercasing, email address removal, URL removal, HTML tag removal, punctuation normalization
- **Vectorization:** TF-IDF over word n-grams of length 1 to 3 (unigrams, bigrams, and trigrams)
- **Models Tested:** MultinomialNB, Logistic Regression, Ensemble
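
To make the "1-3 word combinations" concrete, the sketch below lists the n-grams extracted from a short phrase. It uses simple whitespace tokenization in plain Python as an illustration; `word_ngrams` is a hypothetical helper, and the actual vectorizer may tokenize differently. If the vectorizer is scikit-learn's `TfidfVectorizer`, which the `sklearn` tag suggests but the card does not confirm, this range would correspond to `ngram_range=(1, 3)`.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("click here now")
print(grams)
# ['click', 'here', 'now', 'click here', 'here now', 'click here now']
```

Each distinct n-gram is a candidate feature; the 10,000-feature setting limits how many of them are kept.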

## Limitations

- Trained on English emails only
- May not perform well on non-standard text formats
- Requires both `.pkl` files for inference

## Citation

```bibtex
@misc{spam-classifier-2024,
  author = {Satyam},
  title = {Spam Email Classifier},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/satyam2025/spam-email-classifier}}
}
```

## License

MIT License