---
language: en
license: mit
tags:
- spam-detection
- email-classification
- text-classification
- sklearn
- naive-bayes
datasets:
- meruvulikith/190k-spam-ham-email-dataset-for-classification
metrics:
- accuracy
model-index:
- name: spam-email-classifier
  results:
  - task:
      type: text-classification
    metrics:
    - type: accuracy
      value: 0.96
---

# Spam Email Classifier

This model classifies emails as spam or ham (legitimate) with **96%+ accuracy** on the test set.

## Model Details

- **Model Type:** Ensemble (MultinomialNB + Logistic Regression) or the best-performing single model
- **Training Data:** 190,000 spam/ham emails from Kaggle
- **Features:** TF-IDF vectorization with 10,000 features over unigrams, bigrams, and trigrams
- **Accuracy:** 96%+ on the test set

## Files Included

- `spam_classifier_model.pkl` - Trained classification model
- `tfidf_vectorizer.pkl` - TF-IDF vectorizer (required for inference)

## Usage

```python
import pickle
import re

from huggingface_hub import hf_hub_download

# Download the model and the vectorizer from the Hub
model_path = hf_hub_download(
    repo_id="satyam2025/spam-email-classifier",
    filename="spam_classifier_model.pkl"
)
vectorizer_path = hf_hub_download(
    repo_id="satyam2025/spam-email-classifier",
    filename="tfidf_vectorizer.pkl"
)

# Load both files
with open(model_path, 'rb') as f:
    model = pickle.load(f)
with open(vectorizer_path, 'rb') as f:
    vectorizer = pickle.load(f)


# Preprocessing function
def clean_email_text(text):
    text = text.lower()
    text = re.sub(r'\S*@\S*\s?', '', text)        # remove email addresses
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)             # strip HTML tags
    text = re.sub(r'[^a-zA-Z\s!?.]', ' ', text)   # keep letters and ! ? .
    text = ' '.join(text.split())                 # collapse whitespace
    return text


# Predict function
def predict_spam(email_text, threshold=0.8):
    cleaned = clean_email_text(email_text)
    features = vectorizer.transform([cleaned])
    # Column 1 of predict_proba is taken to be the spam class
    spam_probability = float(model.predict_proba(features)[0][1])
    is_spam = spam_probability >= threshold
    return {
        'spam_probability': spam_probability,
        'is_spam': is_spam,
        'classification': 'SPAM' if is_spam else 'HAM'
    }


# Example
email = "Congratulations! You won $1000. Click here now!"
result = predict_spam(email)
print(result)
# Example output (probability rounded):
# {'spam_probability': 0.966, 'is_spam': True, 'classification': 'SPAM'}
```

## Performance Metrics

- **Accuracy:** 96%+
- **Precision (Spam):** 95%+
- **Recall (Spam):** 91%+
- **F1-Score:** 93%+

## Training Details

- **Dataset Size:** 190,000 emails
- **Train/Test Split:** 80/20
- **Preprocessing:** URL removal, email address removal, punctuation normalization
- **Vectorization:** TF-IDF with n-grams of 1 to 3 words
- **Models Tested:** MultinomialNB, Logistic Regression, Ensemble

## Limitations

- Trained on English emails only
- May not perform well on non-standard text formats
- Requires both `.pkl` files for inference

## Citation

```bibtex
@misc{spam-classifier-2024,
  author = {Satyam},
  title = {Spam Email Classifier},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/satyam2025/spam-email-classifier}}
}
```

## License

MIT License