The slide presents a workflow for spam email detection using the Enron dataset (5,723 spam and 13,279 ham emails). It covers text preprocessing with NLTK and scikit-learn, TF-IDF feature extraction, and Multinomial Naive Bayes training, followed by evaluation on a test set, where accuracy, precision, recall, and F1-score fall at roughly 95-98%.
Methodology
{ "headers": [ "Phase", "Description", "Key Techniques" ], "rows": [ [ "Data Collection", "Gather 5,723 spam and 13,279 ham emails", "Enron dataset downloaded from GitHub" ], [ "Text Preprocessing", "Clean text: lowercase, remove stopwords and punctuation, apply stemming", "NLTK, scikit-learn (Python)" ], [ "Feature Extraction", "Convert text to numerical vectors", "TF-IDF Vectorizer" ], [ "Model Training", "Train a classifier on the labeled data", "Multinomial Naive Bayes" ], [ "Testing", "Evaluate on the test set", "Accuracy, precision, recall, F1-score (~95-98%)" ] ] }
Source: Data collection → Text preprocessing → Feature extraction → Model training → Testing
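The five phases above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the project's actual code: the eight-message corpus is a hypothetical stand-in for the Enron emails, the 75/25 split and `random_state` are arbitrary choices, and stemming is left out (an NLTK stemmer could be supplied through the vectorizer's `tokenizer` argument). With so little data the printed scores are not meaningful and will not match the ~95-98% reported for the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Data collection: hypothetical toy corpus standing in for the Enron emails
# (1 = spam, 0 = ham).
emails = [
    "Win a FREE prize now, click here!!!",
    "Cheap meds, limited offer, buy now",
    "You have won cash, claim your free reward",
    "Free lottery entry, click to win money",
    "Meeting moved to 3pm, see agenda attached",
    "Please review the quarterly gas report",
    "Lunch tomorrow? Let me know your schedule",
    "Attached are the notes from the pipeline meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a stratified test set (assumed 75/25 split).
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, stratify=labels, random_state=0
)

# Text preprocessing + feature extraction: TfidfVectorizer lowercases,
# drops punctuation via its default token pattern, and removes English
# stopwords. Model training: Multinomial Naive Bayes on the TF-IDF vectors.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
model.fit(X_train, y_train)

# Testing: predict on the held-out emails and compute the four metrics.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", zero_division=0
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned from the training emails only, so nothing from the test set leaks into feature extraction.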
Speaker Notes
Walk through the sequential steps of the methodology, emphasizing how each phase builds on the previous one for effective spam detection. Keep explanations simple and highlight key techniques.