Semester 252 · 2026

Email Spam Detector.

A machine learning system that classifies email and SMS messages as valid or spam using four trained classification models. Built for MIS 542, Spam Detection Project.


Project Goal

The goal of this project is to develop a machine learning model capable of automatically classifying email and SMS messages as either valid or spam. Spam messages are unsolicited, often fraudulent communications that clutter inboxes and pose security risks to recipients.

By training on a labeled dataset of real-world messages, the system learns distinguishing linguistic patterns — such as urgency triggers, prize claims, and unusual punctuation — that are strongly associated with spam. The resulting model can then filter new, unseen messages in real time through a web interface.

Project Scope

This project encompasses four distinct phases:

01 Data Preparation: Load, explore, and preprocess a labeled SMS/email dataset through a 6-step text cleaning pipeline.
02 Feature Engineering: Transform cleaned text into numerical feature vectors using TF-IDF with a vocabulary of 5,000 terms.
03 Model Training: Train and evaluate four classification models (KNN, Naive Bayes, Logistic Regression, Random Forest) with and without SMOTE oversampling.
04 Deployment: Serve the trained models through an interactive web interface that classifies messages in real time.
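
The first three phases can be sketched end to end as follows. This is a minimal illustration, assuming scikit-learn is the library in use (the project text does not name one). The `clean_text` function is a hypothetical stand-in for the cleaning pipeline, whose six steps are not listed here, and the eight toy messages, the hyperparameters (`n_neighbors=3`, `n_estimators=50`, `max_iter=1000`), and the test message are invented for the example. Only the four model families and the 5,000-term vocabulary cap come from the project description.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def clean_text(text):
    """Hypothetical stand-in for the cleaning pipeline (actual steps not listed)."""
    text = text.lower()                         # normalize case
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


# Toy data for illustration only; 1 = spam, 0 = valid.
messages = [
    "WINNER!! Claim your FREE prize now at http://example.com",
    "Urgent: your account has won a cash prize, call now",
    "Free entry in a weekly prize draw, text WIN to claim",
    "Congratulations, you won a free holiday, claim now",
    "Are we still meeting for lunch tomorrow?",
    "Can you send me the lecture notes from today?",
    "I'll be home late, don't wait up for dinner",
    "Thanks for the help with the assignment yesterday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The four model families named in phase 03; hyperparameters are assumptions.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    # max_features=5000 caps the TF-IDF vocabulary at the size stated in phase 02.
    pipeline = make_pipeline(
        TfidfVectorizer(preprocessor=clean_text, max_features=5000), model
    )
    pipeline.fit(messages, labels)
    predictions[name] = int(pipeline.predict(["Claim your free prize now"])[0])

print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline means the same cleaning and vocabulary are applied at training time and at prediction time, which is what a web interface serving the models would need.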

Dataset at a Glance

The dataset contains 5,572 labeled messages sourced from real SMS and email communications.

Total Messages: 5,572

Valid Messages: 4,825 (86.6% of total)

Spam Messages: 747 (13.4% of total)

The dataset is heavily imbalanced: only 13.4% of messages are spam, roughly one spam message for every 6.5 valid ones. This motivates the use of SMOTE oversampling during model training.
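
The arithmetic behind the imbalance, and the core idea of SMOTE, can be sketched with the standard library alone. The counts are the ones reported for this dataset. The `smote_like_sample` helper and the two example feature vectors are hypothetical and show only the interpolation step (a synthetic point placed between a minority sample and one of its minority-class neighbors), not a full SMOTE implementation with nearest-neighbor search. The 1:1 target is also an assumption, and in practice oversampling is normally applied to the training split only, so the real number of synthetic samples would depend on the split.

```python
import random

# Counts taken from the dataset summary.
total, valid, spam = 5572, 4825, 747

spam_share = spam / total
# Accuracy of a trivial model that labels every message "valid".
majority_baseline = valid / total
# Synthetic spam samples needed for a 1:1 balance over the full dataset.
synthetic_needed = valid - spam


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(point, neighbor)]


rng = random.Random(0)
# Two made-up spam feature vectors (for example, TF-IDF weights of three terms).
synthetic = smote_like_sample([0.2, 0.0, 0.5], [0.4, 0.1, 0.5], rng)

print(f"spam share: {spam_share:.1%}")
print(f"majority baseline accuracy: {majority_baseline:.1%}")
print(f"synthetic samples for 1:1 balance: {synthetic_needed}")
```

The baseline figure is why accuracy alone is a weak yardstick here: a model that never flags spam would already be right on 86.6% of messages, so recall and precision on the spam class are more informative when comparing runs with and without oversampling.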
