Data

Dataset structure, the 6-step text preprocessing pipeline, and TF-IDF feature extraction.

Data Structure

The dataset is a CSV file with two columns: class (the label) and message (the raw text). It contains 5,572 labeled messages in total.

| Column  | Type   | Description                           |
|---------|--------|---------------------------------------|
| class   | string | Label — either "valid" or "spam"      |
| message | string | Raw text of the email or SMS message  |
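For illustration, a minimal loading sketch using pandas; the inline sample stands in for the real file, and the filename in the comment is an assumption:

```python
import io

import pandas as pd

# Two inline rows standing in for the real file; in the actual project this
# would be something like pd.read_csv("spam.csv") (filename assumed).
csv_data = io.StringIO(
    "class,message\n"
    "valid,Are you free for lunch tomorrow?\n"
    "spam,WINNER!! You've been selected.\n"
)
df = pd.read_csv(csv_data)
print(df.shape)              # (2, 2): rows x (class, message)
print(df["class"].tolist())  # ['valid', 'spam']
```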

Sample Rows

| # | Class | Message                                                         |
|---|-------|-----------------------------------------------------------------|
| 1 | valid | Erm... I'm not sure. I'll let you know.                         |
| 2 | valid | Are you free for lunch tomorrow?                                |
| 3 | valid | I'll be there by 7. Don't start without me.                     |
| 4 | spam  | WINNER!! You've been selected. Claim $1000: www.claimprize.com  |
| 5 | spam  | FREE entry in 2 a weekly competition to win FA Cup final tkts!  |
| 6 | spam  | Urgent! Call 09061743810 NOW. £150 prize guaranteed!            |

Class Distribution

Valid — 4,825 messages (86.6%)
Spam — 747 messages (13.4%)
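With the data in a pandas DataFrame, these figures would come from `df["class"].value_counts(normalize=True)`; the arithmetic itself is just:

```python
counts = {"valid": 4825, "spam": 747}  # class counts from the dataset
total = sum(counts.values())           # 5,572 messages in total
for label, n in counts.items():
    print(f"{label}: {n} ({n / total:.1%})")
# valid: 4825 (86.6%)
# spam: 747 (13.4%)
```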

Preprocessing Pipeline

Raw messages contain noise — uppercase letters, URLs, numbers, punctuation, and filler words — that would confuse a model. Six cleaning steps are applied in sequence before any feature extraction.

01 Lowercase

Normalizes the text so "WINNER" and "winner" are treated identically.

text = str(text).lower()

Before

WINNER!! You've been selected. Click: www.claimprize.com to win $1000 now!!!

After

winner!! you've been selected. click: www.claimprize.com to win $1000 now!!!

02 Remove Hyperlinks

Strips URLs and web addresses that add no semantic value.

text = re.sub(r"http\S+|www\S+", "", text)

Before

winner!! you've been selected. click: www.claimprize.com to win $1000 now!!!

After

winner!! you've been selected. click: to win $1000 now!!!

03 Remove Numbers

Digits like phone numbers and prize amounts are removed.

text = re.sub(r"\d+", "", text)

Before

winner!! you've been selected. click: to win $1000 now!!!

After

winner!! you've been selected. click: to win $ now!!!

04 Remove Special Chars

Strips punctuation, symbols, and anything non-alphabetic.

text = re.sub(r"[^a-z\s]", "", text)

Before

winner!! you've been selected. click: to win $ now!!!

After

winner youve been selected click to win  now

05 Normalize Whitespace

Collapses multiple spaces and trims leading/trailing whitespace.

text = re.sub(r"\s+", " ", text).strip()

Before

winner youve been selected click to win  now (note the double space left where "$" was removed)

After

winner youve been selected click to win now

06 Remove Stop Words

Removes common filler words (the, a, to, you, etc.) that carry no discriminative signal.

tokens = [t for t in text.split() if t not in STOP_WORDS]

Before

winner youve been selected click to win now

After

winner youve selected click win

Complete preprocess() Function

import re

def preprocess(text: str) -> str:
    text = str(text).lower()                                   # Step 1: lowercase
    text = re.sub(r"http\S+|www\S+", "", text)                 # Step 2: remove hyperlinks
    text = re.sub(r"\d+", "", text)                            # Step 3: remove numbers
    text = re.sub(r"[^a-z\s]", "", text)                       # Step 4: remove special characters
    text = re.sub(r"\s+", " ", text).strip()                   # Step 5: normalize whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # Step 6: remove stop words
    return " ".join(tokens)
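preprocess() relies on a STOP_WORDS collection defined elsewhere in the project. A self-contained sketch, with a tiny illustrative stop-word set standing in for a full list (e.g. NLTK's English stop words), reproduces the running example from the six steps above:

```python
import re

# Tiny illustrative stop-word set; the real pipeline would use a full list.
STOP_WORDS = {"a", "been", "in", "is", "the", "to", "you", "now"}

def preprocess(text: str) -> str:
    text = str(text).lower()                                   # Step 1: lowercase
    text = re.sub(r"http\S+|www\S+", "", text)                 # Step 2: remove hyperlinks
    text = re.sub(r"\d+", "", text)                            # Step 3: remove numbers
    text = re.sub(r"[^a-z\s]", "", text)                       # Step 4: remove special characters
    text = re.sub(r"\s+", " ", text).strip()                   # Step 5: normalize whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # Step 6: remove stop words
    return " ".join(tokens)

msg = "WINNER!! You've been selected. Click: www.claimprize.com to win $1000 now!!!"
print(preprocess(msg))  # winner youve selected click win
```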

Text Transformation — TF-IDF

After cleaning, each message must be converted into a numerical vector that a machine learning model can process. This project uses TF-IDF (Term Frequency–Inverse Document Frequency).

Term Frequency (TF)

Measures how often a word appears in a message. A word appearing 5 times in a 20-word message has TF = 0.25. Frequent words in a message are weighted higher.
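Under this plain count-over-length definition (sklearn's TfidfVectorizer uses raw counts internally and normalizes the final vector instead), the arithmetic is:

```python
def term_frequency(term: str, tokens: list) -> float:
    """Fraction of the message's tokens equal to `term`."""
    return tokens.count(term) / len(tokens)

# 5 occurrences of "win" in a 20-word message
tokens = ["win"] * 5 + ["filler"] * 15
print(term_frequency("win", tokens))  # 0.25
```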

Inverse Document Frequency (IDF)

Penalizes words that appear in many documents. A word in 90% of messages gets a low IDF; a rare word like "claim" gets a high IDF.
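IDF has several variants; sklearn's default smoothed form is idf(t) = ln((1 + N) / (1 + df(t))) + 1. A sketch with made-up document frequencies:

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    # sklearn's default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n = 5572                      # total messages in the corpus
print(smoothed_idf(n, 5015))  # in ~90% of messages -> close to the 1.0 floor
print(smoothed_idf(n, 50))    # rare word like "claim" -> much higher weight
```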

Why TF-IDF over Bag of Words?

Bag of Words counts raw word occurrences, giving common words like "the" and "is" equal weight to important spam signals like "winner" or "claim". TF-IDF down-weights common words automatically, making the features more discriminative.
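This is easy to see on a toy corpus (invented for illustration): in a message containing both words once, "the" ends up with a smaller TF-IDF weight than the rarer "winner":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "the" appears in every document, "winner" in only one.
corpus = [
    "the winner gets prize",
    "the meeting is at noon",
    "the report is due",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

row = X[0].toarray()[0]   # TF-IDF vector for the first message
vocab = vectorizer.vocabulary_
# Both words occur once in that message, but "the" is down-weighted by IDF
print(row[vocab["winner"]] > row[vocab["the"]])  # True
```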

Implementation

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on all cleaned messages, transform to a sparse TF-IDF matrix
vectorizer = TfidfVectorizer(max_features=5000)
X_all = vectorizer.fit_transform(df["clean"])
max_features=5000

Limits the vocabulary to the 5,000 most frequent terms across the entire corpus (sklearn ranks candidate terms by corpus-wide frequency, not by informativeness). This reduces dimensionality, prevents overfitting to rare noise words, and keeps the feature matrix manageable in memory.
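The effect is visible on a toy corpus (invented for illustration): with max_features=3, the fitted vocabulary is capped at three terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [                       # toy corpus, invented for illustration
    "winner claim prize now",
    "free entry win prize",
    "lunch tomorrow at noon",
]
# Keep only the 3 most frequent terms across the whole corpus
vectorizer = TfidfVectorizer(max_features=3)
X = vectorizer.fit_transform(corpus)
print(len(vectorizer.vocabulary_))   # 3
print(X.shape)                       # (3, 3): 3 messages x 3 kept terms
```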