Data

Dataset structure, the 6-step text preprocessing pipeline, and TF-IDF feature extraction.

Data Structure

The dataset is a CSV file with two columns: class (the label) and message (the raw text). It contains 5,572 labeled messages in total.

| Column  | Type   | Description                           |
|---------|--------|---------------------------------------|
| class   | string | Label — either "valid" or "spam"      |
| message | string | Raw text of the email or SMS message  |
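For illustration, a minimal loading sketch using pandas; the inline sample stands in for the real file, and the filename in the comment is an assumption:

```python
import io

import pandas as pd

# Two inline rows standing in for the real file; in the actual project this
# would be something like pd.read_csv("spam.csv") (filename assumed).
csv_data = io.StringIO(
    "class,message\n"
    "valid,Are you free for lunch tomorrow?\n"
    "spam,WINNER!! You've been selected.\n"
)
df = pd.read_csv(csv_data)
print(df.shape)              # (2, 2): rows x (class, message)
print(df["class"].tolist())  # ['valid', 'spam']
```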

Sample Rows

| # | Class | Message                                                         |
|---|-------|-----------------------------------------------------------------|
| 1 | valid | Erm... I'm not sure. I'll let you know.                         |
| 2 | valid | Are you free for lunch tomorrow?                                |
| 3 | valid | I'll be there by 7. Don't start without me.                     |
| 4 | spam  | WINNER!! You've been selected. Claim $1000: www.claimprize.com  |
| 5 | spam  | FREE entry in 2 a weekly competition to win FA Cup final tkts!  |
| 6 | spam  | Urgent! Call 09061743810 NOW. £150 prize guaranteed!            |

Class Distribution

Valid — 4,825 messages (86.6%)
Spam — 747 messages (13.4%)
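With the data in a pandas DataFrame, these figures would come from `df["class"].value_counts(normalize=True)`; the arithmetic itself is just:

```python
counts = {"valid": 4825, "spam": 747}  # class counts from the dataset
total = sum(counts.values())           # 5,572 messages in total
for label, n in counts.items():
    print(f"{label}: {n} ({n / total:.1%})")
# valid: 4825 (86.6%)
# spam: 747 (13.4%)
```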

Preprocessing Pipeline

Raw messages contain noise — uppercase letters, URLs, numbers, punctuation, and filler words — that would confuse a model. Six cleaning steps are applied in sequence before any feature extraction.

01 Lowercase

Normalizes the text so "WINNER" and "winner" are treated identically.

text = str(text).lower()

Before

WINNER!! You've been selected. Click: www.claimprize.com to win $1000 now!!!

After

winner!! you've been selected. click: www.claimprize.com to win $1000 now!!!

02 Remove Hyperlinks

Strips URLs and web addresses that add no semantic value.

text = re.sub(r"http\S+|www\S+", "", text)

Before

winner!! you've been selected. click: www.claimprize.com to win $1000 now!!!

After

winner!! you've been selected. click: to win $1000 now!!!

03 Remove Numbers

Digits like phone numbers and prize amounts are removed.

text = re.sub(r"\d+", "", text)

Before

winner!! you've been selected. click: to win $1000 now!!!

After

winner!! you've been selected. click: to win $ now!!!

04 Remove Special Chars

Strips punctuation, symbols, and anything non-alphabetic.

text = re.sub(r"[^a-z\s]", "", text)

Before

winner!! you've been selected. click: to win $ now!!!

After

winner youve been selected click to win  now

05 Normalize Whitespace

Collapses multiple spaces and trims leading/trailing whitespace.

text = re.sub(r"\s+", " ", text).strip()

Before

winner youve been selected click to win  now (note the double space left where "$" was removed)

After

winner youve been selected click to win now

06 Remove Stop Words

Removes common filler words (the, a, to, you, etc.) that carry no discriminative signal.

tokens = [t for t in text.split() if t not in STOP_WORDS]

Before

winner youve been selected click to win now

After

winner youve selected click win

Complete preprocess() Function

import re

def preprocess(text: str) -> str:
    text = str(text).lower()                                   # Step 1: lowercase
    text = re.sub(r"http\S+|www\S+", "", text)                 # Step 2: remove hyperlinks
    text = re.sub(r"\d+", "", text)                            # Step 3: remove numbers
    text = re.sub(r"[^a-z\s]", "", text)                       # Step 4: remove special characters
    text = re.sub(r"\s+", " ", text).strip()                   # Step 5: normalize whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # Step 6: remove stop words
    return " ".join(tokens)
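preprocess() relies on a STOP_WORDS collection defined elsewhere in the project. A self-contained sketch, with a tiny illustrative stop-word set standing in for a full list (e.g. NLTK's English stop words), reproduces the running example from the six steps above:

```python
import re

# Tiny illustrative stop-word set; the real pipeline would use a full list.
STOP_WORDS = {"a", "been", "in", "is", "the", "to", "you", "now"}

def preprocess(text: str) -> str:
    text = str(text).lower()                                   # Step 1: lowercase
    text = re.sub(r"http\S+|www\S+", "", text)                 # Step 2: remove hyperlinks
    text = re.sub(r"\d+", "", text)                            # Step 3: remove numbers
    text = re.sub(r"[^a-z\s]", "", text)                       # Step 4: remove special characters
    text = re.sub(r"\s+", " ", text).strip()                   # Step 5: normalize whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # Step 6: remove stop words
    return " ".join(tokens)

msg = "WINNER!! You've been selected. Click: www.claimprize.com to win $1000 now!!!"
print(preprocess(msg))  # winner youve selected click win
```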

Text Transformation — TF-IDF

After cleaning, each message must be converted into a numerical vector that a machine learning model can process. This project uses TF-IDF (Term Frequency–Inverse Document Frequency).

Term Frequency (TF)

Measures how often a word appears in a message. A word appearing 5 times in a 20-word message has TF = 0.25. Frequent words in a message are weighted higher.
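Under this plain count-over-length definition (sklearn's TfidfVectorizer uses raw counts internally and normalizes the final vector instead), the arithmetic is:

```python
def term_frequency(term: str, tokens: list) -> float:
    """Fraction of the message's tokens equal to `term`."""
    return tokens.count(term) / len(tokens)

# 5 occurrences of "win" in a 20-word message
tokens = ["win"] * 5 + ["filler"] * 15
print(term_frequency("win", tokens))  # 0.25
```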

Inverse Document Frequency (IDF)

Penalizes words that appear in many documents. A word in 90% of messages gets a low IDF; a rare word like "claim" gets a high IDF.
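IDF has several variants; sklearn's default smoothed form is idf(t) = ln((1 + N) / (1 + df(t))) + 1. A sketch with made-up document frequencies:

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    # sklearn's default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n = 5572                      # total messages in the corpus
print(smoothed_idf(n, 5015))  # in ~90% of messages -> close to the 1.0 floor
print(smoothed_idf(n, 50))    # rare word like "claim" -> much higher weight
```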

Why TF-IDF over Bag of Words?

Bag of Words counts raw word occurrences, giving common words like "the" and "is" equal weight to important spam signals like "winner" or "claim". TF-IDF down-weights common words automatically, making the features more discriminative.
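This is easy to see on a toy corpus (invented for illustration): in a message containing both words once, "the" ends up with a smaller TF-IDF weight than the rarer "winner":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "the" appears in every document, "winner" in only one.
corpus = [
    "the winner gets prize",
    "the meeting is at noon",
    "the report is due",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

row = X[0].toarray()[0]   # TF-IDF vector for the first message
vocab = vectorizer.vocabulary_
# Both words occur once in that message, but "the" is down-weighted by IDF
print(row[vocab["winner"]] > row[vocab["the"]])  # True
```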

Implementation

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on all cleaned messages, transform to a sparse TF-IDF matrix
vectorizer = TfidfVectorizer(max_features=5000)
X_all = vectorizer.fit_transform(df["clean"])
max_features=5000

Limits the vocabulary to the 5,000 most frequent terms across the entire corpus (sklearn ranks candidate terms by corpus-wide frequency, not by informativeness). This reduces dimensionality, prevents overfitting to rare noise words, and keeps the feature matrix manageable in memory.
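The effect is visible on a toy corpus (invented for illustration): with max_features=3, the fitted vocabulary is capped at three terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [                       # toy corpus, invented for illustration
    "winner claim prize now",
    "free entry win prize",
    "lunch tomorrow at noon",
]
# Keep only the 3 most frequent terms across the whole corpus
vectorizer = TfidfVectorizer(max_features=3)
X = vectorizer.fit_transform(corpus)
print(len(vectorizer.vocabulary_))   # 3
print(X.shape)                       # (3, 3): 3 messages x 3 kept terms
```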