This section covers the dataset structure, the six-step text preprocessing pipeline, and TF-IDF feature extraction.
Dataset Structure

The dataset is a CSV file with two columns: class (the label) and message (the raw text). It contains 5,572 labeled messages in total.
| Column | Type | Description |
|---|---|---|
| class | string | Label — either "valid" or "spam" |
| message | string | Raw text of the email or SMS message |
Sample Rows

| # | Class | Message |
|---|---|---|
| 1 | valid | |
| 2 | valid | |
| 3 | valid | |
| 4 | spam | |
| 5 | spam | |
| 6 | spam | |
Class Distribution

The two classes are heavily imbalanced: of the 5,572 messages, 4,825 (86.6%) are valid and 747 (13.4%) are spam.
Text Preprocessing Pipeline

Raw messages contain noise — uppercase letters, URLs, numbers, punctuation, and filler words — that would confuse a model. Six cleaning steps are applied in sequence before any feature extraction.
Step 1: Lowercasing

Normalizes the text so "WINNER" and "winner" are treated identically.
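A one-line sketch of this step; the sample string is illustrative, not from the original:

```python
text = "WINNER!! You have WON a prize"
print(text.lower())  # -> "winner!! you have won a prize"
```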
Step 2: URL Removal

Strips URLs and web addresses that add no semantic value.
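One way to do this is with a regular expression; the exact pattern here is an assumption:

```python
import re

text = "free entry! visit http://spam.example.com to win"
text = re.sub(r"http\S+|www\.\S+", " ", text)
print(text)  # -> "free entry! visit   to win" (extra spaces are collapsed in step 5)
```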
Step 3: Number Removal

Digits like phone numbers and prize amounts are removed.
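A minimal sketch (the phone number and regex are illustrative assumptions):

```python
import re

text = "call 08001234567 to claim your 500 prize"
print(re.sub(r"\d+", " ", text))  # -> "call   to claim your   prize"
```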
Step 4: Punctuation Removal

Strips punctuation, symbols, and anything non-alphabetic.
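A sketch assuming the text has already been lowercased in step 1; the character class is an assumption:

```python
import re

text = "winner!!! claim your prize now... (t&cs apply)"
text = re.sub(r"[^a-z\s]", " ", text)
print(text)  # non-letters become spaces; step 5 cleans up the resulting gaps
```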
Step 5: Whitespace Normalization

Collapses multiple spaces and trims leading/trailing whitespace.
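A compact idiom for this step — str.split() with no arguments splits on any whitespace run and drops leading/trailing space:

```python
text = "  winner    claim your prize   "
print(" ".join(text.split()))  # -> "winner claim your prize"
```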
Step 6: Stopword Removal

Removes common filler words (the, a, to, you, etc.) that carry no discriminative signal.
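A sketch using NLTK's English stopword list; the original project may use a different list:

```python
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
text = "you are the winner claim your prize now"
print(" ".join(w for w in text.split() if w not in stop_words))  # -> "winner claim prize"
```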
Complete preprocess() Function
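The function body did not survive in the source; below is a minimal reconstruction that chains the six steps in order. The regexes and the NLTK stopword list are assumptions carried over from the sketches above:

```python
import re
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Apply the six cleaning steps in sequence and return the cleaned message."""
    text = text.lower()                            # 1. lowercase
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # 2. strip URLs
    text = re.sub(r"\d+", " ", text)               # 3. strip digits
    text = re.sub(r"[^a-z\s]", " ", text)          # 4. keep only letters and whitespace
    text = " ".join(text.split())                  # 5. collapse and trim whitespace
    return " ".join(w for w in text.split() if w not in STOP_WORDS)  # 6. drop stopwords

print(preprocess("WINNER!! Visit http://spam.example.com or call 08001234567 to claim your prize"))
# -> "winner visit call claim prize"
```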
TF-IDF Feature Extraction

After cleaning, each message must be converted into a numerical vector that a machine learning model can process. This project uses TF-IDF (Term Frequency–Inverse Document Frequency).
Term Frequency (TF)
Measures how often a word appears in a message. A word appearing 5 times in a 20-word message has TF = 0.25. Frequent words in a message are weighted higher.
Inverse Document Frequency (IDF)
Penalizes words that appear in many documents. A word in 90% of messages gets a low IDF; a rare word like "claim" gets a high IDF.
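A toy, by-hand illustration of both quantities (the corpus and numbers are my own; library implementations such as scikit-learn use a smoothed IDF variant):

```python
import math

corpus = [
    "did you get my message",
    "are you free for lunch today",
    "you won a prize claim it now",
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    df = sum(term in d.split() for d in docs)  # number of documents containing the term
    return math.log(len(docs) / df)

print(idf("you", corpus))    # 0.0   -- appears in every document, no signal
print(idf("claim", corpus))  # ~1.10 -- appears in one document, strong signal
print(tf("claim", corpus[2]) * idf("claim", corpus))  # ~0.157, the TF-IDF weight
```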
Why TF-IDF over Bag of Words?
Bag of Words counts raw word occurrences, giving common words like "the" and "is" equal weight to important spam signals like "winner" or "claim". TF-IDF down-weights common words automatically, making the features more discriminative.
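A quick demonstration of this difference with scikit-learn (toy corpus, default settings):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# "you" appears in every message; "claim" appears in only one.
docs = [
    "you won a prize claim it now",
    "are you free for lunch today",
    "did you get my message",
]

for Vectorizer in (CountVectorizer, TfidfVectorizer):
    vec = Vectorizer()
    row = vec.fit_transform(docs).toarray()[0]
    v = vec.vocabulary_
    print(Vectorizer.__name__, "you:", round(row[v["you"]], 3), "claim:", round(row[v["claim"]], 3))

# CountVectorizer weights both words equally; TfidfVectorizer gives the
# ubiquitous "you" a noticeably lower weight than the rare "claim".
```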
Implementation
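The original implementation block is missing from the source; below is a plausible sketch using scikit-learn's TfidfVectorizer. Only max_features=5000 is confirmed by the text; the rest is assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# `cleaned` would be the output of preprocess() over all messages;
# a tiny stand-in corpus is used here so the snippet runs on its own.
cleaned = ["winner claim prize", "lunch later today", "free entry claim prize"]

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(cleaned)  # sparse matrix, shape (n_messages, min(5000, vocab size))
print(X.shape)
```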
max_features=5000
Caps the vocabulary at the 5,000 terms with the highest frequency across the training corpus. This reduces dimensionality, prevents overfitting to rare noise words, and keeps the feature matrix manageable in memory.