Analysis & Modeling

Word clouds, train/test splitting, four classification models, SMOTE oversampling, and a full performance comparison before and after SMOTE.

Word Clouds

After preprocessing, the most frequent words in each class are visualized. The size of each word is proportional to its frequency in that class's corpus.

Word cloud — valid emails
Valid emails — dominated by conversational words: call, good, know, time, come, will. Natural, personal language.
Word cloud — spam emails
Spam emails — dominated by urgency and reward words: free, call, claim, text, prize, win. Manipulative, transactional language.
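A minimal sketch of how these clouds can be generated, assuming the preprocessed corpus lives in a pandas DataFrame named df with "text" and "label" columns (both names are assumptions):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# df: preprocessed corpus (assumed), with "text" and "label" columns
for label in ("valid", "spam"):
    corpus = " ".join(df.loc[df["label"] == label, "text"])
    cloud = WordCloud(width=800, height=400, background_color="white").generate(corpus)
    plt.figure()
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud — {label} messages")
plt.show()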

Data Splitting

The dataset is split into a training set (80%) used to fit the models, and a test set (20%) held out for unbiased evaluation. Stratification ensures the spam/valid ratio is preserved in both splits.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all,
    test_size=0.2,      # 20% held out for evaluation
    random_state=42,    # reproducibility
    stratify=y_all,     # preserve the class ratio in both splits
)

Training Set

4,457

80% · used to fit models

Test Set

1,115

20% · held out for evaluation

Models Before SMOTE

Four classifiers are trained on the original imbalanced training set. Each model is evaluated on the held-out test set with a confusion matrix.
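A minimal sketch of this shared protocol, assuming X_train, X_test, y_train, y_test come from the split above and that the labels are the strings "spam" / "valid" (with 0/1 labels, pass pos_label=1 instead):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix

models = {
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)      # fit on the imbalanced training set
    y_pred = model.predict(X_test)   # predict on the held-out test set
    print(name, f1_score(y_test, y_pred, pos_label="spam"))  # spam-class F1
    print(confusion_matrix(y_test, y_pred))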

Instance-based

K-Nearest Neighbors

F1 87.2%

Classifies by majority vote of the 5 nearest training examples, measured by cosine similarity between TF-IDF vectors.

KNeighborsClassifier(n_neighbors=5, metric="cosine")

Confusion Matrix

Confusion matrix for K-Nearest Neighbors
Probabilistic

Naive Bayes

F1 85.5%

Applies Bayes' theorem assuming feature independence. Formally a model of word counts, but it works well with TF-IDF features for text classification in practice.

MultinomialNB()

Confusion Matrix

Confusion matrix for Naive Bayes
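Because MultinomialNB stores per-class log-likelihoods, the terms that most signal spam can be read off directly. A sketch, assuming the fitted TF-IDF vectorizer from preprocessing is available as vectorizer (an assumption):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB().fit(X_train, y_train)
spam_idx = list(nb.classes_).index("spam")
# log P(term | spam) - log P(term | valid): higher means more spam-indicative
log_odds = nb.feature_log_prob_[spam_idx] - nb.feature_log_prob_[1 - spam_idx]
terms = np.array(vectorizer.get_feature_names_out())  # vectorizer assumed from preprocessing
print(terms[np.argsort(log_odds)[-10:]])              # ten most spam-indicative terms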
Linear

Logistic Regression

F1 82.3%

Learns a linear decision boundary in the TF-IDF feature space, outputting the probability of spam membership.

LogisticRegression(max_iter=1000, random_state=42)

Confusion Matrix

Confusion matrix for Logistic Regression
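To make the probability output concrete, a short sketch under the same assumptions as above:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)
spam_col = list(lr.classes_).index("spam")
p_spam = lr.predict_proba(X_test)[:, spam_col]  # P(spam) for each test message
print(p_spam[:5])
# predict() applies the default 0.5 threshold: spam when P(spam) > 0.5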
Ensemble

Random Forest

F1 90.8%

Trains 100 independent decision trees and aggregates their votes. Robust to overfitting and handles high-dimensional text well.

RandomForestClassifier(n_estimators=100, random_state=42)

Confusion Matrix

Confusion matrix for Random Forest
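Beyond aggregating votes, the fitted forest exposes impurity-based feature importances averaged across its 100 trees, showing which terms drive its splits. A sketch, again assuming the fitted vectorizer:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
top = np.argsort(rf.feature_importances_)[-10:][::-1]  # ten most important features
print(np.array(vectorizer.get_feature_names_out())[top])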

SMOTE — Addressing Class Imbalance

The training set contains only about 20% spam messages. Models trained on imbalanced data tend to favor the majority class (valid), resulting in poor spam recall. SMOTE (Synthetic Minority Over-sampling Technique) mitigates this by generating synthetic spam samples in the feature space.

Before SMOTE

Valid 3,566
Spam 891

Heavily imbalanced — 20% spam

After SMOTE

Valid 3,566
Spam (+ synthetic) 3,566

Balanced — 50/50 split

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
# Result: training set grows from 4,457 → 7,132 samples
# Both classes now have exactly 3,566 examples
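Under the hood, each synthetic sample is a random interpolation between a minority example and one of its k nearest minority neighbors. A toy illustration of that core step:

import numpy as np

rng = np.random.default_rng(42)
x_i = np.array([0.9, 0.1])        # a minority (spam) sample in feature space
x_nn = np.array([0.7, 0.3])       # one of its k nearest minority neighbors
lam = rng.random()                # random interpolation factor in [0, 1)
x_new = x_i + lam * (x_nn - x_i)  # synthetic sample on the segment between them
print(x_new)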

Models After SMOTE

The same four models are retrained on the SMOTE-balanced training set and evaluated on the same original test set.
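A sketch of the retraining loop, reusing the models dict and assumptions from the earlier sketch:

from sklearn.metrics import f1_score

# models, X_train_sm, y_train_sm, X_test, y_test as defined above
for name, model in models.items():
    model.fit(X_train_sm, y_train_sm)  # refit on the balanced training set
    y_pred = model.predict(X_test)     # evaluate on the untouched original test set
    print(name, f1_score(y_test, y_pred, pos_label="spam"))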

K-Nearest Neighbors

F1 83.9% -3.3% vs before
KNeighborsClassifier(n_neighbors=5, metric="cosine")

Confusion Matrix (Post-SMOTE)

Post-SMOTE confusion matrix for K-Nearest Neighbors

Naive Bayes

F1 89.0% +3.5% vs before
MultinomialNB()

Confusion Matrix (Post-SMOTE)

Post-SMOTE confusion matrix for Naive Bayes

Logistic Regression

F1 91.6% +9.2% vs before
LogisticRegression(max_iter=1000, random_state=42)

Confusion Matrix (Post-SMOTE)

Post-SMOTE confusion matrix for Logistic Regression

Random Forest

F1 93.6% +2.7% vs before
RandomForestClassifier(n_estimators=100, random_state=42)

Confusion Matrix (Post-SMOTE)

Post-SMOTE confusion matrix for Random Forest

Performance Comparison

All metrics are measured on the held-out 20% test set. F1 scores reflect spam-class performance.

Key finding: SMOTE improved spam-class F1 for 3 of 4 models. Logistic Regression improved the most (+9.2% F1). KNN was the only model to decline slightly (-3.3% F1). Random Forest achieved the highest absolute F1 after SMOTE at 93.6%.

Model                 Accuracy Before   Accuracy After   F1 Before   F1 After   F1 Delta
K-Nearest Neighbors   97.0%             95.0%            87.2%       83.9%      -3.3%
Naive Bayes           97.0%             97.0%            85.5%       89.0%      +3.5%
Logistic Regression   96.0%             98.0%            82.3%       91.6%      +9.2%
Random Forest         98.0%             98.0%            90.8%       93.6%      +2.7%