Analysis & Modeling

Word clouds, train/test splitting, four classification models, SMOTE oversampling, and a full performance comparison before and after SMOTE.

Word Clouds

After preprocessing, the most frequent words in each class are visualized. The size of each word is proportional to its frequency in that class's corpus.

Word cloud — valid emails
Valid emails — dominated by conversational words: call, good, know, time, come, will. Natural, personal language.
Word cloud — spam emails
Spam emails — dominated by urgency and reward words: free, call, claim, text, prize, win. Manipulative, transactional language.
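A minimal sketch of how these clouds can be generated, assuming the preprocessed corpus lives in a pandas DataFrame named df with "text" and "label" columns (both names are assumptions):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# df: preprocessed corpus (assumed), with "text" and "label" columns
for label in ("valid", "spam"):
    corpus = " ".join(df.loc[df["label"] == label, "text"])
    cloud = WordCloud(width=800, height=400, background_color="white").generate(corpus)
    plt.figure()
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud — {label} messages")
plt.show()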

Data Splitting

The dataset is split into a training set (80%) used to fit the models, and a test set (20%) held out for unbiased evaluation. Stratification ensures the spam/valid ratio is preserved in both splits.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all,
    test_size=0.2,      # 20% held out for evaluation
    random_state=42,    # reproducibility
    stratify=y_all,     # preserve the class ratio in both splits
)

Training Set

4,457

80% · used to fit models

Test Set

1,115

20% · held out for evaluation

Models Before SMOTE

Four classifiers are trained on the original imbalanced training set. Each model is evaluated on the held-out test set with a confusion matrix.
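A minimal sketch of this shared protocol, assuming X_train, X_test, y_train, y_test come from the split above and that the labels are the strings "spam" / "valid" (with 0/1 labels, pass pos_label=1 instead):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix

models = {
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)      # fit on the imbalanced training set
    y_pred = model.predict(X_test)   # predict on the held-out test set
    print(name, f1_score(y_test, y_pred, pos_label="spam"))  # spam-class F1
    print(confusion_matrix(y_test, y_pred))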

Instance-based

K-Nearest Neighbors

F1 87.2%

Classifies by majority vote of the 5 nearest training examples, measured by cosine similarity between TF-IDF vectors.

KNeighborsClassifier(n_neighbors=5, metric="cosine")

Confusion Matrix

Confusion matrix for K-Nearest Neighbors
Probabilistic

Naive Bayes

F1 85.5%

Applies Bayes' theorem assuming feature independence. Formally a model of word counts, but it works well with TF-IDF features for text classification in practice.

MultinomialNB()

Confusion Matrix

Confusion matrix for Naive Bayes
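Because MultinomialNB stores per-class log-likelihoods, the terms that most signal spam can be read off directly. A sketch, assuming the fitted TF-IDF vectorizer from preprocessing is available as vectorizer (an assumption):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB().fit(X_train, y_train)
spam_idx = list(nb.classes_).index("spam")
# log P(term | spam) - log P(term | valid): higher means more spam-indicative
log_odds = nb.feature_log_prob_[spam_idx] - nb.feature_log_prob_[1 - spam_idx]
terms = np.array(vectorizer.get_feature_names_out())  # vectorizer assumed from preprocessing
print(terms[np.argsort(log_odds)[-10:]])              # ten most spam-indicative terms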
Linear

Logistic Regression

F1 82.3%

Learns a linear decision boundary in the TF-IDF feature space, outputting the probability of spam membership.

LogisticRegression(max_iter=1000, random_state=42)

Confusion Matrix

Confusion matrix for Logistic Regression
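To make the probability output concrete, a short sketch under the same assumptions as above:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)
spam_col = list(lr.classes_).index("spam")
p_spam = lr.predict_proba(X_test)[:, spam_col]  # P(spam) for each test message
print(p_spam[:5])
# predict() applies the default 0.5 threshold: spam when P(spam) > 0.5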
Ensemble

Random Forest

F1 90.8%

Trains 100 independent decision trees and aggregates their votes. Robust to overfitting and handles high-dimensional text well.

RandomForestClassifier(n_estimators=100, random_state=42)

Confusion Matrix

Confusion matrix for Random Forest
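Beyond aggregating votes, the fitted forest exposes impurity-based feature importances averaged across its 100 trees, showing which terms drive its splits. A sketch, again assuming the fitted vectorizer:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
top = np.argsort(rf.feature_importances_)[-10:][::-1]  # ten most important features
print(np.array(vectorizer.get_feature_names_out())[top])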

SMOTE — Addressing Class Imbalance

The training set contains only about 20% spam messages. Models trained on imbalanced data tend to favor the majority class (valid), resulting in poor spam recall. SMOTE (Synthetic Minority Over-sampling Technique) mitigates this by generating synthetic spam samples in the feature space.

Before SMOTE

Valid 3,566
Spam 891

Heavily imbalanced — 20% spam

After SMOTE

Valid 3,566
Spam (+ synthetic) 3,566

Balanced — 50/50 split

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
# Result: training set grows from 4,457 → 7,132 samples
# Both classes now have exactly 3,566 examples
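Under the hood, each synthetic sample is a random interpolation between a minority example and one of its k nearest minority neighbors. A toy illustration of that core step:

import numpy as np

rng = np.random.default_rng(42)
x_i = np.array([0.9, 0.1])        # a minority (spam) sample in feature space
x_nn = np.array([0.7, 0.3])       # one of its k nearest minority neighbors
lam = rng.random()                # random interpolation factor in [0, 1)
x_new = x_i + lam * (x_nn - x_i)  # synthetic sample on the segment between them
print(x_new)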

Models After SMOTE

The same four models are retrained on the SMOTE-balanced training set and evaluated on the same original test set.
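A sketch of the retraining loop, reusing the models dict and assumptions from the earlier sketch:

from sklearn.metrics import f1_score

# models, X_train_sm, y_train_sm, X_test, y_test as defined above
for name, model in models.items():
    model.fit(X_train_sm, y_train_sm)  # refit on the balanced training set
    y_pred = model.predict(X_test)     # evaluate on the untouched original test set
    print(name, f1_score(y_test, y_pred, pos_label="spam"))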

K-Nearest Neighbors

F1 83.9% -3.3% vs before
KNeighborsClassifier(n_neighbors=5, metric="cosine")

Confusion Matrix (Post-SMOTE)

Post-SMOTE confusion matrix for K-Nearest Neighbors

Naive Bayes

F1 89.0% +3.5% vs before
MultinomialNB()

Confusion Matrix (Post-SMOTE)

Post-SMOTE confusion matrix for Naive Bayes

Logistic Regression

F1 91.6% +9.2% vs before
LogisticRegression(max_iter=1000, random_state=42)

Confusion Matrix (Post-SMOTE)

Post-SMOTE confusion matrix for Logistic Regression

Random Forest

F1 93.6% +2.7% vs before
RandomForestClassifier(n_estimators=100, random_state=42)

Confusion Matrix (Post-SMOTE)

Post-SMOTE confusion matrix for Random Forest

Performance Comparison

All metrics are measured on the held-out 20% test set. F1 scores reflect spam-class performance.

Key finding: SMOTE improved spam-class F1 for 3 of 4 models. Logistic Regression improved the most (+9.2% F1). KNN was the only model to decline slightly (-3.3% F1). Random Forest achieved the highest absolute F1 after SMOTE at 93.6%.

Model                 Accuracy Before   Accuracy After   F1 Before   F1 After   F1 Delta
K-Nearest Neighbors   97.0%             95.0%            87.2%       83.9%      -3.3%
Naive Bayes           97.0%             97.0%            85.5%       89.0%      +3.5%
Logistic Regression   96.0%             98.0%            82.3%       91.6%      +9.2%
Random Forest         98.0%             98.0%            90.8%       93.6%      +2.7%