Perform Imports and Load Data

For this exercise we'll be using the SMSSpamCollection dataset from the UCI Machine Learning Repository, which contains over 5,500 SMS phone messages.
You can check out the sms_readme file for more info.

The file is a tab-separated-values (tsv) file with four columns:

label - every message is labeled as either ham or spam
message - the message itself
length - the number of characters in each message
punct - the number of punctuation characters in each message
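
The last two columns are derived from the message text itself rather than being part of the raw collection. As an illustrative sketch of how such columns can be computed (using one message from the data below as an example):

import string

msg = "Ok lar... Joking wif u oni..."                  # row 1 of the data below
length = len(msg)                                      # 29 characters
punct = sum(ch in string.punctuation for ch in msg)    # 6 punctuation marks
print(length, punct)                                   # 29 6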

import numpy as np
import pandas as pd

df = pd.read_csv("data_files/smsspamcollection.tsv", sep='\t')
df.head()
label message length punct
0 ham Go until jurong point, crazy.. Available only ... 111 9
1 ham Ok lar... Joking wif u oni... 29 6
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155 6
3 ham U dun say so early hor... U c already then say... 49 6
4 ham Nah I don't think he goes to usf, he lives aro... 61 2

Check the number of rows:

len(df)
5572

Check for missing values:

df.isnull().sum()
label      0
message    0
length     0
punct      0
dtype: int64

Check the unique values and counts of the target variable:

df['label'].unique()
array(['ham', 'spam'], dtype=object)
df['label'].value_counts()
ham     4825
spam     747
Name: label, dtype: int64

We see that 4825 out of 5572 messages, or 86.6%, are ham.
This means that any machine learning model we create has to perform **better than 86.6% accuracy** to beat the naïve baseline of always predicting "ham".
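
That baseline can be computed directly from the class proportions:

# Share of each class; the majority share (ham) is the accuracy of a model
# that always predicts 'ham'.
print(df['label'].value_counts(normalize=True))
# ham     0.865937
# spam    0.134063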

Visualize the data

Check the length column:

df['length'].describe()
count    5572.000000
mean       80.489950
std        59.942907
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64

The length distribution is heavily right-skewed: the mean is 80.5 characters, yet the maximum length is 910. Let's plot this on a logarithmic x-axis.

import matplotlib.pyplot as plt
%matplotlib inline

plt.xscale('log')
bins = 1.15**(np.arange(0, 50))   # renamed from `bin`, which shadows the builtin
plt.hist(df[df['label']=='ham']['length'], bins=bins, alpha=0.8)
plt.hist(df[df['label']=='spam']['length'], bins=bins, alpha=0.8)
plt.legend(('ham', 'spam'))
plt.show()

It looks like there's a small range of values where a message is more likely to be spam than ham.
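
We can quantify that impression with a quick per-label summary (a sanity check, not part of the original walkthrough):

# Summary statistics of message length, split by label.
# Spam should show a noticeably higher typical length than ham.
print(df.groupby('label')['length'].describe())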

Check the punct column:

df['punct'].describe()
count    5572.000000
mean        4.177495
std         4.623919
min         0.000000
25%         2.000000
50%         3.000000
75%         6.000000
max       133.000000
Name: punct, dtype: float64
plt.xscale('log')
bins = 1.5**(np.arange(0, 15))
plt.hist(df[df['label']=='ham']['punct'], bins=bins, alpha=0.8)
plt.hist(df[df['label']=='spam']['punct'], bins=bins, alpha=0.8)
plt.legend(('ham', 'spam'))
plt.show()

This looks even worse: there seems to be no range of punctuation counts where spam is more likely than ham. We'll still try to build a machine learning classification model, but we should expect poor results.
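
One way to double-check that impression is to look at the class mix inside coarse punctuation bands (the bin edges here are arbitrary choices for illustration):

# Class proportions within hand-picked punctuation-count bands.
punct_bands = pd.cut(df['punct'], bins=[-1, 2, 5, 10, 133])
print(df.groupby(punct_bands)['label'].value_counts(normalize=True))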


Split the data into train & test sets:

If we wanted to divide the DataFrame into two smaller sets, we could use

train, test = train_test_split(df)

For our purposes let's also set up our Features (X) and Labels (y). The Label is simple - we're trying to predict the label column in our data. For Features we'll use the length and punct columns. By convention, X is capitalized and y is lowercase.

Selecting features

There are two ways to build a feature set from the columns we want. If the number of features is small, then we can pass those in directly:

X = df[['length','punct']]

If the number of features is large, then it may be easier to drop the label and any other unwanted columns:

X = df.drop(['label','message'], axis=1)

Both operations return copies of df; they do not change the original DataFrame in place, so all the original data is preserved.

X = df[['length','punct']]
y = df['label']

Additional train_test_split arguments:

The default test size for train_test_split is 25%. Here we'll assign 33% of the data for testing.
Also, we can set a random_state seed value to ensure that everyone uses the same "random" training & testing sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print('Training data Shape: ', X_train.shape)
print('Testing data Shape: ', X_test.shape)
Training data Shape:  (3733, 2)
Testing data Shape:  (1839, 2)
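
Since ham outnumbers spam roughly 6-to-1, it can also help to pass stratify=y so that both splits preserve the ham/spam ratio. We keep the unstratified split above so the results below match; this variant is just a sketch:

# Optional variant: keep the ham/spam ratio identical in train and test.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)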

Train a Logistic Regression classifier

One of the simplest classification tools is logistic regression. Scikit-learn offers a variety of algorithmic solvers; we'll use L-BFGS.

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='lbfgs')

lr_model.fit(X_train, y_train)
LogisticRegression()
from sklearn import metrics

Create a prediction set:
predictions = lr_model.predict(X_test)

Print a confusion matrix:
print(metrics.confusion_matrix(y_test, predictions))
[[1547   46]
 [ 241    5]]
df_conf_lr = pd.DataFrame(metrics.confusion_matrix(y_test, predictions),
                          index=['ham', 'spam'], columns=['ham', 'spam'])
df_conf_lr
       ham  spam
ham   1547    46
spam   241     5

These results are terrible! Far more spam messages were misclassified as ham (241) than were correctly identified as spam (5), although a relatively small number of ham messages (46) were misclassified as spam.
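
For reference, the four cells can be unpacked directly. With string labels, scikit-learn orders rows (true) and columns (predicted) alphabetically, so treating spam as the positive class:

# Rows are true labels, columns are predictions, both sorted ('ham', 'spam').
tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
print(f'ham kept: {tn}, ham flagged as spam: {fp}, '
      f'spam missed: {fn}, spam caught: {tp}')
# ham kept: 1547, ham flagged as spam: 46, spam missed: 241, spam caught: 5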

Print a classification report

print(metrics.classification_report(y_test, predictions))
              precision    recall  f1-score   support

         ham       0.87      0.97      0.92      1593
        spam       0.10      0.02      0.03       246

    accuracy                           0.84      1839
   macro avg       0.48      0.50      0.47      1839
weighted avg       0.76      0.84      0.80      1839
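
The spam row follows directly from the confusion matrix: precision = 5/(5+46) ≈ 0.10 and recall = 5/246 ≈ 0.02. A quick sanity check by hand:

# Recompute the spam-row metrics from the confusion-matrix cells.
precision_spam = 5 / (5 + 46)   # TP / (TP + FP) ≈ 0.098
recall_spam = 5 / 246           # TP / (TP + FN) ≈ 0.020
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
print(precision_spam, recall_spam, f1_spam)   # ≈ 0.098 0.020 0.033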

Print overall accuracy

print(metrics.accuracy_score(y_test, predictions))
0.843936922240348

This model performed *worse* than a classifier that simply labeled every message "ham" would have!
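
That all-ham baseline can be made explicit with scikit-learn's DummyClassifier (shown here for comparison):

from sklearn.dummy import DummyClassifier

# Always predicts the majority class seen in training ('ham').
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))   # 1593/1839 ≈ 0.866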


Train a naïve Bayes classifier:

One of the most common - and successful - classifiers is naïve Bayes. We'll use MultinomialNB, which is designed for discrete count features like our length and punct columns.

from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
MultinomialNB()
predictions_nb = nb_model.predict(X_test)
print(metrics.confusion_matrix(y_test, predictions_nb))
[[1583   10]
 [ 246    0]]

The total number of confusions dropped from **287** to **256** (logistic regression: 241 + 46 = 287; naïve Bayes: 246 + 10 = 256). Note, however, that naïve Bayes now identifies zero spam messages correctly.

print(metrics.classification_report(y_test, predictions_nb))
              precision    recall  f1-score   support

         ham       0.87      0.99      0.93      1593
        spam       0.00      0.00      0.00       246

    accuracy                           0.86      1839
   macro avg       0.43      0.50      0.46      1839
weighted avg       0.75      0.86      0.80      1839

print(metrics.accuracy_score(y_test, predictions_nb))
0.8607939097335509

Train a support vector machine (SVM) classifier

Among the SVM options available, we'll use C-Support Vector Classification (SVC).

from sklearn.svm import SVC

svc_model = SVC(gamma='auto')
svc_model.fit(X_train, y_train)
SVC(gamma='auto')
predictions_svm = svc_model.predict(X_test)
print(metrics.confusion_matrix(y_test, predictions_svm))
[[1515   78]
 [ 131  115]]
print(metrics.classification_report(y_test, predictions_svm))
              precision    recall  f1-score   support

         ham       0.92      0.95      0.94      1593
        spam       0.60      0.47      0.52       246

    accuracy                           0.89      1839
   macro avg       0.76      0.71      0.73      1839
weighted avg       0.88      0.89      0.88      1839

print(metrics.accuracy_score(y_test, predictions_svm))
0.8863512778684067

At roughly 88.6%, the SVM is the first of our models to beat the 86.6% all-ham baseline.
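
To wrap up, here's a small side-by-side of the three fitted models on the same test set (a convenience loop, not part of the original walkthrough):

# Compare test accuracy across the three fitted models.
for name, model in [('LogReg', lr_model), ('NaiveBayes', nb_model), ('SVC', svc_model)]:
    acc = metrics.accuracy_score(y_test, model.predict(X_test))
    print(f'{name:12s} {acc:.4f}')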