4.1 Scikit-learn Primer
This post explains the basics of Scikit-learn.
Perform Imports and Load Data
For this exercise we'll be using the SMSSpamCollection dataset from the UCI Machine Learning Repository, which contains more than 5,000 SMS phone messages.
You can check out the sms_readme file for more info.
The file is a tab-separated-values (tsv) file with four columns:
label - every message is labeled as either ham or spam
message - the message itself
length - the number of characters in each message
punct - the number of punctuation characters in each message
import numpy as np
import pandas as pd
df = pd.read_csv("data_files/smsspamcollection.tsv", sep = '\t')
df.head()
Check the number of rows, look for missing values, and inspect the label column:
len(df)
df.isnull().sum()
df['label'].unique()
df['label'].value_counts()
We see that 4825 out of 5572 messages, or 86.6%, are ham.
This means that any machine learning model we create has to perform **better than 86.6%** to beat the simplest possible baseline: always predicting "ham".
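As a sanity check, pandas can compute that baseline fraction for us; a minimal sketch using the df loaded above:
# Fraction of each label across the whole dataset; 'ham' comes out to roughly 0.866
df['label'].value_counts(normalize=True)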
df['length'].describe()
This dataset is extremely skewed: the mean message length is 80.5 characters, yet the maximum is 910. Let's plot the lengths on a logarithmic x-axis.
import matplotlib.pyplot as plt
%matplotlib inline
plt.xscale('log')
bins = 1.15**(np.arange(0,50))
plt.hist(df[df['label']=='ham']['length'], bins=bins, alpha=0.8)
plt.hist(df[df['label']=='spam']['length'], bins=bins, alpha=0.8)
plt.legend(('ham', 'spam'))
plt.show()
It looks like there's a small range of values where a message is more likely to be spam than ham.
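To put a number on that impression, it can help to compare the length statistics per label; a quick sketch:
# Summary statistics of message length, split by label
df.groupby('label')['length'].describe()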
df['punct'].describe()
plt.xscale('log')
bins = 1.5**(np.arange(0,15))
plt.hist(df[df['label']=='ham']['punct'], bins=bins, alpha=0.8)
plt.hist(df[df['label']=='spam']['punct'], bins=bins, alpha=0.8)
plt.legend(('ham', 'spam'))
plt.show()
This looks even worse - there seem to be no values where one would pick spam over ham. We'll still try to build a machine learning classification model, but we should expect poor results.
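A quick numeric comparison tells a similar story; a small sketch of the per-label averages for both features:
# Average length and punctuation count per label
df.groupby('label')[['length','punct']].mean()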
Split the data into train & test sets:
If we wanted to divide the DataFrame into two smaller sets, we could use:
train, test = train_test_split(df)
For our purposes let's also set up our Features (X) and Labels (y). The Label is simple - we're trying to predict the label column in our data. For Features we'll use the length and punct columns. By convention, X is capitalized and y is lowercase.
Selecting features
There are two ways to build a feature set from the columns we want. If the number of features is small, then we can pass those in directly:
X = df[['length','punct']]
If the number of features is large, then it may be easier to drop the Label and any other unwanted columns:
X = df.drop(['label','message'], axis=1)
These operations make copies of df, but do not change the original DataFrame in place. All the original data is preserved.
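If you want to convince yourself of that, a quick check is to look at df again after building X; a sketch:
# df still has every row and all four original columns
print(df.shape)
print(df.columns.tolist())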
X = df[['length','punct']]
y = df['label']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print('Training data Shape: ', X_train.shape)
print('Testing data Shape: ', X_test.shape)
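As a quick check, the two pieces should add back up to the full dataset:
# Roughly one third of the 5572 rows ends up in the test set
print(len(X_train) + len(X_test) == len(df))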
Train a Logistic Regression classifier
One of the simplest multi-class classification tools is logistic regression. Scikit-learn offers a variety of algorithmic solvers; we'll use L-BFGS.
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(solver = 'lbfgs')
lr_model.fit(X_train, y_train)
from sklearn import metrics
### Create a prediction set:
predictions = lr_model.predict(X_test)
### Confusion matrix
print(metrics.confusion_matrix(y_test,predictions))
df_conf_lr = pd.DataFrame(metrics.confusion_matrix(y_test,predictions), index = ['ham', 'spam'], columns = ['ham', 'spam'])
df_conf_lr
These results are terrible! More spam messages were confused as ham (241) than correctly identified as spam (5), although a relatively small number of ham messages (46) were confused as spam.
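To make that concrete, here is a small sketch of pulling the spam recall out of the confusion matrix by hand (scikit-learn orders the labels alphabetically, so row 0 is ham and row 1 is spam):
conf = metrics.confusion_matrix(y_test, predictions)
# spam recall = spam caught / all spam in the test set = 5 / (241 + 5)
spam_recall = conf[1, 1] / conf[1].sum()
print(spam_recall)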
Print a classification report
print(metrics.classification_report(y_test, predictions))
Print overall accuracy
print(metrics.accuracy_score(y_test, predictions))
This model performed *worse* than a classifier that assigned all messages as "ham" would have!
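If you want to verify that claim, scikit-learn's DummyClassifier can stand in for the always-"ham" baseline; a minimal sketch (the baseline_model name is just illustrative):
from sklearn.dummy import DummyClassifier
# Always predicts the most frequent class seen in the training data ('ham')
baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, baseline_model.predict(X_test)))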
Train a naïve Bayes classifier:
One of the most common - and successful - classifiers is naïve Bayes.
from sklearn.naive_bayes import MultinomialNB
nb_model = MultinomialNB()
nb_model.fit(X_train,y_train)
predictions_nb = nb_model.predict(X_test)
print(metrics.confusion_matrix(y_test, predictions_nb))
The total number of misclassifications dropped from **287** (241+46) to **256** (246+10).
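One way to verify that arithmetic is to sum the off-diagonal entries of each confusion matrix; a quick sketch:
# Everything off the diagonal is a misclassification
conf_lr = metrics.confusion_matrix(y_test, predictions)
conf_nb = metrics.confusion_matrix(y_test, predictions_nb)
print(conf_lr.sum() - np.trace(conf_lr))   # 287
print(conf_nb.sum() - np.trace(conf_nb))   # 256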
print(metrics.classification_report(y_test,predictions_nb))
print(metrics.accuracy_score(y_test, predictions_nb))
Train a support vector machine (SVM) classifier
Among the SVM options available, we'll use C-Support Vector Classification (SVC).
from sklearn.svm import SVC
svc_model = SVC(gamma='auto')
svc_model.fit(X_train,y_train)
predictions_svm = svc_model.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions_svm))
print(metrics.classification_report(y_test,predictions_svm))
print(metrics.accuracy_score(y_test,predictions_svm))
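To wrap up, the three test accuracies we already computed can be collected side by side; a small optional sketch:
# Recap of the three classifiers on the same held-out test set
pd.DataFrame({
    'model': ['LogisticRegression', 'MultinomialNB', 'SVC'],
    'test accuracy': [metrics.accuracy_score(y_test, predictions),
                      metrics.accuracy_score(y_test, predictions_nb),
                      metrics.accuracy_score(y_test, predictions_svm)]
})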