In this section, we will do the following:

  • Read in a collection of documents - a corpus
  • Transform text into numerical vector data using a pipeline
  • Create a classifier
  • Fit/train the classifier
  • Test the classifier on new data
  • Evaluate performance

For this project we'll use the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/

In this exercise we'll try to develop a classification model as we did for the SMSSpamCollection dataset - that is, we'll try to predict the Positive/Negative labels based on text content alone. In an upcoming section we'll apply Sentiment Analysis to train models that have a deeper understanding of each review.

Perform imports and load the dataset

The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has been preprocessed as a tab-delimited file.

import numpy as np
import pandas as pd

df = pd.read_csv('data_files/moviereviews.tsv', sep = '\t')
df.head()
  label                                             review
0   neg  how do films like mouse hunt get into theatres...
1   neg  some talented actresses are blessed with a dem...
2   pos  this has been an extraordinary year for austra...
3   pos  according to hollywood movies made in last few...
4   neg  my first press screening of 1998 and already i...
len(df)
2000

First review

print(df['review'][0])
how do films like mouse hunt get into theatres ? 
isn't there a law or something ? 
this diabolical load of claptrap from steven speilberg's dreamworks studio is hollywood family fare at its deadly worst . 
mouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . 
writer adam rifkin and director gore verbinski are the names chiefly responsible for this swill . 
the plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . 
deciding to check out the long-abandoned house , they soon learn that it's worth a fortune and set about selling it in auction to the highest bidder . 
but battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . 
the story alternates between unfunny scenes of the brothers bickering over what to do with their inheritance and endless action sequences as the two take on their increasingly determined furry foe . 
whatever promise the film starts with soon deteriorates into boring dialogue , terrible overacting , and increasingly uninspired slapstick that becomes all sound and fury , signifying nothing . 
the script becomes so unspeakably bad that the best line poor lee evens can utter after another run in with the rodent is : " i hate that mouse " . 
oh cringe ! 
this is home alone all over again , and ten times worse . 
one touching scene early on is worth mentioning . 
we follow the mouse through a maze of walls and pipes until he arrives at his makeshift abode somewhere in a wall . 
he jumps into a tiny bed , pulls up a makeshift sheet and snuggles up to sleep , seemingly happy and just wanting to be left alone . 
it's a magical little moment in an otherwise soulless film . 
a message to speilberg : if you want dreamworks to be associated with some kind of artistic credibility , then either give all concerned in mouse hunt a swift kick up the arse or hire yourself some decent writers and directors . 
this kind of rubbish will just not do at all . 

Check for missing values:

We have intentionally included records with missing data. Some have NaN values, others have short strings composed of only spaces. This might happen if a reviewer declined to provide a comment with their review. We will show two ways using pandas to identify and remove records containing empty data.

Detect and remove NaN values

df.isnull().sum()
label      0
review    35
dtype: int64

35 records show NaN. NaN stands for "not a number"; pandas uses it to mark missing values and treats it the same way as None when checking for nulls. These records are easily removed using the .dropna() DataFrame method.

CAUTION: By setting inplace=True, we permanently affect the DataFrame currently in memory, and this can't be undone. However, it does *not* affect the original source data. If we needed to, we could always load the original DataFrame from scratch.
df.dropna(inplace = True)
len(df)
1965
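
As a minimal sketch of the same idea on a toy DataFrame (the data here is hypothetical, not taken from the movie review file), the following shows that .isnull() counts both NaN and None as missing, and that .dropna() without inplace=True returns a new DataFrame and leaves the original untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one NaN and one None in the review column
toy = pd.DataFrame({
    'label': ['neg', 'pos', 'neg', 'pos'],
    'review': ['awful', np.nan, None, 'great'],
})

missing_per_column = toy.isnull().sum()   # NaN and None are both counted as missing
cleaned = toy.dropna()                    # returns a new frame; toy itself is unchanged

print(missing_per_column)
print(len(toy), len(cleaned))
```

Assigning the result to a new name, as done here, is one way to keep the uncleaned frame around for comparison.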

Detect & remove empty strings

Technically, we're dealing with "whitespace only" strings. If the original .tsv file had contained empty strings, pandas .read_csv() would have assigned NaN values to those cells by default.
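
The difference between the two cases can be illustrated with a small in-memory file. This is a sketch using a made-up three-row TSV string (an assumption for illustration only), read with the default .read_csv() settings:

```python
import io
import pandas as pd

# Hypothetical TSV content: a normal review, an empty field, and a spaces-only field
raw = "label\treview\nneg\tbad film\npos\t\nneg\t   \n"
toy = pd.read_csv(io.StringIO(raw), sep='\t')

n_nan = toy['review'].isnull().sum()   # the empty field is read as NaN
last_value = toy['review'][2]          # the spaces-only field is kept as a string

print(n_nan)
print(repr(last_value))
```

Only the truly empty field becomes NaN, which is why the whitespace-only reviews need a separate check.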

One way to detect these strings is to iterate over each row in the DataFrame. The .itertuples() pandas method is a good tool for this as it provides access to every field. For brevity we'll assign the names i, lb and rv to the index, label and review columns.

blanks = []

for i,lb,rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():   # whitespace-only review
        blanks.append(i)
print(blanks)
[57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]
df['review'][147]
'  '

Next we'll pass our list of index numbers to the .drop() method, and set inplace=True to make the change permanent.

df.drop(blanks, inplace = True)
len(df)
1938
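
The same cleanup can also be sketched without an explicit loop, using the vectorized .str.isspace() string method. The toy frame below is hypothetical, and the sketch assumes NaN rows have already been dropped (for missing values .str.isspace() does not return a plain True/False):

```python
import pandas as pd

# Hypothetical toy data with one whitespace-only review
toy = pd.DataFrame({
    'label': ['neg', 'pos', 'neg'],
    'review': ['dull', '   ', 'tedious'],
}, index=[10, 11, 12])

is_blank = toy['review'].str.isspace()   # boolean Series, True for whitespace-only text
blanks = toy[is_blank].index.tolist()    # index labels of the blank rows
toy_clean = toy[~is_blank]               # keep everything that is not blank

print(blanks)
print(len(toy_clean))
```

Filtering with a boolean mask builds a new DataFrame, so no inplace=True is needed.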
 
df['label'].value_counts()
neg    969
pos    969
Name: label, dtype: int64

Model building

Split the data into training and test sets.

from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
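
To see how test_size and random_state behave, here is a small sketch on hypothetical toy data. It also passes the optional stratify argument, which the split above does not use; stratify asks train_test_split to keep the class proportions the same in both subsets:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 8 documents, 4 per class
X_toy = [f"review {n}" for n in range(8)]
y_toy = ['neg', 'pos'] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,     # hold out a quarter of the rows
    random_state=42,    # fixed seed so the shuffle is repeatable
    stratify=y_toy,     # optional: preserve the neg/pos ratio in each subset
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```

Without stratify the test set can end up slightly unbalanced, which is worth keeping in mind when reading per-class support counts in a classification report.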
 

The pipeline performs the following actions:

Vectorize the text >> fit (train) the classifier >> predict on new text

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

Building pipeline for Naive Bayes

text_clf_nb = Pipeline([('tfidf',TfidfVectorizer()),
                       ('clf', MultinomialNB())])

Building pipeline for SVM

text_clf_lsvc = Pipeline([('tfidf',TfidfVectorizer()),
                       ('clf', LinearSVC())])
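
To make the steps inside a pipeline concrete, the sketch below does the vectorizing and fitting by hand and then repeats it with a Pipeline. The tiny corpus and its labels are made up for illustration, and MultinomialNB is used for both versions so the comparison is deterministic:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus
train_docs = ["a wonderful , moving film", "great acting and a great script",
              "a dull and boring mess", "terrible plot , awful acting"]
train_labels = ['pos', 'pos', 'neg', 'neg']
new_docs = ["great film", "boring and awful"]

# Manual version: vectorize, fit, then transform new text before predicting
vectorizer = TfidfVectorizer()
X_vec = vectorizer.fit_transform(train_docs)
clf = MultinomialNB().fit(X_vec, train_labels)
manual_pred = clf.predict(vectorizer.transform(new_docs))

# Pipeline version: the same steps behind a single fit/predict interface
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(new_docs)

print(list(manual_pred) == list(pipe_pred))
```

The main convenience is that raw text can be passed straight to .fit() and .predict(), with no risk of forgetting to transform new data with the fitted vectorizer.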

Naive Bayes model

text_clf_nb.fit(X_train, y_train)
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

Results from Naive Bayes Model

predictions_nb= text_clf_nb.predict(X_test)
from sklearn import metrics
print("Confusion Matrix \n\n")
print(metrics.confusion_matrix(y_test, predictions_nb))
print("\n\nClassification Report \n\n")
print(metrics.classification_report(y_test, predictions_nb))
print("\n\nAccuracy\n")
print(metrics.accuracy_score(y_test, predictions_nb))
Confusion Matrix 


[[287  21]
 [130 202]]


Classification Report 


              precision    recall  f1-score   support

         neg       0.69      0.93      0.79       308
         pos       0.91      0.61      0.73       332

    accuracy                           0.76       640
   macro avg       0.80      0.77      0.76       640
weighted avg       0.80      0.76      0.76       640



Accuracy

0.7640625
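
The figures in the classification report can be recomputed directly from the confusion matrix. The sketch below copies the matrix printed above and assumes scikit-learn's usual layout (rows are true labels, columns are predicted labels, in the order neg, pos):

```python
import numpy as np

# Confusion matrix from the Naive Bayes run above
# rows = true label, columns = predicted label, order: neg, pos
cm = np.array([[287, 21],
               [130, 202]])

precision = cm.diagonal() / cm.sum(axis=0)   # correct / everything predicted as that class
recall = cm.diagonal() / cm.sum(axis=1)      # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = cm.trace() / cm.sum()             # all correct predictions / all predictions

for name, p, r, f in zip(['neg', 'pos'], precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy}")
```

Read this way, the low recall for pos comes from the 130 positive reviews that the model labeled as negative.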

Linear SVC model

text_clf_lsvc.fit(X_train, y_train)
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
predictions_lsvc= text_clf_lsvc.predict(X_test)
from sklearn import metrics
print("Confusion Matrix \n\n")
print(metrics.confusion_matrix(y_test, predictions_lsvc))
print("\n\nClassification Report \n\n")
print(metrics.classification_report(y_test, predictions_lsvc))
print("\n\nAccuracy\n")
print(metrics.accuracy_score(y_test, predictions_lsvc))
Confusion Matrix 


[[259  49]
 [ 49 283]]


Classification Report 


              precision    recall  f1-score   support

         neg       0.84      0.84      0.84       308
         pos       0.85      0.85      0.85       332

    accuracy                           0.85       640
   macro avg       0.85      0.85      0.85       640
weighted avg       0.85      0.85      0.85       640



Accuracy

0.846875

Based on text alone we correctly classified reviews as positive or negative 84.7% of the time.

Advanced Topic - Adding Stopwords to CountVectorizer

By default, CountVectorizer and TfidfVectorizer do not filter stopwords. However, they offer some optional settings, including passing in your own stopword list.

CAUTION: There are some [known issues](http://aclweb.org/anthology/W18-2502) using Scikit-learn's built-in stopwords list. Some words that are filtered may in fact aid in classification. In this section we'll pass in our own stopword list, so that we know exactly what's being filtered.

The CountVectorizer class accepts the following arguments:

CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

TfidfVectorizer supports the same arguments and more. Under stop_words we have the following options:

> stop_words : string {'english'}, list, or None (default)

That is, we can run TfidfVectorizer(stop_words='english') to accept scikit-learn's built-in list, or TfidfVectorizer(stop_words=['a', 'and', 'the']) to filter these three words. In practice we would assign our list to a variable and pass that in instead.
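
As a small illustration of the list option, the sketch below fits two vectorizers on a made-up pair of sentences and compares the learned vocabularies through the vocabulary_ attribute. The variable names and sample text are assumptions for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents
docs = ["the plot and the acting", "a script and a score"]

custom_stopwords = ['a', 'and', 'the']
vec_default = TfidfVectorizer().fit(docs)
vec_custom = TfidfVectorizer(stop_words=custom_stopwords).fit(docs)

vocab_default = sorted(vec_default.vocabulary_)   # terms kept with no stopword filter
vocab_custom = sorted(vec_custom.vocabulary_)     # terms kept after filtering our list

print(vocab_default)
print(vocab_custom)
```

Note that the default token_pattern only keeps tokens of two or more characters, so a single-letter word such as 'a' is dropped even without a stopword list.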

Scikit-learn's built-in list contains 318 stopwords:

from sklearn.feature_extraction import text
print(sorted(text.ENGLISH_STOP_WORDS))  # the built-in collection is a frozenset; sort it for display
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 
'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']
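
A quick membership check shows that several negation words appear in the built-in list printed above. The candidate words are picked by hand for this sketch; whether filtering them actually hurts a polarity classifier is an assumption that would need to be measured:

```python
from sklearn.feature_extraction import text

# Hand-picked negation words to look up (all appear in the printed list above)
candidates = ['not', 'never', 'nothing', 'no']
in_builtin = [w for w in candidates if w in text.ENGLISH_STOP_WORDS]

print(in_builtin)
```

With stop_words='english', a phrase like "not good" would reach the classifier as just "good".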

However, there are words in this list that may influence a classification of movie reviews. With this in mind, let's define a much shorter list of our own, with just 60 words:

stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']
 
text_clf_lsvc2 = Pipeline([('tfidf',TfidfVectorizer(stop_words = stopwords)),
                       ('clf', LinearSVC())])
text_clf_lsvc2.fit(X_train, y_train)
Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',
                                             'as', 'at', 'be', 'been', 'but',
                                             'by', 'can', 'even', 'ever', 'for',
                                             'from', 'get', 'had', 'has',
                                             'have', 'he', 'her', 'hers', 'his',
                                             'how', 'i', 'if', 'in', 'into',
                                             'is', ...])),
                ('clf', LinearSVC())])
predictions_lsvc2= text_clf_lsvc2.predict(X_test)
from sklearn import metrics
print("Confusion Matrix \n\n")
print(metrics.confusion_matrix(y_test, predictions_lsvc2))
print("\n\nClassification Report \n\n")
print(metrics.classification_report(y_test, predictions_lsvc2))
print("\n\nAccuracy\n")
print(metrics.accuracy_score(y_test, predictions_lsvc2))
Confusion Matrix 


[[256  52]
 [ 48 284]]


Classification Report 


              precision    recall  f1-score   support

         neg       0.84      0.83      0.84       308
         pos       0.85      0.86      0.85       332

    accuracy                           0.84       640
   macro avg       0.84      0.84      0.84       640
weighted avg       0.84      0.84      0.84       640



Accuracy

0.84375

Accuracy was essentially unchanged: we went from 84.7% without filtering stopwords to 84.4% after adding a stopword filter to our pipeline, a difference of just two reviews out of 640 (542 vs. 540 classified correctly). Keep in mind that 2000 movie reviews is a relatively small dataset. The more reliable gain from stripping stopwords is a smaller vocabulary and faster processing; depending on the size of the corpus, it might save hours.

Testing the model

myreview = "A movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking."
myreview = "useless movies"  # overwrites the review above; the predictions below are for this string
print(text_clf_nb.predict([myreview]))  # be sure to put "myreview" inside square brackets
['neg']
print(text_clf_lsvc.predict([myreview]))
['neg']
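
The square brackets matter because the vectorizer expects an iterable of documents, not a single string. The sketch below uses a small stand-in pipeline trained on made-up text (not the movie review models above) to show a batch prediction and the ValueError raised for a bare string:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in classifier trained on four tiny documents
toy_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
toy_clf.fit(["great fun", "truly wonderful", "boring mess", "awful and useless"],
            ['pos', 'pos', 'neg', 'neg'])

# Several reviews can be classified in one call by passing a list
reviews = ["useless movies", "wonderful fun"]
batch_pred = toy_clf.predict(reviews)
print(batch_pred)

# A bare string is rejected by the vectorizer
try:
    toy_clf.predict("useless movies")
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

Wrapping a single review in a list, as in predict([myreview]), is simply the one-document case of the batch call.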