4.2 Text Classification using TF-IDF
This post explains how to classify movie reviews as positive or negative using TfidfVectorizer and a classification model. It also shows how to build pipelines that tie the steps together and generate results.
- Perform imports and load the dataset
- Check for missing values
- Model building
- Advanced Topic - Adding Stopwords to CountVectorizer
- Testing the model
In this section, we will do the following:
- Read in a collection of documents - a corpus
- Transform text into numerical vector data using a pipeline
- Create a classifier
- Fit/train the classifier
- Test the classifier on new data
- Evaluate performance
For this project we'll use the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/
In this exercise we'll try to develop a classification model as we did for the SMSSpamCollection dataset - that is, we'll try to predict the Positive/Negative labels based on text content alone. In an upcoming section we'll apply Sentiment Analysis to train models that have a deeper understanding of each review.
import numpy as np
import pandas as pd
df = pd.read_csv('data_files/moviereviews.tsv', sep='\t')
df.head()
len(df)
Let's look at the first review:
print(df['review'][0])
Check for missing values:
We have intentionally included records with missing data. Some have NaN values, others have short strings composed of only spaces. This might happen if a reviewer declined to provide a comment with their review. We will show two ways using pandas to identify and remove records containing empty data.
- NaN records are efficiently handled with .isnull() and .dropna()
- Strings that contain only whitespace can be handled with .isspace(), .itertuples(), and .drop()
df.isnull().sum()
35 records show NaN (this stands for "not a number" and is equivalent to None). These are easily removed with the pandas .dropna() method.
df.dropna(inplace=True)
len(df)
Detect & remove empty strings
Technically, we're dealing with "whitespace only" strings. If the original .tsv file had contained empty strings, pandas .read_csv() would have assigned NaN values to those cells by default.
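To see this default behavior in isolation, here is a minimal sketch (not part of the original notebook) that reads a tiny in-memory TSV in which one review field is left empty; the names sample and demo are purely illustrative:
from io import StringIO
import pandas as pd

sample = "label\treview\npos\tgreat film\nneg\t"   # the last row has an empty review field
demo = pd.read_csv(StringIO(sample), sep='\t')

print(demo)                  # the empty cell is read in as NaN
print(demo.isnull().sum())   # the review column reports one missing value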
In order to detect these strings we need to iterate over each row in the DataFrame. The pandas .itertuples() method is a good tool for this as it provides access to every field. For brevity we'll assign the names i, lb and rv to the index, label and review columns.
blanks = []  # start with an empty list to collect the index values of whitespace-only reviews

for i, lb, rv in df.itertuples():  # iterate over the DataFrame
    if type(rv) == str:            # avoid NaN values
        if rv.isspace():           # test 'review' for whitespace-only content
            blanks.append(i)       # add matching index numbers to the list

print(blanks)
df['review'][147]
Next we'll pass our list of index numbers to the .drop() method, and set inplace=True to make the change permanent.
df.drop(blanks, inplace=True)
len(df)
df['label'].value_counts()
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
The pipeline performs the following steps: vectorize the text data with TF-IDF, then fit the classifier on the resulting vectors.
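For intuition, here is a minimal sketch (not part of the original walkthrough) of what such a pipeline does internally, with the vectorize and fit steps performed by hand; it assumes the X_train/X_test split created above:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vect = TfidfVectorizer()
X_train_tfidf = vect.fit_transform(X_train)        # learn the vocabulary/idf weights and vectorize the training text
clf = MultinomialNB().fit(X_train_tfidf, y_train)  # train the classifier on those vectors

X_test_tfidf = vect.transform(X_test)              # reuse the same vocabulary for the test set
manual_predictions = clf.predict(X_test_tfidf)
A Pipeline simply chains these steps so a single .fit() and .predict() call handles both the vectorizer and the classifier.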
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
Building pipeline for Naive Bayes
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', MultinomialNB())])
Building pipeline for Linear SVC
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                          ('clf', LinearSVC())])
text_clf_nb.fit(X_train, y_train)
Results from Naive Bayes Model
predictions_nb = text_clf_nb.predict(X_test)
from sklearn import metrics
print("Confusion Matrix \n\n")
print(metrics.confusion_matrix(y_test, predictions_nb))
print("\n\nClassification Report \n\n")
print(metrics.classification_report(y_test, predictions_nb))
print("\n\nAccuracy\n")
print(metrics.accuracy_score(y_test, predictions_nb))
Results from Linear SVC Model
text_clf_lsvc.fit(X_train, y_train)
predictions_lsvc = text_clf_lsvc.predict(X_test)
from sklearn import metrics
print("Confusion Matrix \n\n")
print(metrics.confusion_matrix(y_test, predictions_lsvc))
print("\n\nClassification Report \n\n")
print(metrics.classification_report(y_test, predictions_lsvc))
print("\n\nAccuracy\n")
print(metrics.accuracy_score(y_test, predictions_lsvc))
Based on text alone we correctly classified reviews as positive or negative 84.7% of the time.
Advanced Topic - Adding Stopwords to CountVectorizer
By default, CountVectorizer and TfidfVectorizer do not filter stopwords. However, they offer some optional settings, including passing in your own stopword list.
The CountVectorizer class accepts the following arguments:
CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
TfidfVectorizer supports the same arguments and more. Under stop_words we have the following options:
> stop_words : string {'english'}, list, or None (default)
That is, we can run TfidfVectorizer(stop_words='english') to accept scikit-learn's built-in list, or TfidfVectorizer(stop_words=['a', 'and', 'the']) to filter these three words. In practice we would assign our list to a variable and pass that in instead.
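As a quick illustration on a toy corpus (the corpus and variable names here are made up for demonstration), both forms can be passed straight to the vectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["a good movie and a great cast", "the plot was a mess"]

vect_builtin = TfidfVectorizer(stop_words='english')            # scikit-learn's built-in list
vect_custom = TfidfVectorizer(stop_words=['a', 'and', 'the'])   # our own three-word list

print(sorted(vect_builtin.fit(corpus).vocabulary_))   # fitted vocabulary with the built-in list applied
print(sorted(vect_custom.fit(corpus).vocabulary_))    # fitted vocabulary with the custom list applied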
Scikit-learn's built-in list contains 318 stopwords:
from sklearn.feature_extraction import text
print(text.ENGLISH_STOP_WORDS)

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']
However, there are words in this list that may influence a classification of movie reviews. With this in mind, let's trim the list to just 60 words:
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']
text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                           ('clf', LinearSVC())])
text_clf_lsvc2.fit(X_train, y_train)
predictions_lsvc2 = text_clf_lsvc2.predict(X_test)
from sklearn import metrics
print("Confusion Matrix \n\n")
print(metrics.confusion_matrix(y_test, predictions_lsvc2))
print("\n\nClassification Report \n\n")
print(metrics.classification_report(y_test, predictions_lsvc2))
print("\n\nAccuracy\n")
print(metrics.accuracy_score(y_test, predictions_lsvc2))
We went from 84.7% without filtering stopwords to 84.4% after adding a stopword filter to our pipeline. Keep in mind that 2000 movie reviews is a relatively small dataset. The real gain from stripping stopwords is improved processing speed; depending on the size of the corpus, it might save hours.
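One way to see the effect of the filter (a small sketch using scikit-learn's named_steps accessor on the two pipelines fitted above) is to compare the vocabulary sizes learned by the two vectorizers:
vect_full = text_clf_lsvc.named_steps['tfidf']    # pipeline without stopword filtering
vect_trim = text_clf_lsvc2.named_steps['tfidf']   # pipeline with the 60-word list

print(len(vect_full.vocabulary_), len(vect_trim.vocabulary_))   # the filtered vocabulary should be slightly smaller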
myreview = "A movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking."
myreview = "useless movies"
print(text_clf_nb.predict([myreview])) # be sure to put "myreview" inside square brackets
print(text_clf_lsvc.predict([myreview]))
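As a convenience, we could wrap both trained pipelines in a small helper (a sketch, not part of the original walkthrough) and try additional reviews:
def classify(review_text):
    """Print the label predicted by each trained pipeline for one review string."""
    print('Naive Bayes:', text_clf_nb.predict([review_text])[0])
    print('Linear SVC :', text_clf_lsvc.predict([review_text])[0])

classify("The cast was wonderful and the story kept me hooked until the end.")
classify("useless movies")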