by Alan Gatt

The data used in this demonstration was obtained using the techniques from the previous web series. The file used in this article can be downloaded from here. The previous article can be found here.

The main aim of this article is to “guess” the review score from the review text. This will be done by training a machine learning model and then using this model to predict the score.

Loading the Data

First, we load the data, keep only the English reviews, and select the Score and Review columns, which are enough for our purpose.

import pandas as pd

data = pd.read_csv('reviews.csv', encoding='utf-16') #load file
data = data[data['Language'] == 'en'] #keep English reviews
data = data[['Score', 'Review']].copy() #select Score and Review columns (copy avoids SettingWithCopyWarning)
data['Score'] = data['Score'].astype(int) #convert Score to integer

Trying to predict the actual score (10, 20, 30, 40 or 50) can be difficult, since reviews with adjacent scores can read very similarly. So, we will group the scores into three categories:

  • 10, 20 will be changed to Negative
  • 30 will be changed to Neutral
  • 40, 50 will be changed to Positive

#function to create labels
def create_label(row):
    if row['Score'] < 30:
        return 'Negative'
    elif row['Score'] > 30:
        return 'Positive'
    else:
        return 'Neutral'

#create the labels
data['Label'] = data.apply(create_label, axis=1)

#show label distribution
data['Label'].value_counts()
Positive    235
Negative    128
Neutral      42
Name: Label, dtype: int64
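
As an alternative to the row-by-row function above, the same mapping can be sketched with pandas' built-in binning. This is only an illustration on toy scores (the `scores` series below is made up, not the article's data), and it assumes scores only take the values 10 to 50:

```python
import pandas as pd

# Toy scores standing in for the article's Score column (illustrative data).
scores = pd.Series([10, 20, 30, 40, 50])

# Bins are right-inclusive: (0, 20] -> Negative, (20, 30] -> Neutral, (30, 50] -> Positive.
labels = pd.cut(scores, bins=[0, 20, 30, 50],
                labels=['Negative', 'Neutral', 'Positive'])

print(labels.tolist())
```

On a large data set this vectorized form is usually faster than apply with axis=1, but both produce the same three categories.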

Cleaning the Review Text

The next task is to apply some cleaning to the text in order to reduce inconsistencies and variety. The following tasks are performed:

  • Keep only letters from a to z
  • Change all letters to lowercase
  • Split all reviews into individual words
  • Stem each word (reduce it to its root form)
  • Filter out any stop words
  • Combine the words back into a “sentence”

During this process a corpus will be created that will store all the reviews combined together.

import re
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

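corpus = []

ps = PorterStemmer() #create the stemmer once
stop_words = set(stopwords.words('english')) #build the stop word set once

for text in data['Review']:
    review = re.sub('[^a-zA-Z]', ' ', text) #keep only letters
    review = review.lower() #change to lowercase
    review = review.split() #split into individual words
    review = [ps.stem(word) for word in review if word not in stop_words] #drop stop words, stem the rest
    review = ' '.join(review) #combine back into a "sentence"
    corpus.append(review)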
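
To see what the cleaning steps do to a single review, here is a minimal, dependency-free sketch. It is a simplification: the `STOP_WORDS` set and the `clean_review` helper are hypothetical names introduced for illustration, the stop word list is tiny compared to NLTK's, and stemming is left out.

```python
import re

# Hypothetical, tiny stop word list for illustration only;
# the article uses NLTK's full English list.
STOP_WORDS = {"the", "was", "and", "a", "it", "but"}

def clean_review(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split, drop stop words, re-join."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)   # digits and punctuation become spaces
    words = letters_only.lower().split()            # lowercase, then split on whitespace
    kept = [w for w in words if w not in stop_words]
    return ' '.join(kept)

sample = "The pizza was GREAT, and it arrived in 20 minutes!"
cleaned = clean_review(sample)
print(cleaned)  # pizza great arrived in minutes
```

Note that the number 20 disappears entirely, which is a side effect of keeping only letters.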

Create a Word Matrix

CountVectorizer will count how often each word occurs throughout the reviews. So for instance it will count how many times the word pizza appears. The parameter max_features will limit this dictionary to the 2000 most frequent words. This number can be changed to obtain better results.

The next step is to convert this count into an array for each review. Every review will have 2000 columns (one for each word), and each column holds the number of times that word appears in that particular review, or 0 if it is absent. If a strictly binary matrix (1 for present, 0 for absent) is preferred, CountVectorizer accepts binary=True.

Finally, another array holding only the labels (Negative, Neutral, and Positive) will be created.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2000) #select top 2000 words

X = cv.fit_transform(corpus).toarray() #convert the sparse word counts to a dense array

y = data['Label'].values #store the label for each review
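
To make the shape of the word matrix concrete, the following sketch runs CountVectorizer on a three-review toy corpus (made-up data, not the article's reviews). It shows that the default output holds counts, and that passing binary=True caps every entry at 1:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned reviews (illustrative data).
toy_corpus = ["pizza good pizza", "service slow", "good service"]

cv_counts = CountVectorizer()
X_counts = cv_counts.fit_transform(toy_corpus).toarray()

cv_binary = CountVectorizer(binary=True)
X_binary = cv_binary.fit_transform(toy_corpus).toarray()

# Column order follows the vocabulary indices (alphabetical).
vocab = sorted(cv_counts.vocabulary_, key=cv_counts.vocabulary_.get)

print(vocab)      # ['good', 'pizza', 'service', 'slow']
print(X_counts)   # first row is [1 2 0 0]: "pizza" appears twice
print(X_binary)   # first row is [1 1 0 0]
```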

Training a Model

In this case we have a classification task, because depending on various words found in a review, we are trying to allocate a label. So we must train a classification model. We will be using the scikit-learn package and there are various models that can be used as you can see here.

For this example, we will be using just two models selected randomly, and we will not tweak any parameters, in order to keep things simple. Parameters can be tweaked to obtain better results.

When training a model the data set must be split into a training and a test set. The training data set will be used to train the model; this will have all the review words and labels. The test set is data that the trained model did not “see” and by looking at the review words it will try to allocate a label by using the knowledge obtained from the training. In this case we will split the data 75% for training and 25% for test.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
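
Because the split above is random, scores will vary from run to run, and the small Neutral class may end up unevenly represented. One possible variation, sketched here on toy labels that only mimic the class counts reported earlier (the `X_toy` features are placeholders), is to fix random_state and pass stratify so that each set keeps roughly the same label proportions:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with the same class counts as the article's data set.
y_toy = ['Positive'] * 235 + ['Negative'] * 128 + ['Neutral'] * 42
X_toy = [[i] for i in range(len(y_toy))]  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,
    random_state=42,   # any fixed integer makes the split repeatable
    stratify=y_toy,    # keep label proportions similar in both sets
)

print(len(y_tr), len(y_te))  # 303 102
print(Counter(y_te))
```

The value 42 is an arbitrary choice; what matters is that it stays the same between runs.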

For this example we will use a RandomForestClassifier and an SVC. As stated before, no parameters will be changed.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

#create the models
model_random = RandomForestClassifier()
model_svc = SVC()

#train the models: X_train contains the review words, y_train the labels
model_random.fit(X_train, y_train)
model_svc.fit(X_train, y_train)

#predict the labels on the test data set (note: this does not include the actual score labels).
y_pred_random = model_random.predict(X_test)
y_pred_svc = model_svc.predict(X_test)

Confusion Matrix

In order to evaluate how a model has performed we can create a confusion matrix, which shows, for each actual label, how many items were assigned to each predicted label.

By observing the confusion matrix, one can see how many reviews were correctly classified. So, if we take the first matrix, one can observe that of the 35 negative reviews, 19 were predicted correctly and 16 were incorrectly labeled as positive.

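from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

#plot_confusion_matrix was removed in scikit-learn 1.2;
#ConfusionMatrixDisplay.from_estimator (scikit-learn 1.0+) replaces it
ConfusionMatrixDisplay.from_estimator(model_random, X_test, y_test)
plt.show()

ConfusionMatrixDisplay.from_estimator(model_svc, X_test, y_test)
plt.show()

[Figure: Confusion Matrix for Random Forest]
[Figure: Confusion Matrix for SVC]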
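
To read a confusion matrix without a plot, the raw numbers can also be computed directly. The sketch below uses six made-up labels (not the article's predictions) to show the layout: rows are the actual labels, columns are the predicted labels, in the order given by the labels argument.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = ['Negative', 'Negative', 'Negative', 'Positive', 'Positive', 'Neutral']
y_hat  = ['Negative', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive']

label_order = ['Negative', 'Neutral', 'Positive']
cm = confusion_matrix(y_true, y_hat, labels=label_order)

print(cm)
# Row 0 (actual Negative): 2 predicted Negative, 0 Neutral, 1 Positive
# Row 1 (actual Neutral):  0 predicted Negative, 0 Neutral, 1 Positive
# Row 2 (actual Positive): 0 predicted Negative, 0 Neutral, 2 Positive
```

The diagonal holds the correctly classified reviews; everything off the diagonal is a misclassification.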

Classification Scoring

Rather than just looking at confusion matrices, one can also calculate the F1 score:

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. Source

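From the scores below, we can observe that the SVC performs slightly better than the Random Forest. Note that with average="micro" on a single-label task like this one, the F1 score equals plain accuracy, so the gap here amounts to a single review in the test set, and it can change from run to run because the split is random.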

from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred_random, average="micro"))
print(f1_score(y_test, y_pred_svc, average="micro"))
0.7156862745098039
0.7254901960784313
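
The choice of average matters when classes are imbalanced. As a small illustration on made-up labels (not the article's predictions), the sketch below shows that the micro-averaged F1 matches accuracy, while the macro average, which weights every class equally, drops when a small class such as Neutral is never predicted correctly:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration only.
y_true = ['Negative', 'Negative', 'Negative', 'Positive', 'Positive', 'Neutral']
y_hat  = ['Negative', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive']

acc = accuracy_score(y_true, y_hat)
micro = f1_score(y_true, y_hat, average='micro')
# zero_division=0 scores a class with no predictions as 0 instead of warning.
macro = f1_score(y_true, y_hat, average='macro', zero_division=0)

print(acc, micro, macro)
```

Here 4 of 6 labels are correct, so accuracy and micro F1 are both about 0.67, while macro F1 is lower because the Neutral class contributes a score of 0.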

Conclusion

This was just an overview of how classification works. In general, results can be improved as follows:

  • Change the data split for training and testing. Be careful of under-fitting or over-fitting.
  • Use different classification algorithms.
  • Tweak parameters for each algorithm.
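
As a rough sketch of the last point, scikit-learn's GridSearchCV can try several parameter combinations with cross-validation. The data below is synthetic (generated with make_classification, not the review data), and the parameter grid is an arbitrary example rather than a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic three-class data standing in for the word matrix and labels.
X_syn, y_syn = make_classification(
    n_samples=120, n_features=20, n_informative=5,
    n_classes=3, random_state=0)

# Arbitrary example grid; real values should be chosen for the data at hand.
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

search = GridSearchCV(SVC(), param_grid, cv=3, scoring='f1_micro')
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)
```

Because the search scores each combination on held-out folds, it also gives some protection against picking parameters that merely over-fit the training data.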