by Alan Gatt
The data used in this demonstration was obtained using the techniques from the previous web series. The file used in this article can be downloaded from here, and the previous article can be found here.
The main aim of this article is to “guess” the review score from the review text. This will be done by training a machine learning model and then using this model to predict the score.
Loading the Data
First we load the data, keep only the English reviews, and select the columns Score and Review, which are enough for our purpose.
import pandas as pd

data = pd.read_csv('reviews.csv', encoding='utf-16')  # load file
data = data[data['Language'] == 'en']                 # keep English reviews
data = data[['Score', 'Review']]                      # select Score and Review columns
data.Score = data.Score.astype(int)                   # convert Score to integer
Trying to predict the exact score (10, 20, 30, 40, or 50) can be difficult, since reviews with close scores can be quite similar. So, we will create a category instead:
- 10, 20 will be changed to Negative
- 30 will be changed to Neutral
- 40, 50 will be changed to Positive
#function to create labels
def create_label(row):
    if row['Score'] < 30:
        return 'Negative'
    elif row['Score'] > 30:
        return 'Positive'
    else:
        return 'Neutral'

#create the labels
data['Label'] = data.apply(create_label, axis=1)

#show label distribution
data['Label'].value_counts()
Positive    235
Negative    128
Neutral      42
Name: Label, dtype: int64
Cleaning the Review Text
The next task is to apply some cleaning to the text in order to reduce inconsistencies and variety. The following tasks are performed:
- Keep only letters from a to z
- Change all letters to lowercase
- Split all reviews into individual words
- Stem each word (reduce it to its root, using the Porter stemmer)
- Filter out any stop words
- Combine the words back into a “sentence”
During this process a corpus will be created that will store all the reviews combined together.
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stop-word set once, outside the loop

for i in range(0, len(data)):
    review = re.sub('[^a-zA-Z]', ' ', data.iloc[i]['Review'])  # keep only letters
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
Create a Word Matrix
CountVectorizer will count the number of words found throughout the reviews. So for instance it will count how many times the word pizza appears. The parameter max_features will limit this dictionary to the top 2000 words. This number can be changed accordingly to obtain better results.
The next step is to convert this count into an array for each review. Every review will be represented by 2000 columns (one for each word), where each column holds the number of times that word appears in that particular review (0 if it does not appear).
Finally, another array holding only the labels (Negative, Neutral, and Positive) will be created.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=2000)  #select top 2000 words
X = cv.fit_transform(corpus).toarray()   #convert to a matrix of word counts
y = data['Label'].values                 #store the label for each review
Training a Model
In this case we have a classification task: depending on the words found in a review, we are trying to assign it a label. So we must train a classification model. We will be using the scikit-learn package, which offers various models, as you can see here.
For this example, we will be using just two models, chosen somewhat arbitrarily, and we will not tweak any parameters, in order to keep things simple. Parameters can be tweaked accordingly to obtain better results.
When training a model, the data set must be split into a training set and a test set. The training set is used to train the model; it contains both the review words and the labels. The test set is data the trained model did not "see": given only the review words, the model tries to assign a label using the knowledge obtained from training. In this case we will split the data 75% for training and 25% for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
For this example we will use a Random Forest and an SVC. As stated before, no parameters will be changed.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

#create the models
model_random = RandomForestClassifier()
model_svc = SVC()

#train the models: X_train contains the review word counts, y_train the labels
model_random.fit(X_train, y_train)
model_svc.fit(X_train, y_train)

#predict the labels on the test set (the models do not see the actual labels)
y_pred_random = model_random.predict(X_test)
y_pred_svc = model_svc.predict(X_test)
In order to evaluate how a model has performed, we can create a confusion matrix, which shows how the items of each class were classified.
By observing the confusion matrix, one can see how many reviews were correctly classified. So, if we take the first matrix, one can observe that of the 35 negative reviews, 19 were predicted correctly and 16 were incorrectly labeled as positive.
from sklearn.metrics import ConfusionMatrixDisplay  #plot_confusion_matrix was removed in scikit-learn 1.2
import matplotlib.pyplot as plt

ConfusionMatrixDisplay.from_estimator(model_random, X_test, y_test)
plt.show()

ConfusionMatrixDisplay.from_estimator(model_svc, X_test, y_test)
plt.show()
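If plotting is not convenient, the same information is available as raw counts from confusion_matrix. A minimal sketch on invented labels (rows are the true classes, columns the predicted ones, in the order given by the labels argument):

```python
from sklearn.metrics import confusion_matrix

y_true = ["Negative", "Negative", "Positive", "Positive", "Neutral"]
y_pred = ["Negative", "Positive", "Positive", "Positive", "Neutral"]
labels = ["Negative", "Neutral", "Positive"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# → [[1 0 1]
#    [0 1 0]
#    [0 0 2]]
```

Here the top-right 1 means one truly Negative review was predicted as Positive, while the diagonal counts the correct predictions.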
Rather than just looking at confusion matrices, one can also calculate the F1 score.
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. Source
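Concretely, F1 = 2 * precision * recall / (precision + recall). A small worked sketch on invented binary labels, checking the formula against scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# toy binary labels, invented for illustration
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 2/3
print(2 * p * r / (p + r))           # harmonic mean of precision and recall
print(f1_score(y_true, y_pred))      # same value
```

The article's code uses average="micro" because there are three classes rather than two; micro-averaging pools the counts across all classes before computing the score.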
So from the following we can observe that the SVC performs slightly better than the Random Forest.
from sklearn.metrics import f1_score

print(f1_score(y_test, y_pred_random, average="micro"))
print(f1_score(y_test, y_pred_svc, average="micro"))
This was just an overview of how classification works. In general, results can be improved as follows:
- Change the data split between training and testing. Be careful of under- or over-fitting.
- Use different Classification algorithms.
- Tweak parameters for each algorithm.