by Alan Gatt

This article continues the series on automatically scraping information from TripAdvisor. The other articles in the series can be found here:

The data used in this demonstration was obtained using the techniques from the previous articles in the series. The file used in this article can be downloaded from here.

The main aim of this article is to start exploring the data found in the scraped reviews. Various techniques will be used to obtain insights.

Loading Required Libraries

The first step is to load all required libraries; you will need to install some of them. You can get the environment file from here.

import pandas as pd
import datetime
import matplotlib.pyplot as plt
import nltk
import unicodedata
import re
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

Review File

For this example, all reviews for a particular restaurant were downloaded into a CSV file. This data will now be loaded into a pandas dataframe.

This dataset is made up of five columns:

  • Score – the score given to the restaurant per review
  • Date – when the review was submitted
  • Title – the title for the review
  • Review – the review text
  • Language – the language for the review

By using shape one can observe that, apart from the five columns, there are 1250 records.

# encoding utf-16 is used to cater for a large variety of characters, including emojis
data = pd.read_csv("./reviews.csv", encoding='utf-16')
print(data.head()) #first 5 records
print(data.shape) #structure of dataframe (rows, columns)
 Score             Date                         Title  \
0     50  August 15, 2020  Tappa culinaria obbligatoria   
1     40  August 13, 2020                   Resto sympa   
2     50   August 9, 2020          Storie e VERI Sapori   
3     50   August 8, 2020          We love your pizzas!   
4     50   August 7, 2020                        #OSEMA   

                                              Review Language  
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  
1  Resto sympa tout comme les serveurs. Seul peti...       fr  
2  Abbiamo cenato presso questo ristorante in una...       it  
3  I went to this restaurant yesterday evening wi...       en  
4  Cena di coppia molto piacevole, staff cordiale...       it  

Cleaning the Date

The review date’s format (e.g. August 15, 2020) is not suitable for our analysis, so the first step is to re-format the date.

For this task, a function will be created. The function will read the date, identify the different parts (month, day, and year) and then extract the desired parts, in this case only the month and year.

def formatDate(val):
    return datetime.datetime.strptime(val,'%B %d, %Y').strftime('%m/%Y')

# lambda is used to apply a function to all rows in a data frame
data['Date'] = data['Date'].apply(lambda x: formatDate(x))
print (data['Date'].head(2))

# the date must be changed into a date format so that it will be easier for plotting
data['Date'] = pd.to_datetime(data['Date'])
print(data['Date'].head(2))
0    08/2020
1    08/2020
Name: Date, dtype: object
0   2020-08-01
1   2020-08-01
Name: Date, dtype: datetime64[ns]
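As a side note, the two steps above can be combined: pandas can parse the original format directly, and the day can then be snapped to the first of the month with to_period. A minimal sketch on a hand-made sample Series (the real code would use data['Date'] instead):

```python
import pandas as pd

# sample dates in the original TripAdvisor format
dates = pd.Series(["August 15, 2020", "August 9, 2020"])

# parse directly with the same format string,
# then snap every date to the first of its month
parsed = pd.to_datetime(dates, format='%B %d, %Y').dt.to_period('M').dt.to_timestamp()
print(parsed)
```

This skips the intermediate string column and goes straight to a datetime64 column, which is what the plotting code needs.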

In certain cases plotting will be done based on the year, so a new column will be created with just the year.

data['Year'] = data['Date'].dt.year
print(data.head(2))
Score       Date                         Title  \
0     50 2020-08-01  Tappa culinaria obbligatoria   
1     40 2020-08-01                   Resto sympa   

                                              Review Language  Year  
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  2020  
1  Resto sympa tout comme les serveurs. Seul peti...       fr  2020  

Score Cleanup

The score must also be divided by ten to obtain a number from 1 to 5, since right now it ranges from 10 to 50.

data['Score'] = (data['Score'] / 10).astype(int)
print(data.head(2))
 Score       Date                         Title  \
0      5 2020-08-01  Tappa culinaria obbligatoria   
1      4 2020-08-01                   Resto sympa   

                                              Review Language  Year  
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  2020  
1  Resto sympa tout comme les serveurs. Seul peti...       fr  2020  


Start Exploring

In order to get an idea of how this restaurant fares, a chart will be generated showing the total number of reviews per score.

The function value_counts() will be used to produce a count for each of the unique values, so in this case, it will count how many reviews there are per score.

sort_index() is used so that the result is shown sorted by the score.

# seaborn is a styling setting to make charts look nicer
plt.style.use('seaborn')
data['Score'].value_counts().sort_index().plot(kind='bar')
plt.title('Score Distribution')
plt.xlabel('Score')
plt.ylabel('Total Reviews')
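If the share of each score is more informative than raw counts, value_counts accepts normalize=True. A small sketch on hand-made scores (the real code would use data['Score'] instead):

```python
import pandas as pd

# illustrative scores standing in for data['Score']
scores = pd.Series([5, 5, 5, 4, 1])

# normalize=True returns the share of each score instead of raw counts
share = scores.value_counts(normalize=True).sort_index()
print(share)
```

The same .plot(kind='bar') call then produces a percentage-style distribution chart.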

One can also observe how many reviews were done in each year.

data['Year'].value_counts().sort_index().plot()
plt.title('Number of Reviews Per Year')
plt.xlabel('Year')
plt.ylabel('Reviews')

We can refine this and see the number of reviews per month.

data['Date'].value_counts().sort_index().plot.line(figsize=(10,5))
plt.title('Reviews')
plt.xlabel('Date')
plt.ylabel('Review Count')
plt.show()
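Monthly counts like these tend to be noisy; one option is to smooth the line with a rolling mean before plotting. A sketch on illustrative monthly counts (the real code would build counts from data['Date'].value_counts().sort_index()):

```python
import pandas as pd

# illustrative monthly counts standing in for
# data['Date'].value_counts().sort_index()
counts = pd.Series([10, 30, 20, 40, 30, 50],
                   index=pd.date_range('2020-01-01', periods=6, freq='MS'))

# a 3-month rolling mean smooths month-to-month noise before plotting
smoothed = counts.rolling(3).mean()
print(smoothed)
```

The first two entries are NaN because a 3-month window only becomes complete at the third month.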

Further Investigation

Another interesting observation would be the average score per year, to see how it fared over the years.

For this, we first calculate the total number of reviews, along with the average score, for each year.

# in this case we are grouping by the column Year, and for each group
# calculate how many reviews there were and what their average score is.
data_score = data.groupby("Year")[['Score']]\
    .agg(['count', 'mean'])\
    .reset_index()
print(data_score)
Year Score          
        count      mean
0  2015     6  4.000000
1  2016   219  3.716895
2  2017   269  3.579926
3  2018   287  3.494774
4  2019   397  4.622166
5  2020    72  4.736111
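The same table can also be produced with pandas' named aggregation, which yields flat column names instead of a MultiIndex. A sketch on a toy frame (toy values, not the real reviews):

```python
import pandas as pd

# toy frame standing in for the real reviews dataframe
df = pd.DataFrame({'Year': [2019, 2019, 2020],
                   'Score': [4, 5, 3]})

# each keyword becomes a flat output column: name=(source column, function)
summary = df.groupby('Year').agg(count=('Score', 'count'),
                                 mean=('Score', 'mean')).reset_index()
print(summary)
```

With flat columns, the later plotting code could simply use summary['mean'] instead of indexing through two column levels.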

The next step is to plot the year against the mean score. We will also add the overall (across all years) mean so that we can compare each year against the overall mean.

plt.plot(data_score['Year'], data_score['Score']['mean'])
plt.title('Average Score per Year')
plt.xlabel('Year')
plt.ylabel('Score')
plt.axhline(data_score["Score"]['mean'].mean(), color='green', linestyle='--')
plt.show()

Sometimes it is also interesting to analyse the length of each review. Since review length can vary a lot, we will group the reviews into three bins using the pandas function cut.

You can see that most of the reviews fall under 1100 characters.

print (pd.cut(data['Review'].str.len(),3, include_lowest=True).value_counts())
(57.855000000000004, 1109.0]    1212
(1109.0, 2157.0]                  33
(2157.0, 3205.0]                   5
Name: Review, dtype: int64
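By default, cut builds equal-width bins: it divides the range from the minimum to the maximum value into the requested number of intervals, which is why the bin edges above look irregular. A tiny sketch with round numbers:

```python
import pandas as pd

# cut splits the full range (0 to 30 here) into 3 equal-width bins of width 10
lengths = pd.Series([0, 5, 12, 25, 30])
binned = pd.cut(lengths, 3, include_lowest=True)
print(binned.value_counts().sort_index())
```

Note that equal-width bins can end up with very unequal counts, exactly as in the review data; pd.qcut would instead build bins with roughly equal counts.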

In order to obtain better insights, we will ignore the longer reviews and focus on the shorter ones.

One can see that most of the reviews are still shorter than roughly 404 characters.

shorter_reviews = data[data['Review'].str.len()<1100]
shorter_reviews = pd.cut(shorter_reviews['Review'].str.len(),3).value_counts()
print(shorter_reviews)
(59.97, 404.333]      908
(404.333, 747.667]    241
(747.667, 1091.0]      63
Name: Review, dtype: int64


We can plot the data to have a visual indication.

shorter_reviews = data[data['Review'].str.len()<1100]
# in this case labels are passed to the cut function to display proper labels
shorter_reviews = pd.cut(shorter_reviews['Review'].str.len(),3, labels=['Shortest', 'Short', 'Medium']).value_counts()
#rot=0 is used to rotate the labels in the x-axis.
shorter_reviews.plot.bar(rot=0)
plt.title('Review Character Distribution')
plt.xlabel('Review Length')
plt.ylabel('Total Reviews')

Here we can see the average score by review length.

One can observe that the longer the review, the lower the score tends to be.

# a copy of the original data is made so that the original dataframe is not modified
data2 = data.copy()

# new column Bins is added to store under which category it falls
data2['Bins'] = pd.cut(data['Review'].str.len(),6, include_lowest=True, labels=['Shortest', 'Short', 'Medium', 'Long', 'Longer','Longest'])
print(data2.head(2))

# then each bin is grouped to find its count and average score
data_score = data2.groupby("Bins")['Score'].agg(('count', 'mean'))
print(data_score)
Score       Date                         Title  \
0      5 2020-08-01  Tappa culinaria obbligatoria   
1      4 2020-08-01                   Resto sympa   

                                              Review Language  Year      Bins  
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  2020  Shortest  
1  Resto sympa tout comme les serveurs. Seul peti...       fr  2020  Shortest  
          count      mean
Bins                     
Shortest   1072  4.182836
Short       140  2.985714
Medium       29  2.172414
Long          4  2.000000
Longer        2  1.500000
Longest       3  1.333333
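This trend can also be checked numerically with the correlation between review length and score. A sketch with toy values (the real call would correlate data['Review'].str.len() with data['Score']):

```python
import pandas as pd

# toy lengths and scores mimicking the pattern above:
# longer reviews tend to come with lower scores
lengths = pd.Series([50, 120, 400, 900, 1500])
scores = pd.Series([5, 5, 3, 2, 1])

# Pearson correlation; a clearly negative value supports the observation
print(lengths.corr(scores))
```

A value close to -1 would indicate a strong negative relationship between length and score.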


The next part of the series will deal with analysing the review text.