by Alan Gatt

This article follows on from the previous article, in which we analysed reviews obtained from TripAdvisor. The other article can be found here.

The data used in this demonstration was obtained using the techniques from the previous web-scraping series. The file used in this article can be downloaded from here.

The main aim of this article is to start exploring the scraped reviews, applying a variety of text-analysis techniques to extract insights.

Loading Required Libraries

The first step is to load all the required libraries; you may need to install some of them first. You can get the environment file from here.

import pandas as pd
import datetime
import matplotlib.pyplot as plt
import nltk
import unicodedata
import re
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

Loading the Data

We will load the same data file that was used for the previous article.

#encoding utf-16 is used to cater for a large variety of characters, including emojis
data = pd.read_csv("./reviews.csv", encoding='utf-16')
print(data.head()) #first 5 records
 Score             Date                         Title  \
0     50  August 15, 2020  Tappa culinaria obbligatoria   
1     40  August 13, 2020                   Resto sympa   
2     50   August 9, 2020          Storie e VERI Sapori   
3     50   August 8, 2020          We love your pizzas!   
4     50   August 7, 2020                        #OSEMA   

                                              Review Language  
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  
1  Resto sympa tout comme les serveurs. Seul peti...       fr  
2  Abbiamo cenato presso questo ristorante in una...       it  
3  I went to this restaurant yesterday evening wi...       en  
4  Cena di coppia molto piacevole, staff cordiale...       it  


Cleaning the Data

For this example, only the score needs cleaning. The scraped scores are stored as multiples of 10 (for instance, 50 for a 5-bubble rating), so we divide by 10 to convert them to the familiar 1-5 scale.

data['Score'] = (data['Score'] / 10).astype(int)
print(data.head(2))
 Score             Date                         Title  \
0      5  August 15, 2020  Tappa culinaria obbligatoria   
1      4  August 13, 2020                   Resto sympa   

                                              Review Language  
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  
1  Resto sympa tout comme les serveurs. Seul peti...       fr 
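
As a quick sanity check (an added step, not part of the original output), you can confirm that every score now falls in the 1-5 range:

print(data['Score'].value_counts().sort_index())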


Text Analysis

In this case, the file already has a column showing the language of each review. The following is an example of what can be done if this column were not present.

The langdetect library can be used to identify the languages.

from langdetect import detect
# note: langdetect is non-deterministic by default; set DetectorFactory.seed for repeatable results
data['Detected Lang'] = data['Review'].apply(detect)
data['Detected Lang'].value_counts()
it    732
en    406
fr     37
de     24
es     21
el      6
nl      6
pl      4
ja      4
no      3
sv      2
ru      2
cs      1
pt      1
ko      1
Name: Detected Lang, dtype: int64

The following shows the language recorded with the original reviews. Comparing the two counts, langdetect has identified nearly every language correctly: only a single review differs (732 vs 733 for Italian, and 406 vs 405 for English).

data['Language'].value_counts()
it    733
en    405
fr     37
de     24
es     21
el      6
nl      6
pl      4
ja      4
no      3
sv      2
ru      2
cs      1
pt      1
ko      1
Name: Language, dtype: int64
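
To see exactly which reviews the detector disagreed on, you can compare the two columns (a small sketch, not part of the original walkthrough):

mismatches = data[data['Language'] != data['Detected Lang']]
print(mismatches[['Language', 'Detected Lang', 'Review']])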


For the following exploration, only English reviews will be used.

data = data[data['Language']=='en']
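
You can verify the filter by checking the number of remaining rows; it should match the 405 English reviews counted above:

print(len(data))  # 405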

Word Clouds

Using word clouds is an easy way of seeing the most frequently used words.

First, we extract all the words from all the reviews using the join function. This creates a single string containing the text of every review.

Then all the text is lowercased, which makes sure that words written with different capitalization are counted as the same word.

Stop words (common words like the, when, was) are removed so that they do not appear in the cloud.

text = " ".join(review for review in data.Review)
text = text.lower()
stopwords = set(STOPWORDS)  # note: this shadows the nltk stopwords import above; it is re-imported later where needed
wordcloud = WordCloud(stopwords=stopwords, collocations=False).generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
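
If you want to keep the image rather than just display it, WordCloud can also write it straight to disk (the filename here is just an example):

wordcloud.to_file("wordcloud.png")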

You can limit the number of words as well.

wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=8).generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


Sometimes, it makes sense to generate a word cloud for the negative reviews, and another one for the positive reviews.

reviews_bad = data[data['Score']<3]
text = " ".join(review for review in reviews_bad.Review)
text = text.lower()
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=12).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

We can repeat the same process for the positive ones.

reviews_good = data[data['Score']>3]
text = " ".join(review for review in reviews_good.Review)
text = text.lower()

stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=12).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Another way of seeing the most frequent terms is to use CountVectorizer from the scikit-learn package.

from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
co = CountVectorizer(stop_words=stops)
counts = co.fit_transform(data.Review)
# on scikit-learn >= 1.0, use co.get_feature_names_out() instead of co.get_feature_names()
pd.DataFrame(counts.sum(axis=0), columns=co.get_feature_names()).T.sort_values(0, ascending=False).head(20)
0
food 260
good 226
pizza 204
staff 176
service 169
restaurant 143
place 140
us 133
one 127
great 120
nice 110
italian 101
waiter 100
friendly 96
ordered 94
pasta 87
would 85
came 77
wine 76
table 75

Using CountVectorizer we can also count n-grams (sequences of words) rather than single words. In this case, we are looking at the most frequent bigrams (two-word sequences).

You can increase the ngram_range to obtain longer sequences of words, as shown in the sketch after the output below.

stops = set(stopwords.words('english'))
co = CountVectorizer(ngram_range=(2,2), stop_words=stops)
counts = co.fit_transform(data.Review)
pd.DataFrame(counts.sum(axis=0), columns=co.get_feature_names()).T.sort_values(0, ascending=False).head(20)
0
storie sapori 30
food good 24
friendly staff 18
good service 16
italian food 16
barrakka gardens 15
really good 15
staff friendly 15
would recommend 13
good food 13
upper barrakka 13
food great 12
italian restaurant 12
pizza good 11
excellent service 11
go back 11
well done 11
great location 11
food service 11
best pizza 11

You can also clean the data to whatever level suits your analysis. It is not always desirable to clean the text aggressively, since doing so can discard useful details.

In this case, a function is created that does the following:

  • A lemmatizer is loaded (words are reduced to their dictionary form; for instance, runs becomes run)
  • A list of stop words is loaded
  • The text is normalized, meaning all compatibility characters are replaced with their equivalents
  • The text is then encoded to ASCII and decoded back to a string. This makes sure that only the required characters remain
  • The text is lowercased
  • Punctuation is stripped and the sentence is split into a list of words
  • All the words are lemmatized and any stop words are removed
def basic_clean(sentence):
    # requires nltk.download('wordnet') and nltk.download('stopwords')
    wnl = nltk.stem.WordNetLemmatizer()
    stop_words = nltk.corpus.stopwords.words('english')
    # normalize compatibility characters, drop anything outside ASCII, and lowercase
    text_norm = (unicodedata.normalize('NFKD', sentence)
                 .encode('ascii', 'ignore')
                 .decode('utf-8', 'ignore')
                 .lower())
    # strip punctuation and split into words
    words = re.sub(r'[^\w\s]', '', text_norm).split()
    # lemmatize each word and drop stop words
    txt = [wnl.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(txt)
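
As a quick check (an added example), the function can be run on a single sentence. Note that WordNetLemmatizer assumes the noun part of speech by default, so verb forms such as running are not reduced to run:

print(basic_clean("The waiters were running between the tables!"))
# -> 'waiter running table'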

By calling the basic_clean() function, we obtain a new column containing the cleaned text of each review.

data['Cleaned'] = data['Review'].apply(basic_clean)
print(data.head(3))
Score            Date                         Title  \
3       5  August 8, 2020          We love your pizzas!   
9       5  August 6, 2020  Most enjoyable meal in Malta   
32      5   March 6, 2020                        #Osema   

                                               Review Language Detected Lang  \
3   I went to this restaurant yesterday evening wi...       en            en   
9   Came to this little restaurant by accident, ve...       en            en   
32  Good and reasonable food nice ambient highly r...       en            en   

                                              Cleaned  
3   went restaurant yesterday evening friend order...  
9   came little restaurant accident glad busy luck...  
32  good reasonable food nice ambient highly recom...  


This new column can be used to generate word clouds or n-grams.

text = " ".join(review for review in data.Cleaned)
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
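
Similarly, the earlier n-gram extraction can be run on the cleaned column (a short sketch following the same pattern; no stop word list is needed here, since basic_clean has already removed the stop words):

co = CountVectorizer(ngram_range=(2,2))
counts = co.fit_transform(data.Cleaned)
pd.DataFrame(counts.sum(axis=0), columns=co.get_feature_names()).T.sort_values(0, ascending=False).head(10)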

That concludes our scraping and exploration series. See you for the next one.