by Robert Abela
This article is part of a series that goes through all the steps needed to write a script that automatically scrapes information from a website. The first article in this series was an Introduction to Python scraping.
This article is about scraping TripAdvisor reviews using Selenium. TripAdvisor is a large travel platform with hundreds of millions of reviews covering many kinds of listings, including restaurants. The platform provides an API for reading data programmatically, but access is tightly controlled: only a limited number of API keys are issued, and the Content API may not be used for data analysis, academic research, or any purpose not associated with a consumer-facing (B2C) travel website or application.
This leaves scraping as the only option for downloading reviews programmatically. We are going to download the reviews of one particular restaurant. For the sake of this article, we chose Storie & Sapori – La Valletta. That said, the approach and the code described should be applicable to other restaurants on TripAdvisor.
The More link
As highlighted in the screenshot below, TripAdvisor only loads part of the review initially and waits until the user clicks More to load the rest.
<span class="taLnk ulBlueLinks" onclick="widgetEvCall('handlers.clickExpand',event,this);">More</span>
The score given by the reviewer is shown using an image made up of five circles, some of which are filled in green. The number of green circles indicates the score out of 5. There is no textual indication of the score, which makes it harder to scrape. On closer inspection, however, the CSS class of the SPAN element encodes the score. If we split the class attribute on the '_' character, the substring at index 3 is the score. The examples below result in scores of 40 and 10 respectively (that is, 4 and 1 circles out of 5):
<span class="ui_bubble_rating bubble_40"></span>
<span class="ui_bubble_rating bubble_10"></span>
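The split described above can be tried on its own, without a browser. The sketch below is only an illustration: `parse_score` is a hypothetical helper name, and it assumes the class attribute always has the `ui_bubble_rating bubble_NN` shape shown in the two examples.

```python
def parse_score(class_attr):
    """Return the rating encoded in a class attribute such as
    'ui_bubble_rating bubble_40' (assumed format, taken from the examples)."""
    # Splitting on "_" gives ["ui", "bubble", "rating bubble", "40"],
    # so the score is the substring at index 3.
    parts = class_attr.split("_")
    return int(parts[3])


print(parse_score("ui_bubble_rating bubble_40"))  # 40
print(parse_score("ui_bubble_rating bubble_10"))  # 10
```

If TripAdvisor changes the class name, the index will no longer line up, so this is the first place to look when scores start coming out wrong.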
Finding the relevant HTML elements
Below is a cleaned HTML DIV element that holds one sample review (press F12 to open the developer tools in your browser and inspect the code). The relevant parts are labelled with HTML comments.
Saving to CSV file
Once the data is gathered, it will be saved as a CSV file that can be analysed later. Python's standard library includes the csv module for handling CSV data, and the sample below shows how to use it:
import csv

csvFile = open("file.csv", "w", newline='', encoding="utf-8")
csvWriter = csv.writer(csvFile)
csvWriter.writerow(('Heading1','Heading2'))
csvWriter.writerow(('row1col1', 'row1col2'))
csvWriter.writerow(('row2col1', 'row2col2'))
csvFile.close()
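As a variation on the sample above, the same file can be written inside a `with` block, which closes the file even if an error occurs part-way through. This is a minimal sketch rather than the approach used in the rest of the article; `write_rows` is a hypothetical helper name.

```python
import csv


def write_rows(path, header, rows):
    """Write a header row followed by data rows to a CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        writer.writerows(rows)


write_rows(
    "file.csv",
    ("Heading1", "Heading2"),
    [("row1col1", "row1col2"), ("row2col1", "row2col2")],
)
```

Fields that contain commas, such as review dates or review text, are quoted automatically by the writer.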
A first version of the script that is as simple as possible to get the job done is found below:
import csv
import time
from selenium import webdriver

URL = "https://www.tripadvisor.com/Restaurant_Review-g190328-d8867662-Reviews-Storie_Sapori_La_Valletta-Valletta_Island_of_Malta.html"

# Note: this uses the Selenium 3 API; Selenium 4 replaces
# find_element_by_xpath(...) with find_element(By.XPATH, ...)
driver = webdriver.Chrome("./chromedriver")
driver.get(URL)

# Prepare CSV file
csvFile = open("reviews.csv", "w", newline='', encoding="utf-8")
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['Score','Date','Title','Review'])

# Find and click the More link (to load the full review text)
driver.find_element_by_xpath("//span[@class='taLnk ulBlueLinks']").click()
time.sleep(5) # Wait for reviews to load

reviews = driver.find_elements_by_xpath("//div[@class='ui_column is-9']")
num_page_items = min(len(reviews), 10)

# Loop through the reviews found
for i in range(num_page_items):
    # get the score, date, title and review
    score_class = reviews[i].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class")
    # The score is the substring at index 3, e.g. "50"
    score = score_class.split("_")[3]
    date = reviews[i].find_element_by_xpath(".//span[@class='ratingDate']").get_attribute("title")
    title = reviews[i].find_element_by_xpath(".//span[@class='noQuotes']").text
    review = reviews[i].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "")
    # Save to CSV
    csvWriter.writerow((score, date, title, review))

# Close CSV file and browser
csvFile.close()
driver.close()
Once the scraper runs successfully, it will produce a file named reviews.csv with the structure of the sample below:
Score,Date,Title,Review
50,"August 8, 2020",We love your pizzas!,"I went to this restaurant yesterday evening with some friends, ..."
50,"August 6, 2020",Most enjoyable meal in Malta,"Came to this little restaurant by accident, very glad we did..."
50,"March 6, 2020",#Osema,Good and reasonable food nice ambient highly recommended excellent service will visit again for sure
50,"March 6, 2020",#Osems,Excellent service with good food and very nice and kind staff will not be our last time here...
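To give an idea of how such a file can be analysed later, here is a minimal sketch that parses two of the sample rows and converts each score to a value out of 5. The function name `load_reviews` is hypothetical, and dividing by 10 is an assumption based on the bubble class names (50 meaning five circles). The sample is read from a string so that the snippet runs without the scraper; with a real file, the result of `open("reviews.csv", newline="", encoding="utf-8")` can be passed to `csv.DictReader` instead.

```python
import csv
import io

# Two rows copied from the sample output above.
SAMPLE = '''Score,Date,Title,Review
50,"August 8, 2020",We love your pizzas!,"I went to this restaurant yesterday evening with some friends, ..."
50,"August 6, 2020",Most enjoyable meal in Malta,"Came to this little restaurant by accident, very glad we did..."
'''


def load_reviews(csv_text):
    """Parse CSV text into a list of dicts, with Score as a number out of 5."""
    reviews = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: the raw score is ten times the number of circles.
        row["Score"] = int(row["Score"]) / 10
        reviews.append(row)
    return reviews


reviews = load_reviews(SAMPLE)
average = sum(r["Score"] for r in reviews) / len(reviews)
print(len(reviews), average)  # 2 5.0
```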
This first version of the scraper was deliberately kept as simple as possible. It has the following limitations, which will be addressed in the next article in the series:
- Only reviews on the first page are saved
- At most 10 reviews are saved
- TripAdvisor filters out any non-English reviews by default
- The scraper will wait 5 seconds after clicking More, even if the reviews load in less time
- General script maintainability and efficiency can be improved
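The fixed 5-second wait in particular can be replaced by polling: check repeatedly whether the content has loaded and continue as soon as it has. Selenium ships its own helper for this, `WebDriverWait`; the sketch below is not that helper but a plain-Python illustration of the idea, so that it runs without a browser. The names `wait_until` and `reviews_loaded` are hypothetical, and `reviews_loaded` merely simulates a page that becomes ready on the third check.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.2):
    """Call condition() until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "have the reviews loaded yet?" (simulated, no browser involved).
calls = {"count": 0}


def reviews_loaded():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(reviews_loaded, timeout=5.0, poll_interval=0.1)
print(calls["count"])  # 3
```

Here the script continues after roughly 0.2 seconds instead of always waiting the full 5 seconds, and a `TimeoutError` makes it obvious when the content never appears.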