by Robert Abela

This article is part of a series that goes through all the steps needed to write a script that reads information from a website and saves it locally. Make sure that all the pre-requisites (listed at the end of this article) are in place before continuing.

Installing Selenium and other requirements

Selenium setup requires two steps:

  1. Install the Selenium library using the command: pip install selenium
  2. Download the Selenium WebDriver for your browser, making sure its version exactly matches your browser's version

Chrome drivers can be found on chromium.org

Scraper setup requires two commands:

  1. pip install requests
  2. pip install beautifulsoup4
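
To verify that everything was installed correctly, you can import the three libraries and print their versions. This is just a quick sanity check; the version numbers on your machine will differ:

import requests
import bs4
import selenium

#Print the installed versions as a quick sanity check
print(requests.__version__)
print(bs4.__version__)
print(selenium.__version__)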

Scraping a website

What is web scraping?

Scraping is like browsing to a website and copying some content, but it is done programmatically (e.g. using Python), which means that it is much faster. The limit to how fast you can scrape is basically your bandwidth and computing power (and how much the web server allows). Technically the process can be divided into two parts:

  1. Crawling is the first part, which basically involves opening a page and finding all the interesting links in it, e.g. shops listed in a section of the yellow pages.
  2. Scraping comes next, where all the links from the previous step are visited to extract specific parts of the web page, e.g. the address or phone number.
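
To make these two steps concrete, here is a minimal sketch that crawls a listing page for links and then scrapes each linked page for a phone number. The URL and the CSS selectors are placeholders, not a real site; any real target will need its own selectors:

import requests
from bs4 import BeautifulSoup

#Crawling: open the listing page and collect the interesting links
#(the URL and CSS classes below are placeholders, not a real site)
listing = requests.get("https://example.com/shops")
soup = BeautifulSoup(listing.content, 'html.parser')
links = [a['href'] for a in soup.select('a.shop-link')]

#Scraping: visit each link and extract the specific parts we want
for url in links:
    page = requests.get(url)
    detail = BeautifulSoup(page.content, 'html.parser')
    phone = detail.select_one('.phone')
    if phone is not None:
        print(url, phone.get_text(strip=True))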

Challenges of Scraping

One main challenge is that websites vary considerably, so you will likely end up writing a scraper specific to each site you are dealing with. Even if you stick to the same websites, updates and re-designs will likely break your scraper in some way (you will find yourself pressing F12 to open the browser's developer tools frequently).

Some websites do not tolerate being scraped and will employ different techniques to slow down or stop scraping. Another aspect to consider is the legality of the process, which depends, amongst other things, on where the server is located, the site's terms of service and what you do with the data once you have it.

An alternative to web scraping, when available, is an Application Programming Interface (API), which offers a way to access structured data directly (using formats like JSON and XML) without dealing with the visual presentation of the web pages. Hence it is always a good idea to check whether a website offers an API before investing time and effort in a scraper.
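
As an illustration, PyPI exposes a JSON API that returns package metadata directly, with no HTML to parse. The field names below are assumed from the API's response layout, so inspect the actual response on your side:

import requests

#PyPI's JSON API returns structured package data, no HTML involved
response = requests.get("https://pypi.org/pypi/requests/json")
data = response.json()
#Field names assumed from the API's response layout
print(data["info"]["version"])
print(data["info"]["summary"])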

Scraping libraries

While there are many ways to get data from web pages (e.g. using Excel, browser plugins or other tools), this article will focus on how to do it with Python. Having the flexibility of a programming language makes it a very powerful approach, and there are very good libraries available, such as Beautiful Soup, which will be used in the sample below. There is a very good write-up on how to Build a Web Scraper With Beautiful Soup. Another framework to consider is Scrapy.

What is Selenium? Why is it needed?

Some websites use JavaScript to load parts of the page later (not directly when the page is loaded). Some links also invoke JavaScript, with the target URL computed only on click. These techniques are becoming increasingly common (sometimes even as an anti-scraping measure) and unfortunately libraries like Beautiful Soup do not handle them well.

In comes Selenium, a framework designed primarily for automated web application testing. It allows developers to programmatically control a browser from different programming languages. Since with Selenium a real browser is rendering the page, JavaScript is executed normally and the problems mentioned above are avoided. This of course requires more resources and makes the whole process slower, so it is wise to use it only when strictly required.

Beautiful Soup and Selenium can also be used together as shown in this interesting article at freecodecamp.org.
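
The basic idea of combining the two is to let Selenium render the page (executing any JavaScript) and then hand the resulting HTML to Beautiful Soup for parsing. A minimal sketch, assuming the Chrome WebDriver is set up as described above:

from selenium import webdriver
from bs4 import BeautifulSoup

#Selenium renders the page, executing any JavaScript on it
driver = webdriver.Chrome()
driver.get("http://www.python.org")

#Beautiful Soup then parses the fully rendered HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title.string)

driver.quit()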

Building a first scraper

This first scraper will perform the following steps:

  • Visit the page and parse source HTML
  • Check that the page title is as expected
  • Perform a search
  • Look for expected result
  • Get the link URL

Two implementations, one using Beautiful Soup and one using Selenium, can be found below.

Scraper using Beautiful Soup

import requests
from bs4 import BeautifulSoup

#Visit Page and parse source HTML
page = requests.get("http://www.python.org")
soup = BeautifulSoup(page.content, 'html.parser')

#Check title is as expected
assert "Python" in soup.title.string

#Perform the search using an HTTP GET request
page = requests.get("https://www.python.org/search/?q=pip")
soup = BeautifulSoup(page.content, 'html.parser')

#Look for expected result (:-soup-contains replaces the deprecated :contains selector)
link = soup.select_one('a:-soup-contains("PEP 439")')
#Get the link URL
print(link['href'])
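
Note that select_one returns None when no matching element is found, so in a real scraper it is worth checking the result before accessing link['href'], otherwise the script will stop with a TypeError.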

Scraper using Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By

#Open browser and visit page
driver = webdriver.Chrome()
driver.get("http://www.python.org")

#Check title is as expected
assert "Python" in driver.title

#Find search field (find_element with a By locator is the current Selenium syntax)
search_field = driver.find_element(By.ID, "id-search-field")
#Enter search term
search_field.send_keys("pip")
#Find and click the Search button
search_button = driver.find_element(By.ID, "submit")
search_button.click()

#Look for expected result
link = driver.find_element(By.PARTIAL_LINK_TEXT, "PEP 439")
#Get the link URL
print(link.get_attribute("href"))

#Close the browser and end the WebDriver session
driver.quit()
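
If you do not need to watch the browser while the script runs, Chrome can also be started in headless mode. A minimal sketch using Chrome's classic --headless flag:

from selenium import webdriver

#Pass Chrome options to run the browser without a visible window
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

This is handy when the scraper runs on a server without a display.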

Pre-requisites

This is part of a series that goes through all the steps needed to write a script that reads information from a website and saves it locally. This section lists all the technologies you should be familiar with and all the tools that need to be installed.

Basic knowledge of HTML and web technologies

This article series assumes a basic understanding of the following:

  • Python 3.x
  • HTML document structure
  • Attributes of common HTML elements
  • Basic JavaScript and AJAX
  • CSS classes
  • HTTP request parameters
  • Awareness of lazy-loading techniques

A good place to start is W3Schools.

Installing Python

As part of the pre-requisites, installing the correct version of Python and pip is required. This setup section assumes a Windows operating system, but it should be easily transferable to macOS or Linux.

Which Python version should one use: Python 2 or 3? This might have been a point of discussion in the past: since the two versions are not compatible, one had to pick a side (Python 2.7, the last release of the 2.x line, came out in 2010 and reached end of life in January 2020). However, today (2020) it is safe to go with version 3.x, the latest stable version at the time of writing being 3.8.

Start by downloading the latest version of Python 3 from the official website. Install it as you would any other software, and make sure you tick the option to add Python to the PATH during installation.

To confirm that it was installed successfully, open a Command Prompt window and type python. You should see something like the following:

C:\WINDOWS\System32>python
Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

Installing and using pip

pip is the package installer for Python. It is very likely that it came along with your Python installation. You can check by entering pip -V in a Command Prompt window; you should see something like the following:

C:\WINDOWS\System32>pip -V
pip 20.1.1 from c:\path\to\python\python38-32\lib\site-packages\pip (python 3.8)
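
Even when pip is already present, it is often worth upgrading it to the latest version:

C:\WINDOWS\System32>python -m pip install --upgrade pip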

If pip is not available, it needs to be installed by following these steps:

  • Download get-pip.py to a folder on your computer
  • Open a command prompt
  • Navigate to the folder where get-pip.py was saved
  • Run the following command: python get-pip.py
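
Once the script finishes, run pip -V again to confirm that pip is now available.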