How to Web Scrape Multiple Pages With Python

This article teaches you how to web scrape multiple pages with Python so that you can get the data you want from websites.

This is helpful because data has become a driving force behind the success of many companies. With businesses making more data-driven decisions, the value that data brings is hard to overstate. For this reason, web scrapers have become popular tools for collecting valuable data from multiple web pages across a large set of websites.

A tutorial on how to web scrape multiple pages automatically (instead of manually)

Scraping data manually, by browsing relevant pages, copying their content, and pasting it onto a spreadsheet, can be an exhausting and time-consuming task. It's even harder if you don't have the proper knowledge or expertise beforehand.

Fortunately, programming languages like Python have pre-built libraries that make it easy for programmers and developers to web scrape multiple pages. One such library is Beautiful Soup, which parses HTML and XML documents to extract data.

In this article, we’ll discuss how to set up Python, scrape multiple web pages with Beautiful Soup, and prevent your IP address from getting banned.

Setting Up Beautiful Soup with Python

Before getting started, we need to ensure we have the essential tools for web scraping. To scrape a web page with Python, you need to install… well, Python, and specifically the latest Python 3 version.

To install Python 3 on your Windows computer:

  1. Go to Python's official website and find a stable Python 3 release
  2. Click on the download link that matches your operating system
  3. Once the download is complete, run the .exe installer
  4. If you want all users on your device to access Python, check the Install Launcher for All Users option
  5. Check the Add python.exe to PATH option and click Install Now

This will install Python with the default settings on your computer. To verify your installation, open the command line by clicking on Start and typing "cmd" in the search box. After opening the command line, type in the following:

python --version

Hit Enter, and the command line should output something like this:

Output
Python 3.10.10

Now that Python is installed, we need to download Beautiful Soup, which, if you remember, is a pre-built library to help developers web scrape with Python.

To install Beautiful Soup, you need to open up the command prompt, and type in the following command:

pip install beautifulsoup4

This should install Beautiful Soup. Now, you'll also need the Requests library to make HTTP requests, as well as a parser to handle the HTML and XML documents.

To install Requests, type the following into the command prompt:

pip install requests

Then, to install the parser:

pip install lxml

How to Scrape a Single Web Page with Beautiful Soup

Before moving on to scraping multiple web pages, we should first understand how to scrape a single page, just so you get the hang of using Beautiful Soup.

To get started, we’ll need to build a web scraper by importing the relevant libraries to Python, which is as simple as writing two lines of code:

import requests
from bs4 import BeautifulSoup

See, just two lines of code, as promised.

Now, we’ll dive into the good stuff, scraping our first web page!

Getting the Site’s HTML

To extract the data from a site and display it in a readable format, we first have to make a request to the website we wish to scrape.

Basically, this is how our code breaks down:

  1. We make a GET request to the website and receive a response, which is stored in our result variable
  2. We then read the response body with the .text attribute and store it in a content variable to get the results in, you guessed it, text form
  3. Within our soup variable, we use the lxml parser to turn that text into a BeautifulSoup object holding all the data
  4. Once we have the soup object, we can print the results as HTML by typing print(soup.prettify())

This is what that all boils down to in Python:

website = 'https://mywebsite.com/blog/sample-page'
result = requests.get(website)  # request the page

content = result.text  # the response body as a string of HTML

soup = BeautifulSoup(content, 'lxml')  # parse it with the lxml parser

print(soup.prettify())  # print the HTML with readable indentation

Analyzing the Site Structure and Code

Scraping entire websites can give us a huge load of data, including a lot of junk we don't need. That's why it's important to analyze the site structure along with the HTML code we extracted into our previous soup object to come up with the best strategy for scraping the page.

To know what to look for, you need to identify your goals. What do you wish to accomplish by collecting data from this specific site? For our example, we want to identify the heading of the web page. To do that, we'd:

  1. Go to the website and open the page we want to scrape
  2. Right-click on the heading of the page and click 'Inspect'

After clicking 'Inspect', a new panel will open within your window: this is the site's source code. Within the source code, you'll notice one line is highlighted. This is the code for the element we right-clicked on.
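
For example, the highlighted markup might look something like this (a hypothetical structure that matches the class names used in the rest of this tutorial; your site's markup will differ):

<div class="page-title">
  <h1>Sample Page</h1>
</div>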

Locating Elements in Beautiful Soup

We need to locate the heading element in Beautiful Soup so we can extract it from the parsed document.

Let’s say the heading of the web page was nested inside a div element with the class “page-title”. To find it, we would input the following line of code:

box = soup.find('div', class_='page-title')

Next, if the title is enclosed in an H1 tag, we’ll search for the H1 tag within our previously defined box and use get_text() to extract the text:

title = box.find('h1').get_text()
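
Note that find() returns None when nothing matches, so it's worth guarding against a missing element before calling get_text(). A minimal sketch, reusing the same hypothetical class name:

box = soup.find('div', class_='page-title')

# find() returns None if no matching element exists
if box is None:
    raise SystemExit('No div with class "page-title" found; check the selector')

title = box.find('h1').get_text()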

Exporting the Data

You can export the scraped data into CSV, JSON, or TXT formats. Since the .txt format is the simplest, we’ll go ahead with that:

with open(f'{title}.txt', 'w') as file:
    file.write(title)

We use the f-string in the filename so the file is named after the title we scraped.
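
If you'd rather have a spreadsheet-friendly file, Python's built-in csv module works just as well. Here's a minimal sketch, assuming you want to save the title and website variables collected earlier:

import csv

rows = [(title, website)]  # the values scraped earlier in this tutorial

with open('titles.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['title', 'url'])  # header row
    writer.writerows(rows)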

How to Scrape Multiple Web Pages with Beautiful Soup

Now that we've successfully managed to scrape one web page, we can move on to scraping multiple pages with Python. To do this, we must go through the following steps.

Use a For Loop

If you're extracting multiple pages from a site that numbers its pages, Beautiful Soup makes it very easy to scrape them. That's because these websites have a simple URL structure, typically written like www.mywebsite.com/page/3/.
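
Because the pattern is so regular, you can even generate the page URLs up front. A quick sketch, using the same hypothetical site:

base = 'https://www.mywebsite.com/page/'
pages = [f'{base}{n}/' for n in range(1, 11)]  # URLs for pages 1 through 10

print(pages[2])  # https://www.mywebsite.com/page/3/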

In Beautiful Soup, we can scrape the headings from these web pages by looping through each link. To get that done, we’ll first do it for one page by typing the following:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.mywebsite.com/page/1/'
req = requests.get(URL)

soup = bs(req.text, 'html.parser')

# Grab every div with the class "head", where this example site keeps its titles
titles = soup.find_all('div', class_='head')

print(titles[4].text)  # print the fifth matching title

Looping Through Each Page

After getting the output for one page, we can scrape the titles of all the pages by looping over the page numbers in the same code used above. Here's how that would look if we're scraping the first 10 pages:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.mywebsite.com/page/'

for page in range(1, 11):  # pages 1 through 10
    req = requests.get(URL + str(page) + '/')

    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', class_='head')

    # Each page lists 15 titles (indexes 4 to 18), so offset the numbering
    # by 15 for every page already scraped
    for i in range(4, 19):
        print(f"{(i - 3) + (page - 1) * 15}" + titles[i].text)

After running the code, the output should display the titles of all the posts on the pages.

Looping Through Different URLs

The above method is ideal for sites that have a simple URL structure and label their pages with numbers, but some websites don't have such a straightforward structure, or you might want to scrape pages from several different websites. With the code above, you'd have to create a separate script for each page you want to scrape.

As you'd guess, that's neither effective nor quick, so a better way is to create a list of the URLs you wish to extract the titles from and loop through them. This lets us extract the titles of those pages without rewriting the code.

Here’s a code snippet that would achieve just that:

import requests
from bs4 import BeautifulSoup as bs

URLS = ['https://www.mywebsite.com/', 'https://www.mywebsite.com/blog/sample-page']

for n, url in enumerate(URLS):
    req = requests.get(url)

    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', class_='head')

    # Keep the numbering running across pages (15 titles per page)
    for i in range(4, 19):
        print(f"{(i - 3) + n * 15}" + titles[i].text)

How to Prevent Your IP Address from Getting Banned

When web scraping, you can expect to make many requests before you have the complete set of data you're after. However, making too many requests to a website's server within a short timeframe can get your IP address banned for overloading it.

To prevent your IP from getting blacklisted, you need to appear like a human to web servers rather than being identified as a bot or crawler.
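
One small, commonly used step in that direction is sending a browser-like User-Agent header, since the default Requests header openly identifies your script as a Python client. A brief sketch; the User-Agent string below is only an example value:

import requests

# A browser-style User-Agent (example value; any modern browser string works)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

result = requests.get('https://mywebsite.com/blog/sample-page', headers=headers)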

An effective way to do this would be to web scrape pages in short bursts at random intervals by controlling the crawling rate. By slowing down the crawl rate, it would appear as if a human is crawling the web page rather than a super-speed bot.

So, how can we control the crawl rate? We'll use the randint() and sleep() functions. First, look at this piece of code to get an understanding of how the two work together:

from time import sleep
from random import randint

for i in range(0, 5):
    x = randint(2, 5)
    print(x)
    sleep(x)

Now, what does the above code do? The randint() function selects a random integer from the given range, in this case 2 to 5. The sleep() function then pauses the script for that many seconds before the loop continues.

This will give us the effect of making requests in short random bursts, which would prevent our IP from getting blacklisted.

To apply this to web scraping, you can combine the delay with the multi-page code from earlier:

import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

URL = 'https://www.mywebsite.com/page/'

for page in range(1, 11):
    req = requests.get(URL + str(page) + '/')

    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', class_='head')

    for i in range(4, 19):
        print(f"{(i - 3) + (page - 1) * 15}" + titles[i].text)

    # Wait 2 to 10 seconds before requesting the next page
    sleep(randint(2, 10))

Key Takeaways

Web scraping with Python can be a great way to collect and analyze valuable data. With the Beautiful Soup library, you can create automated scripts that request a single web page or a whole list of URLs, letting you scrape multiple web pages with Python.

To scrape multiple web pages:

  • Download and install the Beautiful Soup library in Python
  • Make a list of the URLs and scrape them with a for loop
  • Export the data into a readable format

Making multiple requests in a short time can overload a website's server, which can lead to a ban on your IP address. To prevent this, you can control the crawl rate so your script's requests are spread out.

Collecting data from multiple web pages can be difficult to arrange and organize into a readable spreadsheet. To get started with web scraping, you can sign up for Sheets Genie to automate your Google Sheets and organize data more efficiently to become a web scraping powerhouse.

Learn more about how Sheets Genie can help you automate web scraping and organize data more efficiently by clicking here.