Web Scraping using Beautiful Soup

By Sakshi Pandey, Community Contributor

What is Web Scraping and Why is it Important?

Web scraping is the act of extracting information from a web application. Where screen scraping only lets users copy the data visible on a page, web scraping delves deeper and obtains the HTML code lying beneath it.

Web scraping can be used to extract all of the data from a website or only the specific information a user requires. For example, rather than scraping an entire article along with all of its reviews and ratings, a user may scrape only the comments in order to gauge the general sentiment toward the article in question.

Automated web scraping expedites the data gathering process and allows users to collect large amounts of data, which can then be used to gain insights. Today's emphasis on data analysis, sentiment analysis, and machine learning has made web scraping an invaluable tool for any IT professional.

What is BeautifulSoup?

Automated web scraping is made possible by packages such as BeautifulSoup and Selenium. BeautifulSoup is a powerful Python library that is very helpful for scraping and parsing data from web pages.

The name BeautifulSoup describes the package's purpose well: it separates and pulls out the data a user requires from the "soup" that HTML and XML files can be, by building a tree of Python objects. Data can be pulled out through various means, such as tags and NavigableString objects.
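
As a quick illustration of that tree, here is a minimal, self-contained sketch; the HTML string is invented purely for demonstration:

from bs4 import BeautifulSoup

html = "<html><body><h1>Heading</h1><p class='intro'>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1)           # the <h1> Tag object
print(soup.h1.string)    # "Heading", a NavigableString
print(soup.find("p", class_="intro").get_text())  # "Hello world"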

BeautifulSoup can be used with even greater efficiency alongside Selenium, whose WebDriver protocol runs scripts across popular browsers such as Chrome, Internet Explorer, Firefox, and Safari. In conjunction with Selenium, BeautifulSoup can perform automated web scraping at a large scale, across multiple web pages and browsers, enabling users to gather larger datasets.

Example: Web Scraping with Beautiful Soup

Before walking through Web Scraping with Selenium Python and Beautiful Soup, it is important to have all the prerequisites in place.

Prerequisites:

1. Set up a Python Environment. This tutorial uses Python 3.11.4.

2. Install Selenium. The pip package installer is the most efficient method for this and can be used directly from a conda terminal, Linux terminal, or Anaconda Prompt.

pip install selenium

3. Install BeautifulSoup with the pip package installer as well.

pip install beautifulsoup4

4. Download the latest WebDriver for the browser you wish to use, or install webdriver_manager to fetch a compatible WebDriver automatically.

pip install webdriver_manager

The versions of the aforementioned packages used for this tutorial are:

  • BeautifulSoup 4.12.2
  • Pandas 2.0.2
  • Selenium 4.10.0
  • Webdriver_Manager 3.8.6

Steps for Web Scraping with Beautiful Soup

Follow the steps below to perform web scraping with Beautiful Soup:

Step 1: Import the packages required for the script.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import re
from webdriver_manager.chrome import ChromeDriverManager

Selenium is required to automate the Chrome browser, and since Selenium uses the WebDriver protocol, the webdriver_manager package is needed to obtain a ChromeDriver compatible with the browser version being used. Selenium will also be used to scrape the webpage.

BeautifulSoup is needed to parse the HTML of the webpage. The re module is imported to match the user's input keywords using regular expressions. Pandas will be used to write the keywords, the matches found, and the number of occurrences into an Excel file.

Step 2: Obtain the version of ChromeDriver compatible with the browser being used.

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Step 3: Take the user’s input for the URL of a webpage to scrape.

val = input("Enter a url:")

wait = WebDriverWait(driver, 10)
driver.get(val)

# Wait until the browser has finished navigating to the requested URL
wait.until(EC.url_to_be(val))

get_url = driver.current_url
if get_url == val:
    page_source = driver.page_source

For this example, the user input is: https://www.browserstack.com/guide/cross-browser-testing-on-wix-websites

The driver gets this URL, and a wait command is used before proceeding to the next step, to ensure that the page has loaded.
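
EC.url_to_be only confirms the address bar; on pages that render content after navigation, you may also want to wait for an element to appear. A minimal sketch, assuming a wait on the <body> tag is sufficient for the target page:

from selenium.webdriver.common.by import By

# Wait until the <body> element is present in the DOM
wait.until(EC.presence_of_element_located((By.TAG_NAME, "body")))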

Step 4: Use BeautifulSoup to parse the HTML scraped from the webpage.

soup = BeautifulSoup(page_source, features="html.parser")

A soup object is created from the HTML scraped from the webpage. 
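
To sanity-check the parse before moving on, printing something simple from the tree works well (this check is an optional addition, not part of the original script):

print(soup.title.string if soup.title else "No <title> tag found")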

Step 5: Parse the soup for user-input keywords.

multiple = input("Would you like to enter multiple keywords?(Y/N)")

if multiple == "Y":
    keywords = []
    matches = []
    len_match = []
    num_keyword = input("How many keywords would you like to enter?")
    count = int(num_keyword)
    while count != 0:
        keyword = input("Enter a keyword to find instances of in the article:")
        keywords.append(keyword)
        match = soup.body.find_all(string=re.compile(keyword))
        matches.append(match)
        len_match.append(len(match))
        count -= 1
    df = pd.DataFrame({"Keyword": pd.Series(keywords), "Number of Matches": pd.Series(len_match), "Matches": pd.Series(matches)})

elif multiple == "N":
    keyword = input("Enter a keyword to find instances of in the article:")
    matches = soup.body.find_all(string=re.compile(keyword))
    len_match = len(matches)
    df = pd.DataFrame({"Keyword": pd.Series(keyword), "Number of Matches": pd.Series(len_match), "Matches": pd.Series(matches)})

else:
    print("Error, invalid character entered.")

A user input is taken to determine whether the webpage needs to be searched for multiple keywords. If so, multiple keyword inputs are taken from the user, matches are parsed from the soup object, and the number of matches is counted. If the user only wants a single keyword, the same operations are performed for that one keyword. In both cases the results are stored in a dataframe; any other input produces an error message.

Step 6: Store the collected data in an Excel file.

df.to_excel("Keywords.xlsx", index=False)

driver.quit()

The dataframe is written into an Excel file titled Keywords.xlsx, and the driver is quit.
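
Note that pandas delegates .xlsx writing to an engine such as openpyxl, which is not always installed alongside pandas; if to_excel raises an ImportError on your setup, installing the engine should resolve it:

pip install openpyxl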

Output:

Excel File Output:

[Image: Excel file output of web scraping using Beautiful Soup]

The keywords, the matches found for each keyword, and the number of matches can be viewed in the Excel file.

Web Scraping Ethically

Although web scraping is generally legal, it can raise ethical and legal issues. For example, copyright infringement and downloading information that is clearly meant to be private are violations. Many academic journals and newspapers require paid subscriptions from users who wish to access their content.

Downloading those articles and journal papers without authorization is a violation and could lead to serious consequences. Web scraping can also cause other problems, such as overloading a server with requests, slowing the site down, or even exhausting its resources and crashing it.

Therefore, it's vital to communicate with publishers or website owners to ensure that you're not violating any policies or rules while scraping their content.
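
One practical first step, offered here as an illustrative sketch rather than part of the tutorial itself, is checking a site's robots.txt with Python's standard library before scraping (the URLs below are hypothetical):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # hypothetical site
rp.read()

# True if the site's rules allow this user agent to fetch the page
print(rp.can_fetch("*", "https://www.example.com/some-article"))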
