Web Scraping using Beautiful Soup
By Sakshi Pandey, Community Contributor - June 14, 2023
What is Web Scraping and Why is it Important?
Web scraping is the act of extracting information from a web application. Where screen scraping only lets users capture the data visible on a page, web scraping can delve deeper and obtain the underlying HTML code.
Web scraping can be used to extract all of the data from a website or only the specific information a user requires. For example, instead of scraping an article together with all of its reviews and ratings, a user may scrape only the comments in order to gauge the general sentiment towards the article in question.
Automated web scraping expedites data gathering and allows users to collect large amounts of data that can then be used to gain insights. Today's emphasis on data analysis, sentiment analysis, and machine learning has made web scraping an invaluable tool for any IT professional.
What is BeautifulSoup?
Automated web scraping is made possible by packages such as BeautifulSoup and Selenium. BeautifulSoup is a powerful Python library that is very helpful for scraping and parsing data from web pages.
The name BeautifulSoup describes the package's purpose well. It pulls out the data a user needs from the "soup" that HTML and XML files can be, by building a tree of Python objects. Data can then be extracted in various ways, for example through tags and NavigableString objects, as the short sketch below illustrates.
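For instance, the following sketch (using a made-up HTML string rather than a real page) shows how the parse tree can be navigated by tag name and how text nodes are exposed as NavigableString objects:

from bs4 import BeautifulSoup, NavigableString

# A tiny, made-up HTML document used purely for illustration.
html = "<html><body><h1>Sample Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigating by tag name returns a Tag object.
print(soup.h1.name)      # h1
print(soup.h1.string)    # Sample Title
print(isinstance(soup.h1.string, NavigableString))  # True

# find_all() returns every matching tag in the tree.
for p in soup.find_all("p"):
    print(p.get_text())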
BeautifulSoup becomes even more effective when used alongside Selenium, whose WebDriver protocol allows scripts to run across popular browsers such as Chrome, Internet Explorer, Firefox, and Safari. In conjunction with Selenium, BeautifulSoup can perform automated web scraping at scale, across multiple web pages and browsers, enabling users to gather larger datasets.
Example: Web Scraping with Beautiful Soup
Before walking through web scraping with Selenium and Beautiful Soup in Python, it is important to have all the prerequisites in place.
Pre-Requisites:
1. Set up a Python Environment. This tutorial uses Python 3.11.4.
2. Install Selenium. The pip package installer is the most efficient way to do this and can be run directly from a Linux terminal, the Anaconda Prompt, or a conda terminal.
pip install selenium
3. Install BeautifulSoup with the pip package installer as well.
pip install beautifulsoup4
4. Download the latest WebDriver for the browser you wish to use, or install webdriver_manager so the script can automatically fetch a driver that matches your browser.
pip install webdriver_manager
The versions of the aforementioned packages used for this tutorial are listed below; a quick way to verify your own installation follows the list:
- BeautifulSoup 4.12.2
- Pandas 2.0.2
- Selenium 4.10.0
- Webdriver_Manager 3.8.6
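To confirm what is installed locally, a check like the one below prints the versions of the main packages (the numbers on your machine may of course differ):

import bs4
import pandas
import selenium

print("BeautifulSoup:", bs4.__version__)
print("Pandas:", pandas.__version__)
print("Selenium:", selenium.__version__)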
Steps for Web Scraping with Beautiful Soup
Follow the steps below to perform web scraping with Beautiful Soup:
Step 1: Import the packages required for the script.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import re
from webdriver_manager.chrome import ChromeDriverManager
Selenium is required to automate the Chrome browser, and since Selenium uses the WebDriver protocol, the webdriver_manager package is needed to obtain a ChromeDriver compatible with the version of the browser being used. Selenium is also used to load the webpage and retrieve its source.
BeautifulSoup is needed to parse the HTML of the webpage. The re module is imported in order to use a regular expression to match the user's input keyword. Pandas is used to write the keyword, the matches found, and the number of occurrences into an Excel file.
Step 2: Obtain the version of ChromeDriver compatible with the browser being used.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Step 3: Take the user’s input for the URL of a webpage to scrape.
val = input("Enter a url:")
wait = WebDriverWait(driver, 10)
driver.get(val)
get_url = driver.current_url
wait.until(EC.url_to_be(val))
if get_url == val:
    page_source = driver.page_source
For this example, the user input is: https://www.browserstack.com/guide/cross-browser-testing-on-wix-websites
The driver loads this URL, and a wait is applied before proceeding to the next step to ensure that the page has finished loading. The page source is then captured once the current URL matches the one entered.
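If waiting on the URL proves unreliable (for example, on pages that redirect), one alternative, sketched below and not part of the original script, is to wait for a specific element such as the <body> tag to be present before reading the page source:

from selenium.webdriver.common.by import By

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, "body")))  # wait until the <body> element exists
page_source = driver.page_source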
Step 4: Use BeautifulSoup to parse the HTML scraped from the webpage.
soup = BeautifulSoup(page_source, features="html.parser")
A soup object is created from the HTML scraped from the webpage.
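A few optional sanity checks can confirm the page parsed as expected; the exact output naturally depends on the page that was scraped:

print(soup.title.string)        # text of the page's <title> tag
print(len(soup.find_all("a")))  # number of links found on the page
print(soup.get_text()[:200])    # first 200 characters of the page's visible text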
Step 5: Parse the soup for User Input Keywords.
multiple = input("Would you like to enter multiple keywords?(Y/N)")
if multiple == "Y":
    keywords = []
    matches = []
    len_match = []
    num_keyword = input("How many keywords would you like to enter?")
    count = int(num_keyword)
    while count != 0:
        keyword = input("Enter a keyword to find instances of in the article:")
        keywords.append(keyword)
        match = soup.body.find_all(string=re.compile(keyword))
        matches.append(match)
        len_match.append(len(match))
        count -= 1
    df = pd.DataFrame({"Keyword": pd.Series(keywords), "Number of Matches": pd.Series(len_match), "Matches": pd.Series(matches)})
elif multiple == "N":
    keyword = input("Enter a keyword to find instances of in the article:")
    matches = soup.body.find_all(string=re.compile(keyword))
    len_match = len(matches)
    df = pd.DataFrame({"Keyword": pd.Series(keyword), "Number of Matches": pd.Series(len_match), "Matches": pd.Series(matches)})
else:
    print("Error, invalid character entered.")
A user input is taken to determine whether the webpage needs to be searched for multiple keywords. If it does, multiple keyword inputs are taken from the user, matches are parsed from the soup object, and the number of matches is counted. If the user doesn't want to search for multiple keywords, the same operations are performed for a single keyword. In both cases the results are stored in a DataFrame. If an invalid character is entered, an error message is displayed instead.
Step 6: Store the data collected into an Excel file.
df.to_excel("Keywords.xlsx", index=False)
driver.quit()
The DataFrame is written to an Excel file titled Keywords.xlsx, and the driver is then quit. Note that pandas relies on the openpyxl engine to write .xlsx files, so install it with pip install openpyxl if it is not already present.
Output:
The keywords, the matches found for each keyword, and the number of matches can be viewed in the resulting Excel file.
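To double-check the contents, the file can be read back with pandas (a small optional check, not part of the original script):

import pandas as pd

results = pd.read_excel("Keywords.xlsx")
print(results)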
Web Scraping Ethically
Although web scraping is generally legal, it can raise ethical and legal issues. Copyright infringement and downloading information that is clearly meant to be private are ethical violations. Many academic journals and newspapers, for example, require paid subscriptions from users who wish to access their content; downloading those articles and papers is a violation and could lead to serious consequences. Web scraping can also cause other problems, such as overloading a server with requests and making the site slow down or even run out of resources and crash.
Therefore, it's vital to communicate with publishers or website owners to ensure that you're not violating any policies or rules while scraping their content.
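As a simple precaution (one possible check, not a legal safeguard), Python's built-in urllib.robotparser can be used to verify whether a site's robots.txt permits fetching a given page before you scrape it:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.browserstack.com/robots.txt")
rp.read()

url = "https://www.browserstack.com/guide/cross-browser-testing-on-wix-websites"
print(rp.can_fetch("*", url))  # True if a generic user agent is allowed to fetch this URL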