Web Scraping with Playwright

Discover how Playwright simplifies web scraping. Automate data extraction with speed, reliability, and ease.


Today, data drives everything, but how do you extract what matters? Web scraping is key to gathering valuable insights from complex datasets.

One of the most popular tools for web scraping is Playwright, an open-source browser automation framework developed by Microsoft.

Overview

How does Playwright help in Web Scraping?

  • Automates Browsing
  • Handles Dynamic Content
  • Bypasses Anti-Scraping Measures
  • Extracts Data Easily
  • Supports Headless Mode
  • Manages Sessions & Cookies
  • Cross-Browser Support

Playwright’s Web Scraping Capabilities:

  • Navigating Web Pages with Playwright
  • Locating Elements
  • Scraping Text
  • Scraping Images
  • Handling Dynamic Content
  • Interacting with Web Pages
  • Handling Authentication and Sessions
  • Downloading and Uploading Files
  • Handling AJAX Requests and APIs
  • Running Playwright with Headless Browsers

This blog covers Playwright-based scraping, its key concepts and workflow, and compares it with other popular tools like Selenium and Puppeteer.

What is Web Scraping?

Web scraping is the process of extracting data from websites. This data can range from text and images to entire databases, and it is commonly used in research, data analysis, and competitive intelligence.

In web scraping, scripts automatically access web pages, retrieve data, and store it in a structured format, such as a CSV or database.
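As a minimal sketch of that last step, scraped records can be written to a CSV file with Python's standard library. The records list here is hypothetical sample data standing in for whatever a scraper collected:

```python
import csv

def save_to_csv(rows, path):
    """Write a list of dicts (one per scraped record) to a CSV file."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical records a scraper might have collected
records = [
    {"title": "Example Domain", "url": "https://example.com"},
    {"title": "IANA", "url": "https://www.iana.org"},
]
save_to_csv(records, "scraped.csv")
```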

Why is Web Scraping done?

Web scraping serves a variety of purposes, such as:

  • Data extraction for research and analysis: Scraping allows businesses and individuals to gather large datasets from publicly available web pages.
  • Price monitoring: E-commerce platforms use scraping to track competitor pricing.
  • Market research: Scraping provides insights into trends, product performance, and consumer sentiment.
  • SEO analysis: Web scraping helps in analyzing keyword usage and content performance across different websites.

What is Playwright?

Playwright is an open-source browser automation framework developed by Microsoft. It is designed to automate web interactions and browser tasks, and it gives fine-grained control over everything from page interactions to network activity, making it a powerful tool for web scraping.

It works across multiple browsers, including Chromium, Firefox, and WebKit, making it an efficient solution for both testing and scraping.

Installation

To work with Playwright, install the Playwright library. Here’s how to initiate the installation:

1. Python:

a) Install Playwright via pip:

pip install playwright

b) Then, install the necessary browser binaries:

python -m playwright install

2. Node.js:

a) Use npm to install Playwright:

npm install playwright

b) After installation, install the required browser binaries:

npx playwright install

Setup

Once Playwright is installed, you can write scripts to automate browsers. It works with both Python and JavaScript (Node.js).

Python:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    browser.close()

Node.js:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();

Playwright vs Selenium vs Puppeteer

Here is how Playwright, Selenium, and Puppeteer compare across key features:

| Aspect | Playwright | Selenium | Puppeteer |
| --- | --- | --- | --- |
| Browser Support | Chromium, Firefox, WebKit | Chromium, Firefox, Safari | Chromium |
| Language Support | Python, Node.js, C#, Java | Python, Java, C#, Ruby, JavaScript | Node.js |
| Headless Mode | Yes | Yes | Yes |
| Speed | Fast | Slower | Fast |
| API Access | More advanced | Basic | Advanced (but limited) |

How does Playwright help in Web Scraping?

Here is how Playwright helps in web scraping:

  • Automates Browsing: Simulates real user interactions like clicking, typing, and form submission.
  • Handles Dynamic Content: Waits for AJAX-loaded elements to appear before extracting data.
  • Bypasses Anti-Scraping Measures: Supports proxy rotation, user-agent spoofing, and integration with CAPTCHA-solving services.
  • Extracts Data Easily: Retrieves text, images, and attributes using powerful selectors.
  • Supports Headless Mode: Runs in headless browsers for faster and stealthier scraping.
  • Manages Sessions & Cookies: Maintains authentication and session states for scraping logged-in pages.
  • Cross-Browser Support: Works with Chromium, Firefox, and WebKit for better compatibility.

Steps to perform Web Scraping using Playwright

Here are the general steps for web scraping with Playwright:

Step 1: Install Playwright

Python:

pip install playwright

python -m playwright install

Node.js:

npm install playwright

npx playwright install

Step 2: Initialize a Browser Instance

Python:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')

Node.js:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
})();

Step 3: Interact with the Web Page

For performing actions like clicking a button:

Python:

page.click('button#loadMore')

Node.js:

await page.click('button#loadMore');

Step 4: Extract Data from the Web Page

To scrape all the headings (<h1>) on a page:

Python:

headings = page.query_selector_all('h1')
for heading in headings:
    print(heading.inner_text())

Node.js:

const headings = page.locator('h1');
const count = await headings.count();
for (let i = 0; i < count; i++) {
  console.log(await headings.nth(i).innerText());
}

Step 5: Extracting Multiple Elements

To scrape all the links on a page:

Python:

links = page.query_selector_all('a')
for link in links:
    print(link.get_attribute('href'))

Node.js:

const links = page.locator('a');
const count = await links.count();
for (let i = 0; i < count; i++) {
  console.log(await links.nth(i).getAttribute('href'));
}

Playwright’s Web Scraping Capabilities

1. Navigating Web Pages with Playwright

One of the simplest tasks with Playwright is navigating to a webpage and performing actions like clicking links or filling out forms.

Python Example:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    page.click("a#next")
    browser.close()

Node.js Example:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.click('a#next');
  await browser.close();
})();

2. Locating Elements

There are various methods to locate elements, such as page.query_selector() in Python or page.locator() in Node.js.

Python:

element = page.query_selector('div.content')

print(element.inner_text())

Node.js:

const element = page.locator('div.content');

console.log(await element.innerText());

3. Scraping Text

Python:

text = page.query_selector('h1').inner_text()

print(text)

Node.js:

const text = await page.locator('h1').innerText();

console.log(text);

4. Scraping Images

Python:

image_url = page.query_selector('img').get_attribute('src')

print(image_url)

Node.js:

const imageUrl = await page.locator('img').getAttribute('src');

console.log(imageUrl);

5. Handling Dynamic Content

Python:

page.wait_for_selector('div.dynamic-content')
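Conceptually, wait_for_selector polls the page until the element appears or a timeout expires. The same pattern can be sketched as a generic plain-Python polling helper (a simplified illustration, not Playwright's actual implementation):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Example: a condition that only becomes true after a short delay
start = time.monotonic()
wait_until(lambda: time.monotonic() - start > 0.3)
```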

6. Interacting with Web Pages

Python:

page.fill('input[name="username"]', 'test_user')

page.click('button[type="submit"]')

7. Handling Authentication and Sessions

Python:

# HTTP Basic Auth is configured on the browser context, not on page.goto()
context = browser.new_context(http_credentials={"username": "user", "password": "pass"})
page = context.new_page()
page.goto("https://example.com")

8. Downloading and Uploading Files

Python:

# Download a file: wait for the download event triggered by the click
with page.expect_download() as download_info:
    page.click('a#download')
download_info.value.save_as('path/to/save/file')

# Upload a file
page.set_input_files('input[type="file"]', 'path/to/file')

9. Handling AJAX Requests and APIs

Python:

page.route('**/*', lambda route: route.continue_())
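Often the fastest route is to read the JSON a page fetches from its backend instead of parsing rendered HTML. Below is a hedged sketch: the predicate is plain Python, the endpoint patterns are hypothetical, and the commented lines show roughly how it would hook into Playwright's response event:

```python
def is_api_response(url, patterns=("/api/", ".json")):
    """Heuristic: does this response URL look like a backend data endpoint?"""
    return any(p in url for p in patterns)

# In a real Playwright script you would register it roughly like this:
#   page.on("response", lambda resp: handle(resp) if is_api_response(resp.url) else None)
# where handle() might call resp.json() to grab the payload directly.

print(is_api_response("https://example.com/api/products?page=2"))  # True
print(is_api_response("https://example.com/styles/main.css"))      # False
```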

10. Running Playwright with Headless Browsers

Python:

browser = p.chromium.launch(headless=True)

Bypassing Anti-Scraping Mechanisms Using Playwright

Websites use different anti-scraping mechanisms to prevent automated bots from accessing their data. Playwright provides several strategies to bypass these, including:

  • Using Randomized User Agents: Playwright allows you to change user agents dynamically to mimic real users and avoid detection.
  • Emulating Mobile Devices: It can simulate different mobile devices, screen sizes, and touch inputs to blend in with genuine traffic.
  • Managing IP Rotation and Proxy Handling: Playwright supports proxy servers and IP rotation to prevent IP bans and reduce the chances of being blocked.
  • Handling CAPTCHAs (via integration with services like 2Captcha): Integrate CAPTCHA-solving services to automate solving challenges and continue scraping uninterrupted.
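The first and third points can be sketched together. The user-agent strings and proxy address below are placeholders, and in a real script the returned dict would be unpacked into browser.new_context(**options):

```python
import random

USER_AGENTS = [
    # Placeholder desktop user-agent strings; use current, realistic ones in practice
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def context_options(proxy_server=None):
    """Build kwargs for browser.new_context() with a randomized user agent."""
    options = {"user_agent": random.choice(USER_AGENTS)}
    if proxy_server:  # e.g. "http://127.0.0.1:8080" (placeholder address)
        options["proxy"] = {"server": proxy_server}
    return options

# Usage in a real script (sketch):
#   context = browser.new_context(**context_options("http://127.0.0.1:8080"))
opts = context_options()
```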


Best Practices for Playwright Web Scraping

Here are some of the best practices for Playwright Web Scraping:

  • Check for the robots.txt: Always check if the website allows scraping.
  • Be Polite through delays: Introduce delays between requests to avoid overloading servers.
  • Rotate IPs: Use proxies if scraping large amounts of data.
  • Monitor for Changes: Websites often update their structure, so ensure that scraping logic can handle such changes.
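The first two practices can be sketched together: urllib.robotparser (standard library) checks robots.txt rules, and a jittered delay spaces out requests. The robots.txt lines here are a hypothetical example parsed from a string rather than fetched over the network:

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from https://<site>/robots.txt
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def can_scrape(url, agent="*"):
    """Check whether the robots.txt rules allow fetching this URL."""
    return robots.can_fetch(agent, url)

def polite_delay(base=1.0, jitter=0.5):
    """Sleep for base plus a random jitter so requests are not perfectly regular."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

print(can_scrape("https://example.com/products"))   # True
print(can_scrape("https://example.com/private/x"))  # False
```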

Why test Playwright Scripts on Real Devices?

Testing Playwright scripts on real devices is important to ensure accurate and reliable performance across different platforms and environments.

This provides a true representation of how scripts behave in production, as simulators and emulators may not fully replicate the performance, interactions, or limitations of actual hardware.

  • Accurate rendering: Real devices offer precise behavior in terms of rendering and UI interactions.
  • Performance validation: Testing on real devices shows how scripts actually perform under real-world conditions.
  • Compatibility checks: Verify that the scripts work perfectly across different operating systems, screen sizes, and resolutions.


Why choose BrowserStack to run Playwright Tests?

BrowserStack provides a cloud-based platform that allows running Playwright tests on real devices and browsers, offering several key advantages to improve testing outcomes and test reliability.

Here’s why to consider using BrowserStack Automate for running Playwright tests:

  • Real Devices: With BrowserStack Automate, run your Playwright scripts on a wide range of real devices, ensuring accurate testing across different operating systems, screen sizes, and resolutions.
  • Parallel Test Execution: BrowserStack supports parallel test execution, allowing you to run multiple Playwright tests across different devices and browsers to improve efficiency.
  • CI/CD Integration: Easily integrate BrowserStack with CI/CD pipelines (for example, Jenkins, CircleCI, GitHub Actions). This helps get faster feedback and improves software quality.
  • No In-house Device Maintenance/Cost: By using BrowserStack’s cloud infrastructure, there’s no need to maintain a physical lab of devices or handle the associated costs.

Conclusion

Playwright is a powerful tool for web scraping, capable of handling dynamic content, bypassing anti-scraping mechanisms, and interacting with websites just like a real user.

For a smoother and more efficient scraping experience, BrowserStack Automate lets you test Playwright scripts on real devices, with parallel execution and seamless CI/CD integration.


Tags
Automation Testing Playwright Website Testing