Web Scraper Using Python and BeautifulSoup

By Raman Kumar

Updated on Sep 07, 2024

In this tutorial, we'll explain how to build a web scraper using Python and BeautifulSoup.

Introduction

Web scraping is the process of extracting data from websites. Python, with its powerful libraries, makes it easy to create web scrapers. In this tutorial, we'll explore how to build a simple web scraper using Python and the BeautifulSoup library. We'll learn how to handle HTML elements and extract the desired data.

Prerequisites

Before we begin, make sure you have the following installed:

  • Python (version 3.6+)
  • BeautifulSoup (for parsing HTML)
  • Requests (for making HTTP requests)
  • Basic Python language knowledge

You can install the required libraries using pip:

pip install beautifulsoup4 requests

Step 1: Setting Up the Environment

First, import the necessary libraries:

import requests
from bs4 import BeautifulSoup

  • requests: Used to make HTTP requests to fetch web pages.
  • BeautifulSoup: Helps parse the HTML content and extract the required data.

Step 2: Fetching the Webpage Content

Next, we'll fetch the content of the webpage. For demonstration purposes, let's scrape a webpage that contains some simple data like blog posts or news articles.

Here, we’ll use a test website (http://example.com). Replace this with any website you wish to scrape.

# URL of the webpage to scrape
url = 'http://example.com'

# Fetch the content of the webpage
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    webpage_content = response.text
else:
    print("Failed to retrieve the webpage")

In this step:

We send a GET request to the website using requests.get().
We store the webpage content in the webpage_content variable if the request was successful.
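
In practice, requests can hang on slow servers or fail with an exception rather than a bad status code. The following sketch is a slightly more defensive version of the same fetch; the User-Agent string and the 10-second timeout are arbitrary example values, not requirements:

import requests

url = 'http://example.com'

# Identify the scraper politely and avoid waiting forever on a slow server
headers = {'User-Agent': 'my-simple-scraper/0.1'}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise an exception for 4xx/5xx responses
    webpage_content = response.text
except requests.RequestException as exc:
    raise SystemExit(f"Failed to retrieve the webpage: {exc}")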

Step 3: Parsing the HTML

Once we have the HTML content, we use BeautifulSoup to parse it. We create a BeautifulSoup object to navigate the HTML structure:

# Parse the HTML content
soup = BeautifulSoup(webpage_content, 'html.parser')

# Print the parsed HTML in a readable format
print(soup.prettify())

The soup.prettify() method makes the HTML more readable, allowing us to understand the structure of the page.
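
Besides printing the whole document, you can navigate the parsed tree directly. As a quick sketch (assuming the page actually contains a <title> and at least one <p> tag):

# The page title, if present
if soup.title:
    print(soup.title.text)

# The first paragraph on the page, if present
first_paragraph = soup.find('p')
if first_paragraph:
    print(first_paragraph.text)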

Step 4: Extracting Specific Data

Let’s assume we want to extract all the headlines (in <h2> tags) from the webpage. To do this, we’ll use BeautifulSoup's methods to find the tags:

# Find all the h2 tags (headlines) in the page
headlines = soup.find_all('h2')

# Extract and print the text of each headline
for headline in headlines:
    print(headline.text)

In this step:

soup.find_all('h2') retrieves all the <h2> elements from the page.

We loop through each element and print the text using headline.text.
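
BeautifulSoup also supports CSS selectors via the select() method, which some people find more readable. This is an equivalent way to grab the same headlines, using get_text(strip=True) to trim surrounding whitespace:

# Equivalent extraction using a CSS selector
for headline in soup.select('h2'):
    print(headline.get_text(strip=True))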

Step 5: Extracting Links from the Page

Often, you may need to extract links (<a> tags) from a webpage. BeautifulSoup makes it simple:

# Find all the links (anchor tags) in the page
links = soup.find_all('a')

# Extract and print the URLs (href attribute) of each link
for link in links:
    href = link.get('href')
    print(href)

This code finds all <a> tags on the page and prints the href attribute of each tag, which contains the URL of the link.
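
Keep in mind that href values are often relative (for example, /about). If you need absolute URLs, you can resolve them against the page URL with urllib.parse.urljoin from the standard library; a minimal sketch:

from urllib.parse import urljoin

for link in soup.find_all('a'):
    href = link.get('href')
    if href:  # some anchor tags have no href attribute
        print(urljoin(url, href))  # resolve relative links against the page URL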

Step 6: Handling Nested HTML Elements

Sometimes, the information you need is nested within other elements. Let’s assume we want to extract blog post titles and their summaries, which are stored in <div> tags with the class name post.

# Find all div tags with class 'post'
posts = soup.find_all('div', class_='post')

# Loop through each post and extract the title and summary
for post in posts:
    title = post.find('h2').text
    summary = post.find('p').text
    print(f'Title: {title}')
    print(f'Summary: {summary}')
    print('---')

Here, we:

Look for all <div> elements with the class post.
Inside each post, we extract the title from an <h2> tag and the summary from a <p> tag.
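
If some posts are missing an <h2> or <p> tag, post.find() returns None and accessing .text raises an AttributeError. A slightly more defensive version of the loop might look like this (the fallback strings are arbitrary placeholders):

for post in posts:
    title_tag = post.find('h2')
    summary_tag = post.find('p')

    # Fall back to a placeholder if a tag is missing
    title = title_tag.text if title_tag else 'No title'
    summary = summary_tag.text if summary_tag else 'No summary'

    print(f'Title: {title}')
    print(f'Summary: {summary}')
    print('---')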

Step 7: Saving the Scraped Data

After scraping the data, you might want to store it for further analysis. 

You can save the extracted data into a CSV file using Python’s csv module:

import csv

# Open a CSV file to save the data
with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Summary'])

    # Write each post's title and summary to the CSV file
    for post in posts:
        title = post.find('h2').text
        summary = post.find('p').text
        writer.writerow([title, summary])

In this step, we:

Open a CSV file named scraped_data.csv in write mode.
Write the extracted titles and summaries into the CSV file.
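
To confirm the file was written as expected, you can read it back with the same csv module; a quick check:

import csv

with open('scraped_data.csv', 'r', newline='', encoding='utf-8') as file:
    for row in csv.reader(file):
        print(row)  # each row is a list, e.g. ['Title', 'Summary']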

Step 8: Save Scraped Data to a Database

Let's modify the previous example to store the scraped data (titles and summaries) in an SQLite database.

8.1 Set Up SQLite

SQLite support ships with Python's standard library (the sqlite3 module), so there is nothing extra to install. If you're using another database, such as MySQL or PostgreSQL, you'll need the corresponding driver library (e.g., pymysql or psycopg2).
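
For reference, switching to PostgreSQL would mainly change the connection line, since both drivers follow Python's DB-API. The sketch below assumes you have installed psycopg2-binary, and the connection details are placeholders to replace with your own; note that psycopg2 uses %s placeholders in queries instead of SQLite's ?:

import psycopg2  # pip install psycopg2-binary

# Placeholder connection details -- replace with your own
conn = psycopg2.connect(
    dbname='scraped_data',
    user='postgres',
    password='your_password',
    host='localhost',
)
cursor = conn.cursor()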

8.2 Import SQLite and Create a Database Table

First, we’ll import SQLite and create a table to store the scraped data:

import sqlite3

# Connect to SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('scraped_data.db')
cursor = conn.cursor()

# Create a table to store the data
cursor.execute('''
    CREATE TABLE IF NOT EXISTS posts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        summary TEXT NOT NULL
    )
''')

# Commit the changes
conn.commit()

8.3 Insert Scraped Data into the Database

Next, after scraping the data, we’ll insert it into the database.

# Assuming 'posts' is the scraped data from previous steps
for post in posts:
    title = post.find('h2').text
    summary = post.find('p').text

    # Insert the data into the database
    cursor.execute('''
        INSERT INTO posts (title, summary) VALUES (?, ?)
    ''', (title, summary))

# Commit the transaction
conn.commit()

# Close the connection
conn.close()

This code:

Opens a connection to the SQLite database.
Inserts each scraped post (title and summary) into the posts table.
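
You can quickly verify the inserts by querying the table back:

import sqlite3

conn = sqlite3.connect('scraped_data.db')
cursor = conn.cursor()

# Count the stored posts and show the first few rows
cursor.execute('SELECT COUNT(*) FROM posts')
print('Rows stored:', cursor.fetchone()[0])

cursor.execute('SELECT id, title FROM posts LIMIT 5')
for row in cursor.fetchall():
    print(row)

conn.close()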

Step 9: Send Scraped Data to a REST API

Now that the data is saved in the database, you can also send it to a REST API using a POST request. Let’s assume we have a REST API endpoint at https://example.com/api/posts.

9.1 Install requests

If you don’t have requests installed yet, you can install it with:

pip install requests

9.2 Sending Data via a POST Request

Here’s how to send the scraped data to the REST API:

import requests

# URL of the REST API endpoint
api_url = 'https://example.com/api/posts'

# Loop through the scraped data and send it to the API
for post in posts:
    title = post.find('h2').text
    summary = post.find('p').text

    # Data to send in the POST request
    data = {
        'title': title,
        'summary': summary
    }

    # Send a POST request to the API
    response = requests.post(api_url, json=data)

    # Check the response status
    if response.status_code == 201:  # 201 Created
        print(f"Successfully posted: {title}")
    else:
        print(f"Failed to post: {title}. Status code: {response.status_code}")

This code:

Loops through the scraped data.
Sends the title and summary fields as a JSON payload to the API.
Checks if the API response is successful (HTTP status code 201).
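
Real APIs usually require authentication and can be slow or temporarily unavailable. The sketch below is one way to harden the same loop: it reuses a session, sends a bearer token, and sets a timeout. The header name and token are placeholders for whatever your API actually expects:

import requests

api_url = 'https://example.com/api/posts'

session = requests.Session()
session.headers.update({'Authorization': 'Bearer YOUR_API_TOKEN'})  # placeholder token

for post in posts:
    data = {
        'title': post.find('h2').text,
        'summary': post.find('p').text,
    }

    try:
        response = session.post(api_url, json=data, timeout=10)
        response.raise_for_status()  # treat any 4xx/5xx response as a failure
        print(f"Successfully posted: {data['title']}")
    except requests.RequestException as exc:
        print(f"Failed to post: {data['title']} ({exc})")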

Conclusion

In this tutorial, we’ve covered the basics of web scraping using Python and BeautifulSoup. You’ve learned how to:

  • Fetch webpage content using requests.
  • Parse and navigate HTML with BeautifulSoup.
  • Extract specific data such as headlines, links, and more.

Web scraping is a powerful tool for gathering data from the web. However, always ensure you're abiding by a website’s robots.txt file and terms of service before scraping.
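
Python's standard library includes urllib.robotparser, which you can use to check whether a page may be crawled before fetching it; a minimal sketch:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('http://example.com/robots.txt')
robots.read()

# Check whether a generic crawler is allowed to fetch the page
if robots.can_fetch('*', 'http://example.com/'):
    print('Allowed to scrape this URL')
else:
    print('Disallowed by robots.txt')

Happy scraping!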

Check out our dedicated servers and KVM VPS plans.