# Web Scraping Basics with BeautifulSoup
Web scraping is like sending a robot to read websites and copy information for you. Suppose you want to collect every price from an online store: instead of copying each one by hand, you write a program that extracts them automatically.
More precisely, **web scraping** is the automated process of extracting data from websites by downloading web pages and parsing their HTML. Python has two widely used libraries for this: `requests` fetches web pages, and `BeautifulSoup` (from the `bs4` package) parses HTML and extracts specific elements.
## Installation
Before scraping, install the required libraries:
```bash
pip install requests beautifulsoup4
```
## Basic Workflow
1. Send an HTTP request to a URL using `requests.get()`.
2. Check the response status code (200 means success).
3. Parse the HTML content with `BeautifulSoup(response.text, 'html.parser')`.
4. Use BeautifulSoup methods (`.find()`, `.find_all()`, CSS selectors) to locate the desired data.
5. Extract text, attributes, or nested elements.
6. Store the data (list, dictionary, CSV, JSON, or database).
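The six steps above can be sketched as one short script. This is a minimal sketch rather than a definitive implementation: the helper names `fetch_html`, `extract_titles`, and `to_json`, the `h2.title` markup, and the 10-second timeout are assumptions chosen for illustration. The demo parses inline HTML, so no request is actually sent.

```python
import json

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 1-2: request the page and raise on a bad status code."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_titles(html):
    """Steps 3-5: parse the HTML, locate elements, and pull out their text."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="title")]


def to_json(titles):
    """Step 6: store the data, here as a JSON string."""
    return json.dumps({"titles": titles})


# Offline demo: the HTML is inline, so nothing is fetched.
# Against a real site you would call extract_titles(fetch_html(url)) instead.
demo_html = (
    '<div><h2 class="title">Python Programming</h2>'
    '<h2 class="title">Web Development</h2></div>'
)
titles = extract_titles(demo_html)
print(to_json(titles))
```

Keeping the fetch and parse steps in separate functions makes the parsing logic easy to test against saved HTML without touching the network.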
## BeautifulSoup Basics
### Finding Elements
- `soup.find('tag')` – returns the first matching element.
- `soup.find_all('tag')` – returns a list of all matching elements.
- `soup.find('tag', class_='className')` – find by class (note the underscore).
- `soup.find('tag', id='idName')` – find by id.
- `soup.select('css.selector')` – use CSS selectors (e.g., `div.book > h2`).
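The lookup methods listed above can be compared side by side on a small page. The HTML below is made up for illustration; only the BeautifulSoup calls themselves come from the list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to exercise the lookup methods
html = """
<div id="catalog">
  <div class="book"><h2 class="title">Python Programming</h2></div>
  <div class="book"><h2 class="title">Data Science Basics</h2></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_book = soup.find("div", class_="book")     # first match only
all_books = soup.find_all("div", class_="book")  # every match
catalog = soup.find("div", id="catalog")         # lookup by id
titles = soup.select("div.book > h2")            # CSS selector
missing = soup.find("table")                     # no match returns None

print(first_book.h2.text)
print(len(all_books))
print(catalog["id"])
print([t.text for t in titles])
print(missing)
```

Because `find()` returns `None` when nothing matches, it is worth checking the result before reading `.text` or an attribute from it.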
### Extracting Data
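- `.text` or `.get_text()` – gets the text inside an element and all of its descendants (`.get_text(strip=True)` also trims surrounding whitespace).
- `['attribute']` – gets the value of an attribute (e.g., `a['href']`); it raises `KeyError` if the attribute is missing, whereas `.get('attribute')` returns `None`.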
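A short sketch of the difference between these accessors, using a one-line HTML fragment invented for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a link whose text is padded with spaces
html = '<p><a href="/book/web-dev">  Web Development  </a></p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

raw_text = link.text                    # keeps the surrounding spaces
clean_text = link.get_text(strip=True)  # trims them
href = link["href"]                     # would raise KeyError if absent
title_attr = link.get("title")          # absent attribute returns None

print(repr(raw_text))
print(repr(clean_text))
print(href)
print(title_attr)
```

Using `.get()` is the safer choice when an attribute is optional, since a missing key will not interrupt the scrape.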
### Navigating the Parse Tree
- `.parent` – get the parent element.
- `.children` – iterate over direct children.
- `.next_sibling` / `.previous_sibling` – navigate between siblings.
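The navigation attributes above can be tried on a small fragment. One detail worth seeing in practice: with indented HTML, `.next_sibling` often returns the whitespace between tags rather than the next element, so this sketch also uses `find_next_sibling()`, which skips text nodes. The markup is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with the usual newline/indent whitespace between tags
html = """
<div class="book">
  <h2 class="title">Fluent Python</h2>
  <p class="author">Luciano Ramalho</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h2")

parent_classes = title.parent["class"]  # class is multi-valued, so a list
# .children yields whitespace strings too; keep only real tags
child_tags = [c.name for c in title.parent.children if isinstance(c, Tag)]
raw_sibling = title.next_sibling        # the whitespace after </h2>
author = title.find_next_sibling("p")   # next sibling that is a <p> tag

print(parent_classes)
print(child_tags)
print(repr(raw_sibling))
print(author.text)
```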
## Handling Dynamic Content
Some websites load data dynamically with JavaScript (e.g., React, Angular). BeautifulSoup cannot execute JavaScript; for those cases, you need tools like **Selenium** or **Playwright** that control a real browser.
## Respecting Website Rules
- Always check `robots.txt` (e.g., `https://example.com/robots.txt`) to see which paths are allowed.
- Set a custom `User-Agent` header to identify your bot.
- Add delays between requests (`time.sleep()`) to avoid overwhelming the server.
- Read the website's terms of service – some prohibit scraping.
- Use official APIs whenever available – they are more reliable and polite.
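The first three rules can be partly automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up `robots.txt` offline; the bot name `MyScraperBot/1.0`, the rules, and the 1-second fallback delay are assumptions for illustration. Against a live site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for the demo
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]
USER_AGENT = "MyScraperBot/1.0"  # assumed bot name

parser = RobotFileParser()
parser.parse(robots_lines)


def allowed_urls(urls):
    """Return only the URLs that robots.txt permits for USER_AGENT."""
    return [url for url in urls if parser.can_fetch(USER_AGENT, url)]


def polite_delay():
    """Honour Crawl-delay if the site sets one, else wait 1 second."""
    return parser.crawl_delay(USER_AGENT) or 1


urls = ["https://example.com/books/", "https://example.com/private/data"]
for url in allowed_urls(urls):
    print(f"OK to fetch {url} after waiting {polite_delay()}s")
    # With real requests you would now call time.sleep(polite_delay()) and
    # requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```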
## Common HTTP Status Codes
- `200` – OK
- `301` / `302` – redirect (`requests` follows them by default)
- `404` – Not Found
- `403` – Forbidden (you may need headers or authentication)
- `429` – Too Many Requests (slow down)
- `500` – Internal Server Error (server issue, retry later)
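One way a scraper might act on these codes is sketched below. The function name `next_action` and the policy it encodes are illustrative assumptions, not a standard; adjust the mapping to the site you are working with.

```python
def next_action(status_code):
    """Map an HTTP status code to what a scraper might do next."""
    if status_code == 200:
        return "parse"
    if status_code in (429, 500):
        return "retry later"  # back off, then try again
    if status_code in (403, 404):
        return "skip"         # retrying unchanged will not help
    return "inspect"          # anything else: look at it manually


for code in (200, 404, 429):
    print(code, "->", next_action(code))
```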
## Error Handling in Scraping
Wrap your requests and parsing in `try`-`except` blocks to handle network errors, missing elements, or malformed HTML. Use `response.raise_for_status()` to raise an exception for bad status codes.
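A minimal sketch of both kinds of protection, assuming two hypothetical helpers: `fetch_html` guards the network call, and `text_or_default` guards against missing elements. The demo stays offline: a URL without a scheme is rejected by `requests` before any connection is attempted.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Return the page HTML, or None if the request fails."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # turns 4xx/5xx into HTTPError
        return response.text
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors
        print(f"Request failed for {url!r}: {exc}")
        return None


def text_or_default(parent, selector, default="N/A"):
    """Return the stripped text of the first CSS match, or a default."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element else default


# Offline checks: an invalid URL, and HTML with a title but no price
page = fetch_html("not-a-valid-url")
soup = BeautifulSoup('<div class="book"><h2>Fluent Python</h2></div>', "html.parser")
title = text_or_default(soup, "h2")
price = text_or_default(soup, "span.price")
print(page, title, price)
```

Catching `requests.exceptions.RequestException` rather than a bare `Exception` keeps programming errors visible while still handling every failure that `requests` itself reports.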
## Storing Scraped Data
- **CSV** – good for spreadsheets.
- **JSON** – great for nested data.
- **SQLite** – for larger datasets.
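The three options can be compared on the same records using only the standard library. The sample `books` data is invented for the example, and the sketch writes to an in-memory buffer and an in-memory database so that it leaves no files behind.

```python
import csv
import io
import json
import sqlite3

# Hypothetical scraped records; "tags" is nested data
books = [
    {"title": "Python Cookbook", "price": 49.99, "tags": ["python", "recipes"]},
    {"title": "Fluent Python", "price": 44.99, "tags": ["python"]},
]

# CSV: flat rows only, so the nested tag list is left out here
csv_buffer = io.StringIO()  # stands in for open("books.csv", "w", newline="")
writer = csv.DictWriter(
    csv_buffer, fieldnames=["title", "price"], extrasaction="ignore"
)
writer.writeheader()
writer.writerows(books)

# JSON: keeps the nested tag lists as they are
json_text = json.dumps(books, indent=2)

# SQLite: queryable storage; ":memory:" keeps the demo off disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany("INSERT INTO books (title, price) VALUES (:title, :price)", books)
cheap = conn.execute(
    "SELECT title FROM books WHERE price < ? ORDER BY title", (45,)
).fetchall()
conn.close()

print(csv_buffer.getvalue())
print(cheap)
```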
## Limitations and Risks
- Websites can change their HTML structure – your scraper may break.
- IP blocking – use proxies or respect rate limits.
- Legal issues – only scrape public data and respect copyright.
## Practice Exercises
1. Scrape the titles and prices of books from a sample bookstore page.
2. Extract all links (`<a href>`) from a webpage.
3. Scrape a table (like the product table in example 3) and save to CSV.
4. Build a news headline scraper that collects titles, dates, and summaries.
5. Implement polite scraping with delays and a custom User-Agent.
This lesson provides **8 complete examples** (simulated to avoid live requests) covering basic fetching, extracting data, saving to CSV, advanced techniques, error handling, ethics, and a full book scraper project.
```python
# Web Scraping with BeautifulSoup
import requests
from bs4 import BeautifulSoup
import csv
import time

print("WEB SCRAPING BASICS WITH BEAUTIFULSOUP")
print("=" * 60)

# Note: For actual web scraping, you may need to install:
# pip install requests beautifulsoup4

# Example 1: Basic web page fetching
print("\n1. BASIC WEB PAGE FETCHING")
print("-" * 30)

# Let's use a sample HTML for demonstration
sample_html = '''
<!DOCTYPE html>
<html>
<head>
    <title>Sample Book Store</title>
</head>
<body>
    <h1>Welcome to Our Book Store</h1>
    <div class="book-list">
        <div class="book">
            <h2 class="title">Python Programming</h2>
            <p class="author">John Doe</p>
            <span class="price">$29.99</span>
            <a href="/book/python-programming">Details</a>
        </div>
        <div class="book">
            <h2 class="title">Data Science Basics</h2>
            <p class="author">Jane Smith</p>
            <span class="price">$34.99</span>
            <a href="/book/data-science">Details</a>
        </div>
        <div class="book">
            <h2 class="title">Web Development</h2>
            <p class="author">Bob Johnson</p>
            <span class="price">$24.99</span>
            <a href="/book/web-dev">Details</a>
        </div>
    </div>
    <div class="footer">
        <p>Contact: info@bookstore.com</p>
    </div>
</body>
</html>
'''

# Parse the HTML
soup = BeautifulSoup(sample_html, 'html.parser')
print("Page Title:", soup.title.text)
print("First h1 tag:", soup.h1.text)

# Find all book titles
print("\nBook Titles:")
book_titles = soup.find_all('h2', class_='title')
for i, title in enumerate(book_titles, 1):
    print(f"{i}. {title.text}")
```
```python
# Example 2: Extracting specific data
print("\n\n2. EXTRACTING SPECIFIC DATA")
print("-" * 30)


# Find all books
def extract_book_data(html_content):
    """Extract book information from HTML"""
    soup = BeautifulSoup(html_content, 'html.parser')
    books = []

    # Find all book divs
    book_divs = soup.find_all('div', class_='book')

    for book in book_divs:
        # Extract data with error handling
        title = book.find('h2', class_='title').text if book.find('h2', class_='title') else 'N/A'
        author = book.find('p', class_='author').text if book.find('p', class_='author') else 'N/A'
        price = book.find('span', class_='price').text if book.find('span', class_='price') else 'N/A'
        link = book.find('a')['href'] if book.find('a') else 'N/A'

        books.append({
            'title': title,
            'author': author,
            'price': price,
            'link': link
        })

    return books


books_data = extract_book_data(sample_html)
print("Extracted Book Data:")
print("=" * 50)
for i, book in enumerate(books_data, 1):
    print(f"\nBook {i}:")
    print(f" Title: {book['title']}")
    print(f" Author: {book['author']}")
    print(f" Price: {book['price']}")
    print(f" Link: {book['link']}")
```
```python
# Example 3: Working with real website (with caution)
print("\n\n3. WORKING WITH REAL WEBSITES")
print("-" * 30)
print("Note: Always check robots.txt and terms of service!")
print("Let's use an inline demo page instead of a live site.")

# Inline demo HTML for practice (no request is sent)
demo_html = '''
<html>
<body>
    <table id="products">
        <tr>
            <th>Product</th>
            <th>Price</th>
            <th>Stock</th>
        </tr>
        <tr>
            <td>Laptop</td>
            <td>$999</td>
            <td>In Stock</td>
        </tr>
        <tr>
            <td>Mouse</td>
            <td>$25</td>
            <td>Out of Stock</td>
        </tr>
        <tr>
            <td>Keyboard</td>
            <td>$79</td>
            <td>In Stock</td>
        </tr>
    </table>
</body>
</html>
'''

# Parse table data
table_soup = BeautifulSoup(demo_html, 'html.parser')
table = table_soup.find('table', id='products')

print("\nProduct Table:")
print("-" * 40)
if table:
    rows = table.find_all('tr')
    for row in rows:
        # Header cells are <th>, data cells are <td>
        cols = row.find_all(['th', 'td'])
        row_data = [col.text.strip() for col in cols]
        print(f"{row_data[0]:15} {row_data[1]:10} {row_data[2]}")
else:
    print("Table not found")
```
```python
# Example 4: Saving scraped data to CSV
print("\n\n4. SAVING DATA TO CSV")
print("-" * 30)

# Extract product data (reuses `table` parsed in Example 3)
products = []
if table:
    rows = table.find_all('tr')[1:]  # Skip header row
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 3:
            product = {
                'name': cols[0].text.strip(),
                'price': cols[1].text.strip(),
                'stock': cols[2].text.strip()
            }
            products.append(product)

# Save to CSV
filename = 'products.csv'
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['name', 'price', 'stock']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for product in products:
        writer.writerow(product)

print(f"Saved {len(products)} products to {filename}")
print("\nCSV Content:")
# Read back with the same encoding the file was written in
with open(filename, 'r', encoding='utf-8') as f:
    print(f.read())
```
```python
# Example 5: Advanced scraping techniques
print("\n\n5. ADVANCED SCRAPING TECHNIQUES")
print("-" * 30)

# Complex HTML with nested structure
complex_html = '''
<div class="news-container">
    <article class="news-item featured">
        <h2><a href="/news/1">Python 3.12 Released</a></h2>
        <div class="meta">
            <span class="date">2024-03-15</span>
            <span class="category">Technology</span>
            <span class="views">1,234 views</span>
        </div>
        <p class="summary">The latest version of Python includes new features...</p>
        <ul class="tags">
            <li>Python</li>
            <li>Programming</li>
            <li>Update</li>
        </ul>
    </article>
    <article class="news-item">
        <h2><a href="/news/2">AI Breakthrough</a></h2>
        <div class="meta">
            <span class="date">2024-03-14</span>
            <span class="category">Science</span>
            <span class="views">2,345 views</span>
        </div>
        <p class="summary">Researchers announce new AI model...</p>
        <ul class="tags">
            <li>AI</li>
            <li>Research</li>
            <li>Machine Learning</li>
        </ul>
    </article>
</div>
'''

# Parse complex structure
news_soup = BeautifulSoup(complex_html, 'html.parser')

print("News Articles:")
print("=" * 40)
articles = news_soup.find_all('article', class_='news-item')
for i, article in enumerate(articles, 1):
    # Extract fields with find(), falling back to 'N/A' when optional ones are missing
    title = article.find('h2').text.strip()
    link = article.find('a')['href']
    date = article.find('span', class_='date').text if article.find('span', class_='date') else 'N/A'
    category = article.find('span', class_='category').text if article.find('span', class_='category') else 'N/A'
    summary = article.find('p', class_='summary').text.strip() if article.find('p', class_='summary') else 'N/A'

    # Extract tags
    tags = [tag.text for tag in article.find_all('li')]

    print(f"\nArticle {i}:")
    print(f" Title: {title}")
    print(f" Link: {link}")
    print(f" Date: {date}")
    print(f" Category: {category}")
    print(f" Summary: {summary[:50]}...")
    print(f" Tags: {', '.join(tags)}")

    # Check if featured (the class attribute is a list of class names)
    if 'featured' in article.get('class', []):
        print(" ★ Featured Article")
```
```python
# Example 6: Error handling in web scraping
import random

print("\n\n6. ERROR HANDLING IN WEB SCRAPING")
print("-" * 30)


def safe_scrape(url):
    """Safely scrape a webpage with error handling"""
    try:
        # In real scraping, you would use:
        # response = requests.get(url, headers={'User-Agent': 'Your Bot'})
        # response.raise_for_status()  # Check for HTTP errors

        # For demo, simulate different scenarios
        scenarios = [
            "success",
            "404_error",
            "timeout",
            "parse_error"
        ]
        scenario = random.choice(scenarios)

        if scenario == "success":
            print(f"Successfully fetched {url}")
            # Parse would happen here
            return {"status": "success", "data": "Sample data"}
        elif scenario == "404_error":
            print(f"Error: Page not found (404) for {url}")
            return {"status": "error", "message": "Page not found"}
        elif scenario == "timeout":
            print(f"Error: Request timed out for {url}")
            return {"status": "error", "message": "Request timeout"}
        elif scenario == "parse_error":
            print(f"Error: Could not parse HTML from {url}")
            return {"status": "error", "message": "Parsing failed"}
    except Exception as e:
        print(f"Unexpected error: {e}")
        return {"status": "error", "message": str(e)}


# Test error handling
print("Testing error handling scenarios:")
for i in range(3):
    result = safe_scrape(f"https://example.com/page{i}")
    print(f"Result: {result['status']}")
    if result['status'] == "error":
        print(f" Reason: {result['message']}")
    print()
```
```python
# Example 7: Web scraping etiquette
print("\n\n7. WEB SCRAPING ETIQUETTE")
print("-" * 30)
print("Important rules for ethical web scraping:")
print("1. Check robots.txt (e.g., https://example.com/robots.txt)")
print("2. Respect rate limits (add delays between requests)")
print("3. Identify your bot with User-Agent header")
print("4. Don't overload servers")
print("5. Check website's terms of service")
print("6. Only scrape publicly available data")
print("7. Consider using official APIs if available")


# Example of polite scraping with delay
def polite_scraper(urls):
    """Scrape multiple URLs with delays"""
    scraped_data = []

    for i, url in enumerate(urls):
        print(f"Scraping {url}...")

        # Simulate request
        # response = requests.get(url, headers={
        #     'User-Agent': 'MyScraperBot/1.0 (educational-purpose)'
        # })

        # Add delay to be polite (2-5 seconds between requests)
        if i > 0:
            delay = 3  # seconds
            print(f"Waiting {delay} seconds to be polite...")
            # time.sleep(delay)

        # Process response...
        scraped_data.append({"url": url, "data": f"Data from {url}"})

    return scraped_data
```
```python
# Example 8: Complete web scraping project
print("\n\n8. COMPLETE WEB SCRAPING PROJECT")
print("-" * 30)


class BookScraper:
    """A simple book scraper for demonstration"""

    def __init__(self):
        self.books = []

    def scrape_sample_data(self):
        """Scrape from sample HTML (in real life, this would fetch from URL)"""
        # Sample data representing a bookstore
        html_content = '''
        <div class="books">
            <div class="book">
                <h3>Python Cookbook</h3>
                <p class="author">David Beazley</p>
                <p class="price">$49.99</p>
                <p class="rating">★★★★☆ (4.2/5)</p>
            </div>
            <div class="book">
                <h3>Fluent Python</h3>
                <p class="author">Luciano Ramalho</p>
                <p class="price">$44.99</p>
                <p class="rating">★★★★★ (4.7/5)</p>
            </div>
        </div>
        '''

        soup = BeautifulSoup(html_content, 'html.parser')
        book_divs = soup.find_all('div', class_='book')

        for book_div in book_divs:
            book = {
                'title': book_div.find('h3').text if book_div.find('h3') else 'N/A',
                'author': book_div.find('p', class_='author').text if book_div.find('p', class_='author') else 'N/A',
                'price': book_div.find('p', class_='price').text if book_div.find('p', class_='price') else 'N/A',
                'rating': book_div.find('p', class_='rating').text if book_div.find('p', class_='rating') else 'N/A'
            }
            self.books.append(book)

        return len(self.books)

    def display_books(self):
        """Display all scraped books"""
        print(f"\nFound {len(self.books)} books:")
        print("=" * 50)
        for i, book in enumerate(self.books, 1):
            print(f"\nBook {i}:")
            print(f" Title: {book['title']}")
            print(f" Author: {book['author']}")
            print(f" Price: {book['price']}")
            print(f" Rating: {book['rating']}")

    def save_to_csv(self, filename):
        """Save books to CSV file"""
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['title', 'author', 'price', 'rating']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(self.books)
        print(f"\nSaved {len(self.books)} books to {filename}")

    def filter_by_price(self, max_price):
        """Filter books by maximum price"""
        # Extract numeric price
        affordable_books = []
        for book in self.books:
            # Convert "$49.99" to 49.99
            try:
                price_str = book['price'].replace('$', '').strip()
                price = float(price_str)
                if price <= max_price:
                    affordable_books.append(book)
            except (ValueError, AttributeError):
                continue
        return affordable_books


# Run the scraper
print("Running Book Scraper...")
scraper = BookScraper()
count = scraper.scrape_sample_data()
print(f"Scraped {count} books")
scraper.display_books()

# Save to CSV
scraper.save_to_csv('books.csv')

# Filter books
print("\nBooks under $45:")
affordable = scraper.filter_by_price(45)
for book in affordable:
    print(f"- {book['title']}: {book['price']}")
```