# Web Scraping Basics with BeautifulSoup

Web scraping is like sending a robot to read websites and copy information for you. Imagine you want to collect all the prices from an online store: instead of manually copying each one, you write a program that extracts the data automatically.

More formally, **web scraping** is the automated process of extracting data from websites: a program downloads web pages and parses their HTML to retrieve the data you need. Python offers two widely used libraries for this: `requests` to fetch web pages and `BeautifulSoup` (from the `bs4` package) to parse HTML and extract specific elements.

## Installation
Before scraping, install the required libraries:
```bash
pip install requests beautifulsoup4
```

## Basic Workflow
1. Send an HTTP request to a URL using `requests.get()`.
2. Check the response status code (200 means success).
3. Parse the HTML content with `BeautifulSoup(response.text, 'html.parser')`.
4. Use BeautifulSoup methods (`.find()`, `.find_all()`, CSS selectors) to locate the desired data.
5. Extract text, attributes, or nested elements.
6. Store the data (list, dictionary, CSV, JSON, or database).
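
The six steps above can be sketched as follows. This is a minimal illustration, not a complete scraper: the function names `fetch_html` and `parse_titles`, the 10-second timeout, and the inline sample page are all assumptions made for the example. The inline HTML stands in for a live request so the sketch runs offline.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 1-2: request the page and raise an exception on a bad status code."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_titles(html):
    """Steps 3-5: parse the HTML and pull the text out of every <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


# Step 6: keep the results in a plain list.
# Inline HTML is used here instead of calling fetch_html() on a real URL.
page = "<html><body><h2> First Book </h2><h2>Second Book</h2></body></html>"
titles = parse_titles(page)
print(titles)
```

Keeping fetching and parsing in separate functions makes it easy to test the parsing logic against saved HTML without hitting a server.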

## BeautifulSoup Basics

### Finding Elements
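- `soup.find('tag')` – returns the first matching element, or `None` if nothing matches.
- `soup.find_all('tag')` – returns a list of all matching elements (empty if there are none).
- `soup.find('tag', class_='className')` – find by class (note the trailing underscore, needed because `class` is a reserved word in Python).
- `soup.find('tag', id='idName')` – find by id.
- `soup.select('css.selector')` – returns a list of elements matching a CSS selector (e.g., `div.book > h2`); `soup.select_one()` returns only the first match.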
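
A small sketch of these lookups side by side, run against a made-up HTML fragment (the markup and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

html = """
<div class="book" id="first">
  <h2>Python Programming</h2>
</div>
<div class="book">
  <h2>Data Science Basics</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_book = soup.find("div", class_="book")      # first match only
all_books = soup.find_all("div", class_="book")   # every match, as a list
by_id = soup.find("div", id="first")              # lookup by id attribute
headings = soup.select("div.book > h2")           # CSS selector, returns a list
missing = soup.find("table")                      # no <table> in this fragment

print(len(all_books))
print([h.get_text() for h in headings])
print(missing)
```

Because `find()` can come back empty, it is worth checking the result before accessing `.text` or attributes on it.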

### Extracting Data
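- `.text` or `.get_text()` – gets the text inside an element and all of its descendants; `.get_text(strip=True)` also trims surrounding whitespace.
- `['attribute']` – gets the value of an attribute (e.g., `a['href']`) and raises `KeyError` if the attribute is missing; `.get('attribute')` returns `None` instead.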
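
To make the difference between these accessors concrete, here is a short sketch on an invented fragment with one normal link and one `<a>` tag that has no `href`:

```python
from bs4 import BeautifulSoup

html = '<div><a href="/book/1"> Python Programming </a><a>No link</a></div>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

raw_text = first.text                      # keeps the surrounding spaces
clean_text = first.get_text(strip=True)    # whitespace trimmed
href = first["href"]                       # attribute present: fine
safe_href = second.get("href", "N/A")      # attribute missing: default returned

print(repr(raw_text))
print(clean_text, href, safe_href)
```

Indexing with `second["href"]` would raise `KeyError` here, which is why `.get()` is the safer choice when an attribute is optional.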

### Navigating the Parse Tree
- `.parent` – get the parent element.
- `.children` – iterate over direct children.
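- `.next_sibling` / `.previous_sibling` – navigate between siblings. Whitespace between tags counts as a text node, so the result may be a string rather than a tag; `.find_next_sibling()` and `.find_previous_sibling()` skip to the nearest tag.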
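
The following sketch walks a tiny list in each direction. Both HTML fragments are invented for illustration; the second one contains newlines between tags to show how whitespace text nodes affect `.next_sibling`.

```python
from bs4 import BeautifulSoup

# Compact markup: no whitespace between tags, so siblings are all <li> tags.
compact = "<ul id='menu'><li>Home</li><li>Books</li><li>Contact</li></ul>"
soup = BeautifulSoup(compact, "html.parser")
books = soup.find_all("li")[1]

parent_id = books.parent["id"]
child_texts = [child.get_text() for child in books.parent.children]
prev_text = books.previous_sibling.get_text()
next_text = books.next_sibling.get_text()

# Pretty-printed markup: the newline after the first <li> is its next sibling.
pretty = "<ul>\n<li>Home</li>\n<li>Books</li>\n</ul>"
first_li = BeautifulSoup(pretty, "html.parser").find("li")
whitespace_node = first_li.next_sibling               # a newline string
next_tag_text = first_li.find_next_sibling("li").get_text()

print(parent_id, child_texts, prev_text, next_text)
print(repr(whitespace_node), next_tag_text)
```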

## Handling Dynamic Content
Some websites load data dynamically with JavaScript (e.g., React, Angular). BeautifulSoup cannot execute JavaScript; for those cases, you need tools like **Selenium** or **Playwright** that control a real browser.
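
One way to recognize this situation is to parse the HTML the server actually returns and check whether the elements you expect are there. The fragment below is a hypothetical example of such a page: the product list is empty in the markup, and the script (with its invented `/api/products` endpoint and `render` function) would only fill it in inside a real browser.

```python
from bs4 import BeautifulSoup

# Hypothetical server response for a JavaScript-rendered page.
html = """
<div id="app"><ul id="products"></ul></div>
<script>
  fetch("/api/products").then(render);
</script>
"""
soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup sees only the static markup; the script is never executed.
items = soup.select("#products li")
has_script = soup.find("script") is not None

print(len(items))
print(has_script)
```

An empty result combined with script tags on the page is a hint, not proof, that the content is rendered client-side; viewing the page source in a browser is a quick way to confirm.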

## Respecting Website Rules
- Always check `robots.txt` (e.g., `https://example.com/robots.txt`) to see which paths are allowed.
- Set a custom `User-Agent` header to identify your bot.
- Add delays between requests (`time.sleep()`) to avoid overwhelming the server.
- Read the website's terms of service – some prohibit scraping.
- Use official APIs whenever available – they are more reliable and polite.
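
The `robots.txt` check can be automated with the standard library's `urllib.robotparser`. In this sketch the rules are a made-up example parsed from a string so that it runs offline; a real scraper would call `parser.set_url(...)` and `parser.read()` to download the file instead. The bot name `MyScraperBot/1.0` is likewise just a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used in place of a downloaded file.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

USER_AGENT = "MyScraperBot/1.0"
headers = {"User-Agent": USER_AGENT}  # pass as headers=... to requests.get()

allowed = parser.can_fetch(USER_AGENT, "https://example.com/books/")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait, or None if unset

print(allowed, blocked, delay)
```

When `crawl_delay()` returns a number, passing it to `time.sleep()` between requests is a simple way to honor the site's stated limit.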

## Common HTTP Status Codes
- `200` – OK
- `301` / `302` – redirect (requests follows them by default)
- `404` – Not Found
- `403` – Forbidden (you may need headers or authentication)
- `429` – Too Many Requests (slow down)
- `500` – Internal Server Error (server issue, retry later)
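
As a rough illustration of how a scraper might react to these codes, the sketch below maps a status code to a coarse decision. The function name `next_action` and the action labels are invented for this example; real projects will usually need finer-grained handling.

```python
def next_action(status_code):
    """Map an HTTP status code to a coarse scraper decision."""
    if status_code == 200:
        return "parse"
    if status_code == 429:
        return "slow_down"      # too many requests: wait before trying again
    if status_code in (403, 404):
        return "skip"           # forbidden or missing: retrying will not help
    if 500 <= status_code < 600:
        return "retry_later"    # server-side problem
    return "inspect"            # anything else deserves a manual look


for code in (200, 404, 429, 500):
    print(code, next_action(code))
```

Redirects (`301` / `302`) do not appear here because `requests` follows them before your code sees the final response.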

## Error Handling in Scraping
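Wrap your requests and parsing in `try`-`except` blocks to handle network errors, missing elements, or malformed HTML. Call `response.raise_for_status()` after each request: it raises `requests.exceptions.HTTPError` for 4xx and 5xx status codes, so failures surface immediately instead of being parsed as if they were valid pages.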
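
A minimal sketch of both kinds of protection follows. The helper names `fetch_html` and `text_or_default`, the injectable `get` parameter, and the 10-second timeout are assumptions made for this example, not part of `requests` or BeautifulSoup. Catching `requests.exceptions.RequestException` covers timeouts, connection errors, and the `HTTPError` raised by `raise_for_status()`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, get=requests.get):
    """Return the page HTML, or None if the request fails.

    `get` is injectable so the function can be tried out with a stand-in
    instead of a real network call.
    """
    try:
        response = get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
        return None
    return response.text


def text_or_default(parent, selector, default="N/A"):
    """Return the stripped text of the first match, or a default if absent."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else default


soup = BeautifulSoup("<div><h2> Title </h2></div>", "html.parser")
print(text_or_default(soup, "h2"))
print(text_or_default(soup, "span.price"))
```

Returning `None` or a default value keeps one bad page or one missing field from stopping an entire scraping run.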

## Storing Scraped Data
- **CSV** – good for spreadsheets.
- **JSON** – great for nested data.
- **SQLite** – for larger datasets.
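
The JSON and SQLite options can be sketched with the standard library alone. The records, table name, and column names below are invented sample data, and the database lives in memory so no file is created.

```python
import json
import sqlite3

books = [
    {"title": "Python Programming", "price": 29.99, "tags": ["python", "beginner"]},
    {"title": "Web Development", "price": 24.99, "tags": ["web"]},
]

# JSON keeps the nested "tags" list intact.
as_json = json.dumps(books, indent=2)

# SQLite stores flat rows; ":memory:" keeps this demo off the disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO books (title, price) VALUES (?, ?)",
    [(b["title"], b["price"]) for b in books],
)
conn.commit()

# Once the data is in a table, filtering becomes a query.
cheap = conn.execute(
    "SELECT title FROM books WHERE price < ? ORDER BY title", (25,)
).fetchall()
conn.close()

print(as_json)
print(cheap)
```

For a real project, replacing `":memory:"` with a filename such as `books.db` makes the data persist between runs.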

## Limitations and Risks
- Websites can change their HTML structure – your scraper may break.
- IP blocking – use proxies or respect rate limits.
- Legal issues – only scrape public data and respect copyright.

## Practice Exercises
1. Scrape the titles and prices of books from a sample bookstore page.
2. Extract all links (`<a href>`) from a webpage.
3. Scrape a table (like the product table in example 3) and save to CSV.
4. Build a news headline scraper that collects titles, dates, and summaries.
5. Implement polite scraping with delays and a custom User-Agent.

This lesson provides **8 complete examples** (simulated to avoid live requests) covering basic fetching, extracting data, saving to CSV, advanced techniques, error handling, ethics, and a full book scraper project.

```python
# Web Scraping with BeautifulSoup
import csv
import random
import time

import requests
from bs4 import BeautifulSoup

print("WEB SCRAPING BASICS WITH BEAUTIFULSOUP")
print("=" * 60)

# Note: For actual web scraping, you may need to install:
# pip install requests beautifulsoup4

# Example 1: Basic web page fetching
print("\n1. BASIC WEB PAGE FETCHING")
print("-" * 30)

# Let's use a sample HTML for demonstration
sample_html = '''
<!DOCTYPE html>
<html>
<head>
    <title>Sample Book Store</title>
</head>
<body>
    <h1>Welcome to Our Book Store</h1>
    <div class="book-list">
        <div class="book">
            <h2 class="title">Python Programming</h2>
            <p class="author">John Doe</p>
            <span class="price">$29.99</span>
            <a href="/book/python-programming">Details</a>
        </div>
        <div class="book">
            <h2 class="title">Data Science Basics</h2>
            <p class="author">Jane Smith</p>
            <span class="price">$34.99</span>
            <a href="/book/data-science">Details</a>
        </div>
        <div class="book">
            <h2 class="title">Web Development</h2>
            <p class="author">Bob Johnson</p>
            <span class="price">$24.99</span>
            <a href="/book/web-dev">Details</a>
        </div>
    </div>
    <div class="footer">
        <p>Contact: info@bookstore.com</p>
    </div>
</body>
</html>
'''

# Parse the HTML
soup = BeautifulSoup(sample_html, 'html.parser')

print("Page Title:", soup.title.text)
print("First h1 tag:", soup.h1.text)

# Find all book titles
print("\nBook Titles:")
book_titles = soup.find_all('h2', class_='title')
for i, title in enumerate(book_titles, 1):
    print(f"{i}. {title.text}")

# Example 2: Extracting specific data
print("\n\n2. EXTRACTING SPECIFIC DATA")
print("-" * 30)


# Find all books
def extract_book_data(html_content):
    """Extract book information from HTML"""
    soup = BeautifulSoup(html_content, 'html.parser')
    books = []

    # Find all book divs
    book_divs = soup.find_all('div', class_='book')

    for book in book_divs:
        # Extract data, falling back to 'N/A' when an element is missing
        title = book.find('h2', class_='title').text if book.find('h2', class_='title') else 'N/A'
        author = book.find('p', class_='author').text if book.find('p', class_='author') else 'N/A'
        price = book.find('span', class_='price').text if book.find('span', class_='price') else 'N/A'
        link = book.find('a')['href'] if book.find('a') else 'N/A'

        books.append({
            'title': title,
            'author': author,
            'price': price,
            'link': link
        })

    return books


books_data = extract_book_data(sample_html)
print("Extracted Book Data:")
print("=" * 50)
for i, book in enumerate(books_data, 1):
    print(f"\nBook {i}:")
    print(f"  Title: {book['title']}")
    print(f"  Author: {book['author']}")
    print(f"  Price: {book['price']}")
    print(f"  Link: {book['link']}")

# Example 3: Working with real website (with caution)
print("\n\n3. WORKING WITH REAL WEBSITES")
print("-" * 30)
print("Note: Always check robots.txt and terms of service!")
print("Let's use a public demo site instead of a real one.")

# Inline demo HTML stands in for the demo site
demo_html = '''
<html>
<body>
    <table id="products">
        <tr>
            <th>Product</th>
            <th>Price</th>
            <th>Stock</th>
        </tr>
        <tr>
            <td>Laptop</td>
            <td>$999</td>
            <td>In Stock</td>
        </tr>
        <tr>
            <td>Mouse</td>
            <td>$25</td>
            <td>Out of Stock</td>
        </tr>
        <tr>
            <td>Keyboard</td>
            <td>$79</td>
            <td>In Stock</td>
        </tr>
    </table>
</body>
</html>
'''

# Parse table data
table_soup = BeautifulSoup(demo_html, 'html.parser')
table = table_soup.find('table', id='products')

print("\nProduct Table:")
print("-" * 40)
if table:
    rows = table.find_all('tr')
    for row in rows:
        cols = row.find_all(['th', 'td'])
        row_data = [col.text.strip() for col in cols]
        print(f"{row_data[0]:15} {row_data[1]:10} {row_data[2]}")
else:
    print("Table not found")

# Example 4: Saving scraped data to CSV
print("\n\n4. SAVING DATA TO CSV")
print("-" * 30)

# Extract product data
products = []
if table:
    rows = table.find_all('tr')[1:]  # Skip header row
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 3:
            product = {
                'name': cols[0].text.strip(),
                'price': cols[1].text.strip(),
                'stock': cols[2].text.strip()
            }
            products.append(product)

# Save to CSV
filename = 'products.csv'
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['name', 'price', 'stock']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for product in products:
        writer.writerow(product)

print(f"Saved {len(products)} products to {filename}")
print("\nCSV Content:")
with open(filename, 'r', encoding='utf-8') as f:
    print(f.read())

# Example 5: Advanced scraping techniques
print("\n\n5. ADVANCED SCRAPING TECHNIQUES")
print("-" * 30)

# Complex HTML with nested structure
complex_html = '''
<div class="news-container">
    <article class="news-item featured">
        <h2><a href="/news/1">Python 3.12 Released</a></h2>
        <div class="meta">
            <span class="date">2024-03-15</span>
            <span class="category">Technology</span>
            <span class="views">1,234 views</span>
        </div>
        <p class="summary">The latest version of Python includes new features...</p>
        <ul class="tags">
            <li>Python</li>
            <li>Programming</li>
            <li>Update</li>
        </ul>
    </article>
    <article class="news-item">
        <h2><a href="/news/2">AI Breakthrough</a></h2>
        <div class="meta">
            <span class="date">2024-03-14</span>
            <span class="category">Science</span>
            <span class="views">2,345 views</span>
        </div>
        <p class="summary">Researchers announce new AI model...</p>
        <ul class="tags">
            <li>AI</li>
            <li>Research</li>
            <li>Machine Learning</li>
        </ul>
    </article>
</div>
'''

# Parse complex structure
news_soup = BeautifulSoup(complex_html, 'html.parser')

print("News Articles:")
print("=" * 40)

articles = news_soup.find_all('article', class_='news-item')
for i, article in enumerate(articles, 1):
    # Extract each field with find(), falling back to 'N/A' when it is missing
    title = article.find('h2').text.strip()
    link = article.find('a')['href']
    date = article.find('span', class_='date').text if article.find('span', class_='date') else 'N/A'
    category = article.find('span', class_='category').text if article.find('span', class_='category') else 'N/A'
    summary = article.find('p', class_='summary').text.strip() if article.find('p', class_='summary') else 'N/A'

    # Extract tags
    tags = [tag.text for tag in article.find_all('li')]

    print(f"\nArticle {i}:")
    print(f"  Title: {title}")
    print(f"  Link: {link}")
    print(f"  Date: {date}")
    print(f"  Category: {category}")
    print(f"  Summary: {summary[:50]}...")
    print(f"  Tags: {', '.join(tags)}")

    # Check if featured
    if 'featured' in article.get('class', []):
        print("  ★ Featured Article")

# Example 6: Error handling in web scraping
print("\n\n6. ERROR HANDLING IN WEB SCRAPING")
print("-" * 30)


def safe_scrape(url):
    """Safely scrape a webpage with error handling"""
    try:
        # In real scraping, you would use:
        # response = requests.get(url, headers={'User-Agent': 'Your Bot'})
        # response.raise_for_status()  # Check for HTTP errors

        # For demo, simulate different scenarios
        scenarios = [
            "success",
            "404_error",
            "timeout",
            "parse_error"
        ]

        scenario = random.choice(scenarios)

        if scenario == "success":
            print(f"Successfully fetched {url}")
            # Parse would happen here
            return {"status": "success", "data": "Sample data"}
        elif scenario == "404_error":
            print(f"Error: Page not found (404) for {url}")
            return {"status": "error", "message": "Page not found"}
        elif scenario == "timeout":
            print(f"Error: Request timed out for {url}")
            return {"status": "error", "message": "Request timeout"}
        elif scenario == "parse_error":
            print(f"Error: Could not parse HTML from {url}")
            return {"status": "error", "message": "Parsing failed"}

    except Exception as e:
        print(f"Unexpected error: {e}")
        return {"status": "error", "message": str(e)}


# Test error handling
print("Testing error handling scenarios:")
for i in range(3):
    result = safe_scrape(f"https://example.com/page{i}")
    print(f"Result: {result['status']}")
    if result['status'] == "error":
        print(f"  Reason: {result['message']}")
    print()

# Example 7: Web scraping etiquette
print("\n\n7. WEB SCRAPING ETIQUETTE")
print("-" * 30)

print("Important rules for ethical web scraping:")
print("1. Check robots.txt (e.g., https://example.com/robots.txt)")
print("2. Respect rate limits (add delays between requests)")
print("3. Identify your bot with User-Agent header")
print("4. Don't overload servers")
print("5. Check website's terms of service")
print("6. Only scrape publicly available data")
print("7. Consider using official APIs if available")


# Example of polite scraping with delay
def polite_scraper(urls):
    """Scrape multiple URLs with delays"""
    scraped_data = []

    for i, url in enumerate(urls):
        print(f"Scraping {url}...")

        # Simulate request
        # response = requests.get(url, headers={
        #     'User-Agent': 'MyScraperBot/1.0 (educational-purpose)'
        # })

        # Add delay to be polite (2-5 seconds between requests)
        if i > 0:
            delay = 3  # seconds
            print(f"Waiting {delay} seconds to be polite...")
            # time.sleep(delay)

        # Process response...
        scraped_data.append({"url": url, "data": f"Data from {url}"})

    return scraped_data


# Example 8: Complete web scraping project
print("\n\n8. COMPLETE WEB SCRAPING PROJECT")
print("-" * 30)


class BookScraper:
    """A simple book scraper for demonstration"""

    def __init__(self):
        self.books = []

    def scrape_sample_data(self):
        """Scrape from sample HTML (in real life, this would fetch from URL)"""
        # Sample data representing a bookstore
        html_content = '''
        <div class="books">
            <div class="book">
                <h3>Python Cookbook</h3>
                <p class="author">David Beazley</p>
                <p class="price">$49.99</p>
                <p class="rating">★★★★☆ (4.2/5)</p>
            </div>
            <div class="book">
                <h3>Fluent Python</h3>
                <p class="author">Luciano Ramalho</p>
                <p class="price">$44.99</p>
                <p class="rating">★★★★★ (4.7/5)</p>
            </div>
        </div>
        '''

        soup = BeautifulSoup(html_content, 'html.parser')
        book_divs = soup.find_all('div', class_='book')

        for book_div in book_divs:
            book = {
                'title': book_div.find('h3').text if book_div.find('h3') else 'N/A',
                'author': book_div.find('p', class_='author').text if book_div.find('p', class_='author') else 'N/A',
                'price': book_div.find('p', class_='price').text if book_div.find('p', class_='price') else 'N/A',
                'rating': book_div.find('p', class_='rating').text if book_div.find('p', class_='rating') else 'N/A'
            }
            self.books.append(book)

        return len(self.books)

    def display_books(self):
        """Display all scraped books"""
        print(f"\nFound {len(self.books)} books:")
        print("=" * 50)
        for i, book in enumerate(self.books, 1):
            print(f"\nBook {i}:")
            print(f"  Title: {book['title']}")
            print(f"  Author: {book['author']}")
            print(f"  Price: {book['price']}")
            print(f"  Rating: {book['rating']}")

    def save_to_csv(self, filename):
        """Save books to CSV file"""
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['title', 'author', 'price', 'rating']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            writer.writeheader()
            writer.writerows(self.books)

        print(f"\nSaved {len(self.books)} books to {filename}")

    def filter_by_price(self, max_price):
        """Filter books by maximum price"""
        affordable_books = []
        for book in self.books:
            # Convert "$49.99" to 49.99; skip books whose price cannot be parsed
            try:
                price_str = book['price'].replace('$', '').strip()
                price = float(price_str)
                if price <= max_price:
                    affordable_books.append(book)
            except (ValueError, AttributeError):
                continue

        return affordable_books


# Run the scraper
print("Running Book Scraper...")
scraper = BookScraper()
count = scraper.scrape_sample_data()
print(f"Scraped {count} books")

scraper.display_books()

# Save to CSV
scraper.save_to_csv('books.csv')

# Filter books
print("\nBooks under $45:")
affordable = scraper.filter_by_price(45)
for book in affordable:
    print(f"- {book['title']}: {book['price']}")
```