# Web Scraping Basics with BeautifulSoup

Web scraping is like sending a robot to read websites and copy information for you. Imagine you want to collect all the prices from an online store: instead of copying each one by hand, you write a program that extracts them automatically.

More precisely, **web scraping** is the automated process of extracting data from websites: a program downloads web pages and parses their HTML to retrieve the data you need. Python offers two widely used libraries for this: `requests` to fetch web pages and `BeautifulSoup` (from the `bs4` package) to parse HTML and extract specific elements.

## Installation
Before scraping, install the required libraries:
```bash
pip install requests beautifulsoup4
```

## Basic Workflow
1. Send an HTTP request to a URL using `requests.get()`.
2. Check the response status code (200 means success).
3. Parse the HTML content with `BeautifulSoup(response.text, 'html.parser')`.
4. Use BeautifulSoup methods (`.find()`, `.find_all()`, CSS selectors) to locate the desired data.
5. Extract text, attributes, or nested elements.
6. Store the data (list, dictionary, CSV, JSON, or database).
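
The steps above can be sketched as two small functions. This is a minimal illustration, not a definitive implementation: the function names `fetch_html` and `extract_titles`, the 10-second timeout, and the inline sample markup are assumptions made for the example. Only the parsing half is run here, so no live request is made.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 1-2: request the page and raise an exception on a bad status code."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def extract_titles(html):
    """Steps 3-5: parse the HTML and return the text of each matching element."""
    soup = BeautifulSoup(html, 'html.parser')
    return [h2.get_text(strip=True) for h2 in soup.find_all('h2', class_='title')]


# Step 6: store the data; here it is simply kept in a list.
sample = (
    '<div><h2 class="title">Python Programming</h2>'
    '<h2 class="title">Web Development</h2></div>'
)
titles = extract_titles(sample)
print(titles)
```

With a real page you would call `extract_titles(fetch_html(url))` after confirming that the site permits scraping.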

## BeautifulSoup Basics

### Finding Elements
- `soup.find('tag')` – returns the first matching element.
- `soup.find_all('tag')` – returns a list of all matching elements.
- `soup.find('tag', class_='className')` – find by class (note the underscore).
- `soup.find('tag', id='idName')` – find by id.
- `soup.select('css.selector')` – use CSS selectors (e.g., `div.book > h2`); returns a list of matches, while `soup.select_one()` returns only the first.
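
As a quick illustration of the lookups listed above, the sketch below runs each one against a small inline document. The markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

html = '''
<div id="catalog">
  <div class="book"><h2>Python Programming</h2><span class="price">$29.99</span></div>
  <div class="book"><h2>Data Science Basics</h2><span class="price">$34.99</span></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_heading = soup.find('h2')                   # first <h2> only
all_headings = soup.find_all('h2')                # every <h2>
first_price = soup.find('span', class_='price')   # match on the class attribute
catalog = soup.find('div', id='catalog')          # match on the id attribute
selected = soup.select('div.book > h2')           # CSS selector, returns a list

print(first_heading.text)
print(len(all_headings))
print(first_price.text)
print(catalog['id'])
print([h2.text for h2 in selected])
```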

### Extracting Data
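- `.text` or `.get_text()` – returns the text inside an element and its descendants, with the tags removed; `.get_text(strip=True)` also trims surrounding whitespace.
- `['attribute']` – gets the value of an attribute (e.g., `a['href']`) and raises `KeyError` if the attribute is missing; `.get('attribute')` returns `None` instead.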
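
The difference between these accessors is easiest to see side by side. The snippet below is a small sketch on made-up markup; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '<p class="author">  Jane Smith </p><a href="/book/data-science">Details</a>'
soup = BeautifulSoup(html, 'html.parser')
author = soup.find('p', class_='author')
link = soup.find('a')

raw_text = author.text                     # keeps the surrounding whitespace
clean_text = author.get_text(strip=True)   # whitespace trimmed
href = link['href']                        # would raise KeyError if href were missing
missing = link.get('title')                # None, because <a> has no title attribute

print(repr(raw_text))
print(repr(clean_text))
print(href)
print(missing)
```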

### Navigating the Parse Tree
- `.parent` – get the parent element.
- `.children` – iterate over direct children.
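- `.next_sibling` / `.previous_sibling` – navigate between siblings. In formatted HTML the neighbor is often a whitespace text node; `.find_next_sibling()` / `.find_previous_sibling()` skip ahead to the nearest tag.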
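
A short sketch of tree navigation, using made-up markup. The first document has no whitespace between tags, so every sibling is a tag. The second shows how a line break between tags becomes a sibling of its own.

```python
from bs4 import BeautifulSoup

compact = BeautifulSoup(
    '<ul class="tags"><li>Python</li><li>Programming</li><li>Update</li></ul>',
    'html.parser',
)
first_li = compact.find('li')

parent_name = first_li.parent.name                               # name of the enclosing tag
child_texts = [child.get_text() for child in first_li.parent.children]
next_text = first_li.next_sibling.get_text()                     # the second <li>
previous = first_li.previous_sibling                             # None: nothing comes before it

# With line breaks between tags, the immediate sibling is a text node.
pretty = BeautifulSoup('<ul>\n<li>A</li>\n<li>B</li>\n</ul>', 'html.parser')
a = pretty.find('li')
whitespace_sibling = a.next_sibling                   # the newline between the two <li> tags
next_tag_text = a.find_next_sibling('li').get_text()  # skips the whitespace

print(parent_name, child_texts, next_text, previous)
print(repr(whitespace_sibling), next_tag_text)
```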

## Handling Dynamic Content
Some websites load data dynamically with JavaScript (e.g., React, Angular). BeautifulSoup cannot execute JavaScript; for those cases, you need tools like **Selenium** or **Playwright** that control a real browser.
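
To see why this matters, the sketch below parses a page whose content would only appear after its script runs in a browser. The markup is invented for the illustration. BeautifulSoup sees the script as text but never executes it, so the container stays empty.

```python
from bs4 import BeautifulSoup

page_source = '''
<div id="app"></div>
<script>
  document.getElementById("app").textContent = "3 books in stock";
</script>
'''
soup = BeautifulSoup(page_source, 'html.parser')

app = soup.find('div', id='app')
rendered_text = app.get_text(strip=True)        # empty: the script never ran
script_present = soup.find('script') is not None

print(repr(rendered_text))
print(script_present)
```

If the data you need is missing from the downloaded HTML in this way, that is a sign the page is rendered client-side.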

## Respecting Website Rules
- Always check `robots.txt` (e.g., `https://example.com/robots.txt`) to see which paths are allowed.
- Set a custom `User-Agent` header to identify your bot.
- Add delays between requests (`time.sleep()`) to avoid overwhelming the server.
- Read the website's terms of service – some prohibit scraping.
- Use official APIs whenever available – they are more reliable and polite.
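
The `robots.txt` check can be automated with the standard library's `urllib.robotparser`. The sketch below is a minimal example under stated assumptions: the rules are an inline sample rather than a real site's file, `is_allowed` is a hypothetical helper, and the User-Agent string is a placeholder. Against a live site you would call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration; a real scraper would download the site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = 'MyScraperBot/1.0 (educational-purpose)'  # placeholder bot name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url):
    """Return True if the sample rules permit this bot to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed('https://example.com/books/'))
print(is_allowed('https://example.com/private/report.html'))
```

The same `USER_AGENT` string can be sent as the `User-Agent` header of each request, with a `time.sleep()` call between requests.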

## Common HTTP Status Codes
- `200` – OK
- `301` / `302` – redirect (requests follows them by default)
- `404` – Not Found
- `403` – Forbidden (you may need headers or authentication)
- `429` – Too Many Requests (slow down)
- `500` – Internal Server Error (server issue, retry later)
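
One way to act on these codes is a small lookup helper. The sketch below is only an illustration: `describe_status` and its three actions ("parse", "skip", "retry later") are assumptions for the example, not part of `requests`. The reason phrases come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (reason phrase, suggested action) for an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if code == 200:
        action = 'parse'
    elif code == 429 or 500 <= code < 600:
        action = 'retry later'   # rate limited or server-side problem
    else:
        action = 'skip'          # e.g. 403 or 404: retrying will not help
    return phrase, action


for code in (200, 403, 404, 429, 500):
    print(code, describe_status(code))
```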

## Error Handling in Scraping
Wrap your requests and parsing in `try`-`except` blocks to handle network errors, missing elements, or malformed HTML. Use `response.raise_for_status()` to raise an exception for bad status codes.
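
A minimal sketch of both kinds of protection follows. The helper names `fetch_or_none` and `text_or_default`, the timeout, and the sample markup are assumptions made for the example. `requests.RequestException` is the base class for the library's network and HTTP errors, so one `except` clause covers timeouts, connection failures, and the error raised by `raise_for_status()`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_or_none(url):
    """Return the page HTML, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=10)  # timeout value is an assumption
        response.raise_for_status()               # turns 4xx/5xx into an exception
    except requests.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
        return None
    return response.text


def text_or_default(parent, name, default='N/A', **attrs):
    """Return the stripped text of the first match, or a default if it is missing."""
    element = parent.find(name, **attrs)
    return element.get_text(strip=True) if element is not None else default


soup = BeautifulSoup('<div class="book"><h2>Fluent Python</h2></div>', 'html.parser')
book = soup.find('div', class_='book')
title = text_or_default(book, 'h2')
price = text_or_default(book, 'span', class_='price')  # element is absent
print(title, price)
```

Returning `None` or a default keeps a long scraping run going when one page or one field is missing; whether that is preferable to stopping depends on how complete the data must be.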

## Storing Scraped Data
- **CSV** – good for spreadsheets.
- **JSON** – great for nested data.
- **SQLite** – for larger datasets.
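
The three options can be compared on the same records using only the standard library. This is a sketch with made-up data; it writes to an in-memory buffer and an in-memory database so that nothing touches the disk, whereas a real scraper would pass file paths.

```python
import csv
import io
import json
import sqlite3

books = [
    {'title': 'Python Programming', 'price': '$29.99'},
    {'title': 'Data Science Basics', 'price': '$34.99'},
]

# CSV: one row per record, columns taken from the dictionary keys.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=['title', 'price'])
writer.writeheader()
writer.writerows(books)
csv_text = csv_buffer.getvalue()

# JSON: keeps lists and nested dictionaries as they are.
json_text = json.dumps(books, indent=2)

# SQLite: a queryable table; ':memory:' avoids creating a file.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE books (title TEXT, price TEXT)')
conn.executemany('INSERT INTO books VALUES (:title, :price)', books)
conn.commit()
row_count = conn.execute('SELECT COUNT(*) FROM books').fetchone()[0]
conn.close()

print(csv_text)
print(json_text)
print(row_count)
```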

## Limitations and Risks
- Websites can change their HTML structure – your scraper may break.
- IP blocking – use proxies or respect rate limits.
- Legal issues – only scrape public data and respect copyright.

## Practice Exercises
1. Scrape the titles and prices of books from a sample bookstore page.
2. Extract all links (`<a href>`) from a webpage.
3. Scrape a table (like the product table in example 3) and save to CSV.
4. Build a news headline scraper that collects titles, dates, and summaries.
5. Implement polite scraping with delays and a custom User-Agent.

This lesson provides **8 complete examples** (simulated to avoid live requests) covering basic fetching, extracting data, saving to CSV, advanced techniques, error handling, ethics, and a full book scraper project.
# Web Scraping with BeautifulSoup
import requests
from bs4 import BeautifulSoup
import csv
import time

print("WEB SCRAPING BASICS WITH BEAUTIFULSOUP")
print("=" * 60)

# Note: For actual web scraping, you may need to install:
# pip install requests beautifulsoup4

# Example 1: Basic web page fetching
print("\n1. BASIC WEB PAGE FETCHING")
print("-" * 30)

# Let's use a sample HTML for demonstration
sample_html = '''
<!DOCTYPE html>
<html>
<head>
    <title>Sample Book Store</title>
</head>
<body>
    <h1>Welcome to Our Book Store</h1>
    <div class="book-list">
        <div class="book">
            <h2 class="title">Python Programming</h2>
            <p class="author">John Doe</p>
            <span class="price">$29.99</span>
            <a href="/book/python-programming">Details</a>
        </div>
        <div class="book">
            <h2 class="title">Data Science Basics</h2>
            <p class="author">Jane Smith</p>
            <span class="price">$34.99</span>
            <a href="/book/data-science">Details</a>
        </div>
        <div class="book">
            <h2 class="title">Web Development</h2>
            <p class="author">Bob Johnson</p>
            <span class="price">$24.99</span>
            <a href="/book/web-dev">Details</a>
        </div>
    </div>
    <div class="footer">
        <p>Contact: info@bookstore.com</p>
    </div>
</body>
</html>
'''

# Parse the HTML
soup = BeautifulSoup(sample_html, 'html.parser')

print("Page Title:", soup.title.text)
print("First h1 tag:", soup.h1.text)

# Find all book titles
print("\nBook Titles:")
book_titles = soup.find_all('h2', class_='title')
for i, title in enumerate(book_titles, 1):
    print(f"{i}. {title.text}")

# Example 2: Extracting specific data
print("\n\n2. EXTRACTING SPECIFIC DATA")
print("-" * 30)

# Find all books
def extract_book_data(html_content):
    """Extract book information from HTML"""
    soup = BeautifulSoup(html_content, 'html.parser')
    books = []
    
    # Find all book divs
    book_divs = soup.find_all('div', class_='book')
    
    for book in book_divs:
        # Extract data with error handling
        title = book.find('h2', class_='title').text if book.find('h2', class_='title') else 'N/A'
        author = book.find('p', class_='author').text if book.find('p', class_='author') else 'N/A'
        price = book.find('span', class_='price').text if book.find('span', class_='price') else 'N/A'
        link = book.find('a')['href'] if book.find('a') else 'N/A'
        
        books.append({
            'title': title,
            'author': author,
            'price': price,
            'link': link
        })
    
    return books

books_data = extract_book_data(sample_html)

print("Extracted Book Data:")
print("=" * 50)
for i, book in enumerate(books_data, 1):
    print(f"\nBook {i}:")
    print(f"  Title: {book['title']}")
    print(f"  Author: {book['author']}")
    print(f"  Price: {book['price']}")
    print(f"  Link: {book['link']}")

# Example 3: Working with real website (with caution)
print("\n\n3. WORKING WITH REAL WEBSITES")
print("-" * 30)

print("Note: Always check robots.txt and terms of service!")
print("To stay offline, this example parses an inline HTML table instead of a live site.")

# Inline demo HTML standing in for a fetched page
demo_html = '''
<html>
<body>
<table id="products">
    <tr>
        <th>Product</th>
        <th>Price</th>
        <th>Stock</th>
    </tr>
    <tr>
        <td>Laptop</td>
        <td>$999</td>
        <td>In Stock</td>
    </tr>
    <tr>
        <td>Mouse</td>
        <td>$25</td>
        <td>Out of Stock</td>
    </tr>
    <tr>
        <td>Keyboard</td>
        <td>$79</td>
        <td>In Stock</td>
    </tr>
    </table>
</body>
</html>
'''

# Parse table data
table_soup = BeautifulSoup(demo_html, 'html.parser')
table = table_soup.find('table', id='products')

print("\nProduct Table:")
print("-" * 40)

if table:
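    rows = table.find_all('tr')
    for row in rows:
        # Header cells are <th>, data cells are <td>
        cols = row.find_all(['th', 'td'])
        row_data = [col.text.strip() for col in cols]
        if len(row_data) >= 3:  # skip malformed rows instead of raising IndexError
            print(f"{row_data[0]:15} {row_data[1]:10} {row_data[2]}")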
else:
    print("Table not found")

# Example 4: Saving scraped data to CSV
print("\n\n4. SAVING DATA TO CSV")
print("-" * 30)

# Extract product data
products = []
if table:
    rows = table.find_all('tr')[1:]  # Skip header row
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 3:
            product = {
                'name': cols[0].text.strip(),
                'price': cols[1].text.strip(),
                'stock': cols[2].text.strip()
            }
            products.append(product)

# Save to CSV
filename = 'products.csv'
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['name', 'price', 'stock']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
    writer.writeheader()
    for product in products:
        writer.writerow(product)

print(f"Saved {len(products)} products to {filename}")
print("\nCSV Content:")
with open(filename, 'r', encoding='utf-8') as f:  # read back with the same encoding used for writing
    print(f.read())

# Example 5: Advanced scraping techniques
print("\n\n5. ADVANCED SCRAPING TECHNIQUES")
print("-" * 30)

# Complex HTML with nested structure
complex_html = '''
<div class="news-container">
    <article class="news-item featured">
        <h2><a href="/news/1">Python 3.12 Released</a></h2>
        <div class="meta">
            <span class="date">2024-03-15</span>
            <span class="category">Technology</span>
            <span class="views">1,234 views</span>
        </div>
        <p class="summary">The latest version of Python includes new features...</p>
        <ul class="tags">
            <li>Python</li>
            <li>Programming</li>
            <li>Update</li>
        </ul>
    </article>
    <article class="news-item">
        <h2><a href="/news/2">AI Breakthrough</a></h2>
        <div class="meta">
            <span class="date">2024-03-14</span>
            <span class="category">Science</span>
            <span class="views">2,345 views</span>
        </div>
        <p class="summary">Researchers announce new AI model...</p>
        <ul class="tags">
            <li>AI</li>
            <li>Research</li>
            <li>Machine Learning</li>
        </ul>
    </article>
</div>
'''

# Parse complex structure
news_soup = BeautifulSoup(complex_html, 'html.parser')

print("News Articles:")
print("=" * 40)

articles = news_soup.find_all('article', class_='news-item')
for i, article in enumerate(articles, 1):
    # Extract each field with find(), falling back to 'N/A' when optional elements are missing
    title = article.find('h2').text.strip()
    link = article.find('a')['href']
    date = article.find('span', class_='date').text if article.find('span', class_='date') else 'N/A'
    category = article.find('span', class_='category').text if article.find('span', class_='category') else 'N/A'
    summary = article.find('p', class_='summary').text.strip() if article.find('p', class_='summary') else 'N/A'
    
    # Extract tags
    tags = [tag.text for tag in article.find_all('li')]
    
    print(f"\nArticle {i}:")
    print(f"  Title: {title}")
    print(f"  Link: {link}")
    print(f"  Date: {date}")
    print(f"  Category: {category}")
    print(f"  Summary: {summary[:50]}{'...' if len(summary) > 50 else ''}")
    print(f"  Tags: {', '.join(tags)}")
    
    # Check if featured
    if 'featured' in article.get('class', []):
        print("  ★ Featured Article")

# Example 6: Error handling in web scraping
print("\n\n6. ERROR HANDLING IN WEB SCRAPING")
print("-" * 30)

def safe_scrape(url):
    """Safely scrape a webpage with error handling"""
    try:
        # In real scraping, you would use:
        # response = requests.get(url, headers={'User-Agent': 'Your Bot'})
        # response.raise_for_status()  # Check for HTTP errors
        
        # For demo, simulate different scenarios
        scenarios = [
            "success",
            "404_error",
            "timeout",
            "parse_error"
        ]
        
        import random
        scenario = random.choice(scenarios)
        
        if scenario == "success":
            print(f"Successfully fetched {url}")
            # Parse would happen here
            return {"status": "success", "data": "Sample data"}
            
        elif scenario == "404_error":
            print(f"Error: Page not found (404) for {url}")
            return {"status": "error", "message": "Page not found"}
            
        elif scenario == "timeout":
            print(f"Error: Request timed out for {url}")
            return {"status": "error", "message": "Request timeout"}
            
        elif scenario == "parse_error":
            print(f"Error: Could not parse HTML from {url}")
            return {"status": "error", "message": "Parsing failed"}
            
    except Exception as e:
        print(f"Unexpected error: {e}")
        return {"status": "error", "message": str(e)}

# Test error handling
print("Testing error handling scenarios:")
for i in range(3):
    result = safe_scrape(f"https://example.com/page{i}")
    print(f"Result: {result['status']}")
    if result['status'] == "error":
        print(f"  Reason: {result['message']}")
    print()

# Example 7: Web scraping etiquette
print("\n\n7. WEB SCRAPING ETIQUETTE")
print("-" * 30)

print("Important rules for ethical web scraping:")
print("1. Check robots.txt (e.g., https://example.com/robots.txt)")
print("2. Respect rate limits (add delays between requests)")
print("3. Identify your bot with User-Agent header")
print("4. Don't overload servers")
print("5. Check website's terms of service")
print("6. Only scrape publicly available data")
print("7. Consider using official APIs if available")

# Example of polite scraping with delay
def polite_scraper(urls):
    """Scrape multiple URLs with delays"""
    scraped_data = []
    
    for i, url in enumerate(urls):
        # Wait before every request except the first (2-5 seconds is a polite range)
        if i > 0:
            delay = 3  # seconds
            print(f"Waiting {delay} seconds to be polite...")
            # time.sleep(delay)  # uncomment when making real requests
        
        print(f"Scraping {url}...")
        
        # Simulated request; in real scraping you would use:
        # response = requests.get(url, headers={
        #     'User-Agent': 'MyScraperBot/1.0 (educational-purpose)'
        # })
        
        # Process response...
        scraped_data.append({"url": url, "data": f"Data from {url}"})
    
    return scraped_data

# Example 8: Complete web scraping project
print("\n\n8. COMPLETE WEB SCRAPING PROJECT")
print("-" * 30)

class BookScraper:
    """A simple book scraper for demonstration"""
    
    def __init__(self):
        self.books = []
    
    def scrape_sample_data(self):
        """Scrape from sample HTML (in real life, this would fetch from URL)"""
        # Sample data representing a bookstore
        html_content = '''
        <div class="books">
            <div class="book">
                <h3>Python Cookbook</h3>
                <p class="author">David Beazley</p>
                <p class="price">$49.99</p>
                <p class="rating">★★★★☆ (4.2/5)</p>
            </div>
            <div class="book">
                <h3>Fluent Python</h3>
                <p class="author">Luciano Ramalho</p>
                <p class="price">$44.99</p>
                <p class="rating">★★★★★ (4.7/5)</p>
            </div>
        </div>
        '''
        
        soup = BeautifulSoup(html_content, 'html.parser')
        book_divs = soup.find_all('div', class_='book')
        
        for book_div in book_divs:
            book = {
                'title': book_div.find('h3').text if book_div.find('h3') else 'N/A',
                'author': book_div.find('p', class_='author').text if book_div.find('p', class_='author') else 'N/A',
                'price': book_div.find('p', class_='price').text if book_div.find('p', class_='price') else 'N/A',
                'rating': book_div.find('p', class_='rating').text if book_div.find('p', class_='rating') else 'N/A'
            }
            self.books.append(book)
        
        return len(self.books)
    
    def display_books(self):
        """Display all scraped books"""
        print(f"\nFound {len(self.books)} books:")
        print("=" * 50)
        
        for i, book in enumerate(self.books, 1):
            print(f"\nBook {i}:")
            print(f"  Title:  {book['title']}")
            print(f"  Author: {book['author']}")
            print(f"  Price:  {book['price']}")
            print(f"  Rating: {book['rating']}")
    
    def save_to_csv(self, filename):
        """Save books to CSV file"""
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['title', 'author', 'price', 'rating']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            
            writer.writeheader()
            writer.writerows(self.books)
        
        print(f"\nSaved {len(self.books)} books to {filename}")
    
    def filter_by_price(self, max_price):
        """Filter books by maximum price"""
        # Extract numeric price
        affordable_books = []
        for book in self.books:
            # Convert "$49.99" to 49.99
            try:
                price_str = book['price'].replace('$', '').strip()
                price = float(price_str)
                if price <= max_price:
                    affordable_books.append(book)
            except (ValueError, AttributeError):
                continue
        
        return affordable_books

# Run the scraper
print("Running Book Scraper...")
scraper = BookScraper()
count = scraper.scrape_sample_data()
print(f"Scraped {count} books")

scraper.display_books()

# Save to CSV
scraper.save_to_csv('books.csv')

# Filter books
print("\nBooks under $45:")
affordable = scraper.filter_by_price(45)
for book in affordable:
    print(f"- {book['title']}: {book['price']}")
