Scrape a site with httpx and BeautifulSoup, then hand messy text to Claude for structured extraction. A full project in 60 minutes.
We'll scrape the Hacker News front page, grab the HTML, and use Claude to extract each story into a clean, typed object. This "scrape + LLM-extract" pattern beats brittle CSS selectors whenever the page layout could change.
```python
# pyproject.toml dependencies:
# httpx, beautifulsoup4, anthropic, pydantic
import asyncio

import httpx
from bs4 import BeautifulSoup
from pydantic import BaseModel
from anthropic import AsyncAnthropic


class Story(BaseModel):
    rank: int
    title: str
    url: str | None
    points: int
    comments: int


async def fetch_html(url: str) -> str:
    headers = {"User-Agent": "tendril-scraper/1.0"}
    async with httpx.AsyncClient(headers=headers, timeout=10) as client:
        r = await client.get(url)
        r.raise_for_status()
        return r.text
```

Setup: typed model for a story, async HTML fetcher.
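One habit worth wiring in before fetching anything: respect the site's robots.txt. A minimal check with the standard library (a sketch, not part of the core flow; it's synchronous, so run it once at startup, and the user agent matches our fetcher's):

```python
from urllib.robotparser import RobotFileParser

# Bail out early if robots.txt disallows our user agent on this page.
rp = RobotFileParser()
rp.set_url("https://news.ycombinator.com/robots.txt")
rp.read()
if not rp.can_fetch("tendril-scraper/1.0", "https://news.ycombinator.com/"):
    raise SystemExit("robots.txt disallows fetching the front page")
```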
```python
def extract_text_blocks(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # HN wraps stories in <tr class="athing">; we'll just get the visible text of each row
    blocks = []
    for row in soup.select("tr.athing"):
        sibling = row.find_next_sibling("tr")
        text = row.get_text(" ", strip=True)
        if sibling:
            text += " " + sibling.get_text(" ", strip=True)
        link_tag = row.select_one(".titleline > a")
        url = link_tag.get("href") if link_tag else None
        blocks.append(f"URL={url}\n{text}")
    return blocks
```

BeautifulSoup gets us close — but we hand the messy text to an LLM for final structuring.
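It helps to eyeball one block before involving the model. The exact text depends on the live page; the shape is roughly this (illustrative values, not real output):

```python
# Quick look at what the LLM will actually receive.
html = asyncio.run(fetch_html("https://news.ycombinator.com/"))
print(extract_text_blocks(html)[0])
# Illustrative shape, not real data:
# URL=https://example.com/post
# 1. Example story title (example.com) 123 points by someone 2 hours ago | hide | 99 comments
```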
```python
client = AsyncAnthropic()


async def parse_story(block: str) -> Story | None:
    prompt = f"""Extract fields from this Hacker News row as JSON.
Fields: rank (int), title (str), url (str or null), points (int), comments (int).
Return ONLY valid JSON, no preface.
<row>
{block}
</row>"""
    try:
        response = await client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        # Strip a potential markdown code fence around the JSON.
        if raw.startswith("```"):
            raw = raw.strip("`").split("\n", 1)[1].rsplit("\n", 1)[0]
        return Story.model_validate_json(raw)
    except Exception as e:
        print(f"parse failed: {e}")
        return None


async def main():
    html = await fetch_html("https://news.ycombinator.com/")
    blocks = extract_text_blocks(html)[:10]
    stories = await asyncio.gather(*(parse_story(b) for b in blocks))
    for s in filter(None, stories):
        print(f"{s.rank:2}. {s.title} ({s.points} pts, {s.comments} comments)")


asyncio.run(main())
```

Pydantic validates the LLM's JSON output. Bad output -> ValidationError -> None.
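To see that guardrail in isolation, feed the schema a good and a bad payload by hand (both strings here are made-up examples):

```python
from pydantic import ValidationError

good = '{"rank": 1, "title": "Example story", "url": null, "points": 42, "comments": 7}'
bad = '{"rank": "first", "title": "oops"}'  # wrong type for rank, three fields missing

print(Story.model_validate_json(good))  # Story(rank=1, title='Example story', ...)
try:
    Story.model_validate_json(bad)
except ValidationError as e:
    print(f"rejected with {e.error_count()} errors")
```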
```python
import hashlib
from pathlib import Path

CACHE = Path(".cache")
CACHE.mkdir(exist_ok=True)


async def parse_story_cached(block: str) -> Story | None:
    # Key the cache on a hash of the input block itself: if the row's text
    # changes (new points, new comments), the key changes with it.
    key = hashlib.sha256(block.encode()).hexdigest()[:16]
    cache_file = CACHE / f"{key}.json"
    if cache_file.exists():
        return Story.model_validate_json(cache_file.read_text())
    result = await parse_story(block)
    if result:
        cache_file.write_text(result.model_dump_json())
    return result
```

Hash the input, check disk first. On a rerun, every block you've already seen costs zero tokens.
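Switching main() over is a one-line change; a sketch of the same loop, now cache-aware:

```python
async def main():
    html = await fetch_html("https://news.ycombinator.com/")
    blocks = extract_text_blocks(html)[:10]
    # Only new or changed blocks reach the API; everything else is a disk read.
    stories = await asyncio.gather(*(parse_story_cached(b) for b in blocks))
    for s in filter(None, stories):
        print(f"{s.rank:2}. {s.title} ({s.points} pts, {s.comments} comments)")
```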
| Pure CSS selectors | CSS + LLM extraction |
|---|---|
| Breaks when layout changes | Survives layout changes |
| Fast + free | Slower + costs tokens |
| Great for stable APIs | Great for messy HTML |
| Example: a JSON endpoint | Example: scraping blog posts |
Big idea: don't fight HTML with more regex. Get it into plain text, then let an LLM with a typed schema do the hard part.