Build It: Python Web Scraper With AI-Parsed Output
Scrape a site with httpx and BeautifulSoup, then hand messy text to Claude for structured extraction. A full project in 60 minutes.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The project
2. Scraping
3. BeautifulSoup
4. httpx
Section 1
The project
We'll scrape the Hacker News front page, grab the HTML, and use Claude to extract each story into a clean typed object. This 'scrape + LLM-extract' pattern beats brittle CSS selectors whenever the page layout could change.
Setup: typed model for a story, async HTML fetcher.
```python
# pyproject.toml dependencies:
# httpx, beautifulsoup4, anthropic, pydantic
import asyncio

import httpx
from anthropic import AsyncAnthropic
from bs4 import BeautifulSoup
from pydantic import BaseModel


class Story(BaseModel):
    rank: int
    title: str
    url: str | None
    points: int
    comments: int


async def fetch_html(url: str) -> str:
    headers = {"User-Agent": "tendril-scraper/1.0"}
    async with httpx.AsyncClient(headers=headers, timeout=10) as client:
        r = await client.get(url)
        r.raise_for_status()
        return r.text
```

BeautifulSoup gets us close — but we hand the messy text to an LLM for final structuring.
```python
def extract_text_blocks(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # HN wraps stories in <tr class="athing">; the next <tr> holds points/comments.
    # We just grab the visible text of each pair of rows.
    blocks = []
    for row in soup.select("tr.athing"):
        sibling = row.find_next_sibling("tr")
        text = row.get_text(" ", strip=True)
        if sibling:
            text += " " + sibling.get_text(" ", strip=True)
        link_tag = row.select_one(".titleline > a")
        url = link_tag.get("href") if link_tag else None
        blocks.append(f"URL={url}\n{text}")
    return blocks
```

Pydantic validates the LLM's JSON output. Bad output -> ValidationError -> None.
```python
client = AsyncAnthropic()


async def parse_story(block: str) -> Story | None:
    prompt = f"""Extract fields from this Hacker News row as JSON.
Fields: rank (int), title (str), url (str or null), points (int), comments (int).
Return ONLY valid JSON, no preface.
<row>
{block}
</row>"""
    try:
        response = await client.messages.create(
            model="claude-opus-4-7",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        # Strip a potential ```json code fence around the response
        if raw.startswith("```"):
            raw = raw.strip("`").split("\n", 1)[1].rsplit("\n", 1)[0]
        return Story.model_validate_json(raw)
    except Exception as e:
        print(f"parse failed: {e}")
        return None


async def main():
    html = await fetch_html("https://news.ycombinator.com/")
    blocks = extract_text_blocks(html)[:10]
    stories = await asyncio.gather(*(parse_story(b) for b in blocks))
    for s in filter(None, stories):
        print(f"{s.rank:2}. {s.title} ({s.points} pts, {s.comments} comments)")


asyncio.run(main())
```

Advanced: caching to avoid paying twice
Hash the input, check disk first. On a rerun over unchanged rows, every parse is a cache hit and costs no tokens at all.
```python
import hashlib
from pathlib import Path

CACHE = Path(".cache")
CACHE.mkdir(exist_ok=True)


async def parse_story_cached(block: str) -> Story | None:
    # Cache key: hash of the raw block, so identical input never hits the API twice
    key = hashlib.sha256(block.encode()).hexdigest()[:16]
    cache_file = CACHE / f"{key}.json"
    if cache_file.exists():
        return Story.model_validate_json(cache_file.read_text())
    result = await parse_story(block)
    if result:
        cache_file.write_text(result.model_dump_json())
    return result
```

Mini-exercise
1. Run the scraper on a different site's RSS feed
2. Add a 'tags' field (list[str]) to Story and update the prompt
3. Save results to stories.jsonl (one JSON object per line)
4. Measure: how much did caching save on the second run?
Compare the options
| Pure CSS selectors | CSS + LLM extraction |
|---|---|
| Breaks when layout changes | Survives layout changes |
| Fast + free | Slower + costs tokens |
| Great for stable APIs | Great for messy HTML |
| Example: a JSON endpoint | Example: scraping blog posts |
Big idea: don't fight HTML with more regex. Get it into plain text, then let an LLM with a typed schema do the hard part.