Build It: Python Web Scraper With AI-Parsed Output
Scrape a site with httpx and BeautifulSoup, then hand messy text to Claude for structured extraction. A full project in 60 minutes.
Creators · AI-Assisted Coding · ~36 min read
The project
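We'll fetch the Hacker News front page, pull the visible text of each story row out of the HTML, and use Claude to turn each row into a clean, typed object. This 'scrape + LLM-extract' pattern is more resilient than writing a CSS selector for every field, because per-field selectors are the first thing to break when the page layout changes.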
Setup: typed model for a story, async HTML fetcher.
    # pyproject.toml dependencies:
    # httpx, beautifulsoup4, anthropic, pydantic
    import asyncio

    import httpx
    from anthropic import AsyncAnthropic
    from bs4 import BeautifulSoup
    from pydantic import BaseModel


    class Story(BaseModel):
        rank: int
        title: str
        url: str | None
        points: int
        comments: int


    async def fetch_html(url: str) -> str:
        # Identify the scraper and fail fast on slow or error responses.
        headers = {"User-Agent": "tendril-scraper/1.0"}
        async with httpx.AsyncClient(headers=headers, timeout=10) as client:
            r = await client.get(url)
            r.raise_for_status()
            return r.text

BeautifulSoup gets us close, but we hand the messy text to an LLM for final structuring.
    def extract_text_blocks(html: str) -> list[str]:
        soup = BeautifulSoup(html, "html.parser")
        # HN wraps each story in <tr class="athing">; we'll just get the visible
        # text of that row plus the row that follows it.
        blocks = []
        for row in soup.select("tr.athing"):
            sibling = row.find_next_sibling("tr")
            text = row.get_text(" ", strip=True)
            if sibling:
                text += " " + sibling.get_text(" ", strip=True)
            # Pass the link separately, since get_text() drops href attributes.
            link_tag = row.select_one(".titleline > a")
            url = link_tag.get("href") if link_tag else None
            blocks.append(f"URL={url}\n{text}")
        return blocks

Pydantic validates the LLM's JSON output. If the output does not match the schema, validation raises a ValidationError, and parse_story returns None instead of a Story.
    client = AsyncAnthropic()


    async def parse_story(block: str) -> Story | None:
        prompt = f"""Extract fields from this Hacker News row as JSON.
    Fields: rank (int), title (str), url (str or null), points (int), comments (int).
    Return ONLY valid JSON, no preface.

    <row>
    {block}
    </row>"""
        try:
            response = await client.messages.create(
                model="claude-opus-4-7",
                max_tokens=300,
                messages=[{"role": "user", "content": prompt}],
            )
            raw = response.content[0].text.strip()
            # strip potential code-fence
            if raw.startswith("```"):
                raw = raw.strip("`").split("\n", 1)[1].rsplit("\n", 1)[0]
            return Story.model_validate_json(raw)
        except Exception as e:
            # Any failure (API error, bad JSON, schema mismatch) becomes None.
            print(f"parse failed: {e}")
            return None


    async def main():
        html = await fetch_html("https://news.ycombinator.com/")
        blocks = extract_text_blocks(html)[:10]
        # Parse the first ten rows concurrently.
        stories = await asyncio.gather(*(parse_story(b) for b in blocks))
        for s in filter(None, stories):
            print(f"{s.rank:2}. {s.title} ({s.points} pts, {s.comments} comments)")


    if __name__ == "__main__":
        asyncio.run(main())

Advanced: caching to avoid paying twice
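Hash each input block and check the disk before calling the API. On a rerun, every block whose text is unchanged is served from the cache and costs nothing; blocks whose text changed (for example, a new points count) are parsed again, so the savings depend on how much of the page changed between runs.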
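To see how the cache key behaves, here is a minimal sketch using only the standard library. The helper name `cache_key` and the two sample rows are hypothetical, made up for illustration; the hashing itself mirrors the truncated SHA-256 used in this lesson.

```python
import hashlib


def cache_key(block: str) -> str:
    """Return a short, stable key: the first 16 hex chars of the SHA-256 digest."""
    return hashlib.sha256(block.encode()).hexdigest()[:16]


# Two hypothetical rows that differ only in the points count.
row_a = "URL=https://example.com\n1. Example story 120 points 45 comments"
row_b = "URL=https://example.com\n1. Example story 121 points 45 comments"

same = cache_key(row_a) == cache_key(row_a)     # identical text, same key: cache hit
changed = cache_key(row_a) == cache_key(row_b)  # one character differs: cache miss
print(same, changed)
```

Because points and comment counts on a live front page change often, many rows may miss the cache on a later run. One possible trade-off, if stale counts are acceptable, is to hash only the stable parts of a row, such as the URL and title.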
    import hashlib
    import json
    from pathlib import Path

    CACHE = Path(".cache")
    CACHE.mkdir(exist_ok=True)


    async def parse_story_cached(block: str) -> Story | None:
        # Key the cache on the exact input text.
        key = hashlib.sha256(block.encode()).hexdigest()[:16]
        cache_file = CACHE / f"{key}.json"
        if cache_file.exists():
            return Story.model_validate_json(cache_file.read_text())
        result = await parse_story(block)
        # Only successful parses are cached; failures are retried next run.
        if result:
            cache_file.write_text(result.model_dump_json())
        return result

Mini-exercise
1. Run the scraper on a different site's RSS feed.
2. Add a 'tags' field (list[str]) to Story and update the prompt.
3. Save results to stories.jsonl (one JSON object per line).
4. Measure: how much did caching save on the second run?
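As a starting point for the JSONL exercise, here is a minimal sketch of writing and reading one JSON object per line with the standard library. The helper names `save_jsonl` and `load_jsonl` and the sample records are hypothetical, and the file goes into a temporary directory so the sketch does not overwrite anything in your project.

```python
import json
import tempfile
from pathlib import Path


def save_jsonl(records: list[dict], path: Path) -> None:
    """Write each record as one JSON object on its own line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path: Path) -> list[dict]:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records standing in for parsed stories.
stories = [
    {"rank": 1, "title": "Example story", "url": "https://example.com",
     "points": 120, "comments": 45},
    {"rank": 2, "title": "Ask HN: example", "url": None,
     "points": 30, "comments": 12},
]

out_path = Path(tempfile.mkdtemp()) / "stories.jsonl"
save_jsonl(stories, out_path)
roundtrip = load_jsonl(out_path)
print(len(roundtrip), "records written to", out_path)
```

With the Story model from this lesson, each line could instead be the string returned by `model_dump_json()`, which the caching code already uses.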
Compare the options
| Pure CSS selectors | CSS + LLM extraction |
|---|---|
| Breaks when layout changes | Survives layout changes |
| Fast + free | Slower + costs tokens |
| Great for stable, predictable markup | Great for messy HTML |
| Example: a site whose templates rarely change | Example: scraping blog posts |
Big idea: don't fight HTML with more regex. Get it into plain text, then let an LLM with a typed schema do the hard part.
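The typed-schema step only works if the model's reply is parseable JSON in the first place, and replies sometimes arrive wrapped in a Markdown code fence. The sketch below shows one way to tolerate that before validation, using only the standard library. The names `extract_json` and `FENCE` are hypothetical, and this is an illustrative alternative, not the fence handling used in parse_story.

```python
import json
from typing import Optional

FENCE = "`" * 3  # a Markdown code fence, built here to keep the literal out of the source


def extract_json(raw: str) -> Optional[dict]:
    """Parse a model reply as a JSON object, tolerating an optional code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json".
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Only a JSON object can become a Story; lists and scalars are rejected.
    return data if isinstance(data, dict) else None


fenced_reply = FENCE + 'json\n{"rank": 1, "title": "Example story"}\n' + FENCE
print(extract_json(fenced_reply))
print(extract_json("Sorry, I could not read that row."))
```

Returning None for anything that is not a JSON object keeps the same contract as the rest of the pipeline: bad output is dropped instead of crashing the run.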
