Build It: Python Web Scraper With AI-Parsed Output

Setup: a typed model for a story and an async HTML fetcher.

python

# pyproject.toml dependencies:
#   httpx, beautifulsoup4, anthropic, pydantic
import asyncio

import httpx
from bs4 import BeautifulSoup
from pydantic import BaseModel
from anthropic import AsyncAnthropic


class Story(BaseModel):
    rank: int
    title: str
    url: str | None
    points: int
    comments: int


async def fetch_html(url: str) -> str:
    headers = {"User-Agent": "tendril-scraper/1.0"}
    async with httpx.AsyncClient(headers=headers, timeout=10) as client:
        r = await client.get(url)
        r.raise_for_status()
        return r.text

BeautifulSoup gets us close, but we hand the messy text to an LLM for final structuring.

python

def extract_text_blocks(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # HN wraps stories in <tr class="athing">; we'll just get the visible text of each row
    blocks = []
    for row in soup.select("tr.athing"):
        # The row that follows holds the points and comment count
        sibling = row.find_next_sibling("tr")
        text = row.get_text(" ", strip=True)
        if sibling:
            text += " " + sibling.get_text(" ", strip=True)
        link_tag = row.select_one(".titleline > a")
        url = link_tag.get("href") if link_tag else None
        blocks.append(f"URL={url}\n{text}")
    return blocks

Pydantic validates the LLM's JSON output. Bad output raises a ValidationError, which parse_story catches and turns into None.

python

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment


async def parse_story(block: str) -> Story | None:
    prompt = f"""Extract fields from this Hacker News row as JSON.
Fields: rank (int), title (str), url (str or null), points (int), comments (int).
Return ONLY valid JSON, no preface.

<row>
{block}
</row>"""
    try:
        response = await client.messages.create(
            model="claude-opus-4-7",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        # strip potential code-fence
        if raw.startswith("```"):
            raw = raw.strip("`").split("\n", 1)[1].rsplit("\n", 1)[0]
        return Story.model_validate_json(raw)
    except Exception as e:
        # Covers API errors, malformed fences, and Pydantic validation failures
        print(f"parse failed: {e}")
        return None


async def main():
    html = await fetch_html("https://news.ycombinator.com/")
    blocks = extract_text_blocks(html)[:10]
    # Parse the first ten rows concurrently
    stories = await asyncio.gather(*(parse_story(b) for b in blocks))
    for s in filter(None, stories):
        print(f"{s.rank:2}. {s.title} ({s.points} pts, {s.comments} comments)")


if __name__ == "__main__":
    asyncio.run(main())
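The fence-stripping step inside parse_story is easy to get subtly wrong, and it is hard to test while it sits inline next to an API call. One option is to pull it out into a small function. The sketch below is an assumption about how that could look, not part of the lesson's code: the name strip_code_fence and the regular expression are hypothetical, and it only handles a reply that is wrapped entirely in a single fence.

```python
import json
import re

# Hypothetical helper, not part of the lesson's code: extracts the body of a
# Markdown code fence when the whole reply is wrapped in one.
_FENCE = re.compile(r"^```[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_code_fence(raw: str) -> str:
    """Return the text inside a surrounding ``` fence, or the input unchanged."""
    raw = raw.strip()
    match = _FENCE.match(raw)
    return match.group(1).strip() if match else raw


# Made-up replies for illustration only.
fenced = '```json\n{"rank": 1, "title": "Example"}\n```'
bare = '{"rank": 1, "title": "Example"}'

# Both forms reduce to the same JSON text.
assert strip_code_fence(fenced) == bare
assert strip_code_fence(bare) == bare
print(json.loads(strip_code_fence(fenced)))
```

Because the helper is a pure function, it can be checked against sample replies without spending tokens. Replies it does not recognize pass through unchanged and are left for Pydantic to reject.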

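Hash the input and check disk first. On a rerun, any row whose text is unchanged is read from disk instead of being sent to the LLM, so you only pay for rows that are new or have changed.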

python

import hashlib
from pathlib import Path

CACHE = Path(".cache")
CACHE.mkdir(exist_ok=True)


async def parse_story_cached(block: str) -> Story | None:
    # Key the cache on the exact row text
    key = hashlib.sha256(block.encode()).hexdigest()[:16]
    cache_file = CACHE / f"{key}.json"
    if cache_file.exists():
        return Story.model_validate_json(cache_file.read_text())
    result = await parse_story(block)
    if result:
        # Only successful parses are cached; failures are retried next run
        cache_file.write_text(result.model_dump_json())
    return result
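Since the cache key is a hash of the full row text, it helps to see how sensitive that key is. The sketch below reuses the same derivation as parse_story_cached in a standalone function (the name cache_key and the sample rows are made up for illustration) and shows that a one-point change in the score yields a different key.

```python
import hashlib


def cache_key(block: str) -> str:
    """Same derivation as parse_story_cached: first 16 hex chars of SHA-256."""
    return hashlib.sha256(block.encode()).hexdigest()[:16]


# Made-up sample rows for illustration only.
row_v1 = "URL=https://example.com/post\n1. Example post 120 points 45 comments"
row_v2 = "URL=https://example.com/post\n1. Example post 121 points 45 comments"

assert cache_key(row_v1) == cache_key(row_v1)  # identical text: cache hit
assert cache_key(row_v1) != cache_key(row_v2)  # one extra point: cache miss
print(cache_key(row_v1), cache_key(row_v2))
```

This suggests the cache pays off most when you rerun against saved HTML during development. On a live front page, where points and comment counts keep moving, many rows will likely miss.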

Compare the options

Pure CSS selectors	CSS + LLM extraction
Breaks when layout changes	Survives layout changes
Fast + free	Slower + costs tokens
Great for stable APIs	Great for messy HTML
Example: a JSON endpoint	Example: scraping blog posts

