Scrape a site with httpx and BeautifulSoup, then hand messy text to Claude for structured extraction. A full project in 60 minutes.
We'll fetch the Hacker News front page, pull the text of each story row out of the HTML, and use Claude to extract each story into a clean typed object. This 'scrape + LLM-extract' pattern beats brittle CSS selectors whenever the page layout could change.
```python
# pyproject.toml dependencies
# httpx, beautifulsoup4, anthropic, pydantic
import asyncio

import httpx
from bs4 import BeautifulSoup
from pydantic import BaseModel
from anthropic import AsyncAnthropic


class Story(BaseModel):
    rank: int
    title: str
    url: str | None  # required field, but may be null
    points: int
    comments: int


async def fetch_html(url: str) -> str:
    headers = {"User-Agent": "tendril-scraper/1.0"}
    async with httpx.AsyncClient(headers=headers, timeout=10) as client:
        r = await client.get(url)
        r.raise_for_status()  # fail loudly on 4xx/5xx responses
        return r.text
```

Setup: typed model for a story, async HTML fetcher.

```python
def extract_text_blocks(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # HN wraps stories in <tr class="athing">; we'll just get the visible text of each row
    blocks = []
    for row in soup.select("tr.athing"):
        # The row that follows holds the points and comment count
        sibling = row.find_next_sibling("tr")
        text = row.get_text(" ", strip=True)
        if sibling:
            text += " " + sibling.get_text(" ", strip=True)
        link_tag = row.select_one(".titleline > a")
        url = link_tag.get("href") if link_tag else None
        blocks.append(f"URL={url}\n{text}")
    return blocks
```

BeautifulSoup gets us close, but we hand the messy text to an LLM for final structuring.

```python
client = AsyncAnthropic()


async def parse_story(block: str) -> Story | None:
    prompt = f"""Extract fields from this Hacker News row as JSON.
Fields: rank (int), title (str), url (str or null), points (int), comments (int).
Return ONLY valid JSON, no preface.

<row>
{block}
</row>"""
    try:
        response = await client.messages.create(
            model="claude-opus-4-7",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        # strip potential code-fence
        if raw.startswith("```"):
            raw = raw.strip("`").split("\n", 1)[1].rsplit("\n", 1)[0]
        return Story.model_validate_json(raw)
    except Exception as e:
        # Covers API errors as well as JSON that fails validation
        print(f"parse failed: {e}")
        return None


async def main():
    html = await fetch_html("https://news.ycombinator.com/")
    blocks = extract_text_blocks(html)[:10]
    # Parse the first ten rows concurrently
    stories = await asyncio.gather(*(parse_story(b) for b in blocks))
    for s in filter(None, stories):
        print(f"{s.rank:2}. {s.title} ({s.points} pts, {s.comments} comments)")


asyncio.run(main())
```

Pydantic validates the LLM's JSON output. Bad output raises a ValidationError, which parse_story catches and turns into None.

```python
import hashlib
from pathlib import Path

CACHE = Path(".cache")
CACHE.mkdir(exist_ok=True)


async def parse_story_cached(block: str) -> Story | None:
    # Key the cache on the exact input text
    key = hashlib.sha256(block.encode()).hexdigest()[:16]
    cache_file = CACHE / f"{key}.json"
    if cache_file.exists():
        return Story.model_validate_json(cache_file.read_text())
    result = await parse_story(block)
    if result:
        cache_file.write_text(result.model_dump_json())
    return result
```

Hash the input, check disk first. Rows that were already parsed are served from disk, so a rerun only pays for rows that are new or have changed.

| Pure CSS selectors | CSS + LLM extraction |
|---|---|
| Breaks when layout changes | Survives layout changes |
| Fast + free | Slower + costs tokens |
| Great for stable, predictable markup | Great for messy HTML |
| Example: a page whose structure rarely changes | Example: scraping blog posts |
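
The "survives layout changes" row can be checked without calling a model. The sketch below is a minimal illustration that uses only the standard library; `VisibleText`, `visible_text`, and both sample rows are hypothetical and not part of the lesson code. It shows two differently marked-up versions of one story reducing to the same plain text, which is what the LLM step receives, whereas a selector such as `.titleline > a` would match only the first version.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring tags and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:  # skip whitespace-only nodes between tags
            self.parts.append(stripped)


def visible_text(html: str) -> str:
    """Return the fragment's text nodes joined by single spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical markup: the same story before and after an imagined redesign.
OLD_ROW = (
    '<tr class="athing"><td class="rank">1.</td>'
    '<td><span class="titleline"><a href="https://example.com">Example story</a>'
    "</span></td></tr>"
)
NEW_ROW = (
    '<div class="story"><span>1.</span> '
    '<h3><a href="https://example.com">Example story</a></h3></div>'
)

print(visible_text(OLD_ROW))  # 1. Example story
print(visible_text(NEW_ROW))  # 1. Example story
```

The text is identical in both cases, so a prompt built from it does not need to change when class names or tag nesting do.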
Big idea: don't fight HTML with more regex. Get it into plain text, then let an LLM with a typed schema do the hard part.
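
Before a typed schema can check the model's reply, the reply has to be bare JSON, and models sometimes wrap it in a Markdown code fence. The following is a minimal sketch of a more defensive cleanup step, assuming a hypothetical helper named `strip_code_fence` that is not part of the lesson code. It uses plain string methods and also handles a closing fence that has no newline before it. The single-line branch assumes there is no language tag after the opening fence.

```python
import json

FENCE = "`" * 3  # three backticks, built here so the marker is not written literally


def strip_code_fence(raw: str) -> str:
    """Return raw without a surrounding Markdown code fence, if one is present."""
    text = raw.strip()
    if not text.startswith(FENCE):
        return text  # already bare
    newline = text.find("\n")
    if newline == -1:
        # Single-line reply: assume the fences hug the payload directly.
        return text.strip("`").strip()
    # Drop the opening fence line, including any language tag such as "json".
    body = text[newline + 1:].rstrip()
    # Drop the closing fence whether or not a newline precedes it.
    if body.endswith(FENCE):
        body = body[: -len(FENCE)]
    return body.strip()


# Hypothetical model reply: fenced, with the closing fence on the same line as the JSON.
reply = f'{FENCE}json\n{{"rank": 1, "title": "Example story"}}{FENCE}'
data = json.loads(strip_code_fence(reply))
print(data["rank"], data["title"])  # 1 Example story
```

In `parse_story`, a helper like this would sit between reading `response.content[0].text` and calling `Story.model_validate_json`.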
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-prog-python-scraper-creators
What is the main idea of "Build It: Python Web Scraper With AI-Parsed Output"?
Which concept is most central to "Build It: Python Web Scraper With AI-Parsed Output"?
Which use of AI fits this topic best?
What should a careful learner remember about "Respect the robots"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about scraping be treated?
Name one way to verify an AI answer about scraping.
Which action would help you apply "Build It: Python Web Scraper With AI-Parsed Output" responsibly?