Scrape a site with httpx and BeautifulSoup, then hand messy text to Claude for structured extraction. A full project in 60 minutes.
We'll scrape the Hacker News front page, grab the HTML, and use Claude to extract each story into a clean, typed object. This "scrape + LLM-extract" pattern beats brittle CSS selectors whenever the page layout could change.
```python
# pyproject.toml dependencies:
# httpx, beautifulsoup4, anthropic, pydantic
import asyncio

import httpx
from bs4 import BeautifulSoup
from pydantic import BaseModel
from anthropic import AsyncAnthropic


class Story(BaseModel):
    rank: int
    title: str
    url: str | None
    points: int
    comments: int


async def fetch_html(url: str) -> str:
    headers = {"User-Agent": "tendril-scraper/1.0"}
    async with httpx.AsyncClient(headers=headers, timeout=10) as client:
        r = await client.get(url)
        r.raise_for_status()
        return r.text
```

Setup: typed model for a story, async HTML fetcher.
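One habit worth wiring in before fetching anything: respect the site's robots.txt. A minimal check with the standard library (a sketch, not part of the core flow; it's synchronous, so run it once at startup, and the user agent matches our fetcher's):

```python
from urllib.robotparser import RobotFileParser

# Bail out early if robots.txt disallows our user agent on this page.
rp = RobotFileParser()
rp.set_url("https://news.ycombinator.com/robots.txt")
rp.read()
if not rp.can_fetch("tendril-scraper/1.0", "https://news.ycombinator.com/"):
    raise SystemExit("robots.txt disallows fetching the front page")
```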
```python
def extract_text_blocks(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # HN wraps stories in <tr class="athing">; we'll just get the visible text of each row
    blocks = []
    for row in soup.select("tr.athing"):
        sibling = row.find_next_sibling("tr")
        text = row.get_text(" ", strip=True)
        if sibling:
            text += " " + sibling.get_text(" ", strip=True)
        link_tag = row.select_one(".titleline > a")
        url = link_tag.get("href") if link_tag else None
        blocks.append(f"URL={url}\n{text}")
    return blocks
```

BeautifulSoup gets us close — but we hand the messy text to an LLM for final structuring.
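It helps to eyeball one block before involving the model. The exact text depends on the live page; the shape is roughly this (illustrative values, not real output):

```python
# Quick look at what the LLM will actually receive.
html = asyncio.run(fetch_html("https://news.ycombinator.com/"))
print(extract_text_blocks(html)[0])
# Illustrative shape, not real data:
# URL=https://example.com/post
# 1. Example story title (example.com) 123 points by someone 2 hours ago | hide | 99 comments
```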
```python
client = AsyncAnthropic()


async def parse_story(block: str) -> Story | None:
    prompt = f"""Extract fields from this Hacker News row as JSON.
Fields: rank (int), title (str), url (str or null), points (int), comments (int).
Return ONLY valid JSON, no preface.
<row>
{block}
</row>"""
    try:
        response = await client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        # Strip a potential markdown code fence around the JSON.
        if raw.startswith("```"):
            raw = raw.strip("`").split("\n", 1)[1].rsplit("\n", 1)[0]
        return Story.model_validate_json(raw)
    except Exception as e:
        print(f"parse failed: {e}")
        return None


async def main():
    html = await fetch_html("https://news.ycombinator.com/")
    blocks = extract_text_blocks(html)[:10]
    stories = await asyncio.gather(*(parse_story(b) for b in blocks))
    for s in filter(None, stories):
        print(f"{s.rank:2}. {s.title} ({s.points} pts, {s.comments} comments)")


asyncio.run(main())
```

Pydantic validates the LLM's JSON output. Bad output -> ValidationError -> None.
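To see that guardrail in isolation, feed the schema a good and a bad payload by hand (both strings here are made-up examples):

```python
from pydantic import ValidationError

good = '{"rank": 1, "title": "Example story", "url": null, "points": 42, "comments": 7}'
bad = '{"rank": "first", "title": "oops"}'  # wrong type for rank, three fields missing

print(Story.model_validate_json(good))  # Story(rank=1, title='Example story', ...)
try:
    Story.model_validate_json(bad)
except ValidationError as e:
    print(f"rejected with {e.error_count()} errors")
```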
```python
import hashlib
from pathlib import Path

CACHE = Path(".cache")
CACHE.mkdir(exist_ok=True)


async def parse_story_cached(block: str) -> Story | None:
    # Key the cache on a hash of the input block itself: if the row's text
    # changes (new points, new comments), the key changes with it.
    key = hashlib.sha256(block.encode()).hexdigest()[:16]
    cache_file = CACHE / f"{key}.json"
    if cache_file.exists():
        return Story.model_validate_json(cache_file.read_text())
    result = await parse_story(block)
    if result:
        cache_file.write_text(result.model_dump_json())
    return result
```

Hash the input, check disk first. On a rerun, every block you've already seen costs zero tokens.
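Switching main() over is a one-line change; a sketch of the same loop, now cache-aware:

```python
async def main():
    html = await fetch_html("https://news.ycombinator.com/")
    blocks = extract_text_blocks(html)[:10]
    # Only new or changed blocks reach the API; everything else is a disk read.
    stories = await asyncio.gather(*(parse_story_cached(b) for b in blocks))
    for s in filter(None, stories):
        print(f"{s.rank:2}. {s.title} ({s.points} pts, {s.comments} comments)")
```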
| Pure CSS selectors | CSS + LLM extraction |
|---|---|
| Breaks when layout changes | Survives layout changes |
| Fast + free | Slower + costs tokens |
| Great for stable APIs | Great for messy HTML |
| Example: a JSON endpoint | Example: scraping blog posts |
Big idea: don't fight HTML with more regex. Get it into plain text, then let an LLM with a typed schema do the hard part.