Build It: A Daily Data Pipeline With LLM Enrichment

Pull data from an API, clean it with pandas, ask Claude to enrich each row, save to SQLite. The pattern powers most data-engineering AI work.

70 min · Reviewed 2026

What we're building

A script that fetches the current top stories from the Hacker News API, loads them into a DataFrame, asks Claude to classify each headline as positive, negative, or neutral, and writes the results to a SQLite database. It is idempotent, so you can rerun it safely.

# pyproject.toml: httpx, pandas, anthropic, sqlalchemy
import asyncio
import httpx
import pandas as pd
from sqlalchemy import create_engine, text
from anthropic import AsyncAnthropic

DB_URL = "sqlite:///pipeline.db"
engine = create_engine(DB_URL)

def init_db():
    with engine.begin() as conn:
        conn.execute(text("""
            CREATE TABLE IF NOT EXISTS stories (
                id INTEGER PRIMARY KEY,
                title TEXT NOT NULL,
                url TEXT,
                fetched_at TEXT NOT NULL,
                sentiment TEXT
            )
        """))

Setup: a simple SQLite table. The PRIMARY KEY on id already guarantees one row per story (a separate UNIQUE(id) would be redundant), and that key is what makes reruns safe.

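To see the key constraint at work, here is a minimal sketch that uses the standard-library sqlite3 module in place of SQLAlchemy (a substitution made only to keep the example dependency-free). The sample row is made up. With id as the primary key, a second plain INSERT of the same id is rejected, and that rejection is the conflict an ON CONFLICT(id) clause can later catch.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stories (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        url TEXT,
        fetched_at TEXT NOT NULL,
        sentiment TEXT
    )
""")

# Made-up sample row, shaped like the lesson's table.
row = (1, "Example headline", None, "2026-01-01T00:00:00+00:00", "neutral")
conn.execute("INSERT INTO stories VALUES (?, ?, ?, ?, ?)", row)


def insert_is_rejected(connection, values) -> bool:
    """Return True if a plain INSERT of `values` violates the key constraint."""
    try:
        connection.execute("INSERT INTO stories VALUES (?, ?, ?, ?, ?)", values)
    except sqlite3.IntegrityError:
        return True
    return False


duplicate_rejected = insert_is_rejected(conn, row)
row_count = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
print(duplicate_rejected, row_count)
```

The table still holds a single row after the second attempt, so a rerun cannot silently create a duplicate.
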
async def fetch_stories() -> pd.DataFrame:
    base = "https://hacker-news.firebaseio.com/v0"
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(f"{base}/topstories.json")
        resp.raise_for_status()  # fail loudly on HTTP errors
        ids = resp.json()[:20]
        async def get_one(sid):
            r = await client.get(f"{base}/item/{sid}.json")
            r.raise_for_status()
            return r.json()
        raw = await asyncio.gather(*(get_one(i) for i in ids))
    # Skip empty payloads; reindex keeps all three columns even if no story has a url.
    df = pd.DataFrame([s for s in raw if s]).reindex(columns=["id", "title", "url"])
    df = df.dropna(subset=["title"])
    df["fetched_at"] = pd.Timestamp.now(tz="UTC").isoformat()  # utcnow() is deprecated in recent pandas
    return df

Concurrent fetch of 20 stories into a DataFrame. pandas handles the shape.
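
The cleaning step can be illustrated without pandas. The sketch below assumes an item payload may be null, may lack a title, or may lack a url (text posts are one plausible case); the sample payloads and the helper name clean_items are hypothetical.

```python
from typing import Any, Optional


def clean_items(raw: list[Optional[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Keep only items with a title, and give every row the same three keys."""
    rows = []
    for item in raw:
        if not item or not item.get("title"):
            continue  # skip null payloads and items with no title
        rows.append({
            "id": item["id"],
            "title": item["title"],
            "url": item.get("url"),  # None when the item has no url
        })
    return rows


# Made-up sample payloads, shaped like the item JSON fetched above.
sample = [
    {"id": 1, "title": "Show HN: a thing", "url": "https://example.com/a", "score": 10},
    None,
    {"id": 2, "title": "Ask HN: a question"},
    {"id": 3, "deleted": True},
]
cleaned = clean_items(sample)
print(cleaned)
```

Every surviving row has exactly the columns the stories table expects, which keeps the later insert free of missing-key errors.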

client = AsyncAnthropic()
sem = asyncio.Semaphore(5)

async def classify(title: str) -> str:
    async with sem:
        try:
            r = await client.messages.create(
                model="claude-haiku-4-5",   # cheaper for bulk classification
                max_tokens=10,
                messages=[{
                    "role": "user",
                    "content": f"Classify sentiment of this headline as exactly one word: positive, negative, or neutral. No other text.\n\nHeadline: {title}"
                }],
            )
            # Normalize case and stray punctuation so "Positive." still counts as "positive".
            word = r.content[0].text.strip().lower().strip(".!\"' ")
            return word if word in {"positive", "negative", "neutral"} else "neutral"
        except Exception as e:
            print(f"classify failed: {e}")
            return "unknown"

async def enrich(df: pd.DataFrame) -> pd.DataFrame:
    sentiments = await asyncio.gather(*(classify(t) for t in df["title"]))
    df = df.copy()
    df["sentiment"] = sentiments
    return df

Haiku costs several times less than Opus, which makes it a good fit for bulk labeling. The semaphore caps concurrency at five in-flight requests.
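
To check that a semaphore really caps concurrency, the sketch below swaps the API call for a hypothetical fake_classify that just sleeps, and records the peak number of calls in flight. Nothing here touches the network; the 0.01 s delay is an arbitrary stand-in for latency.

```python
import asyncio

LIMIT = 5  # same cap as the lesson's semaphore


async def run_batch(n_tasks: int) -> int:
    """Run n_tasks fake calls and return the peak number in flight at once."""
    sem = asyncio.Semaphore(LIMIT)
    in_flight = 0
    peak = 0

    async def fake_classify(i: int) -> str:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network latency
            in_flight -= 1
            return "neutral"

    results = await asyncio.gather(*(fake_classify(i) for i in range(n_tasks)))
    assert len(results) == n_tasks  # gather returns one result per task, in order
    return peak


peak_in_flight = asyncio.run(run_batch(20))
print(peak_in_flight)
```

All 20 tasks are created up front by asyncio.gather, but only LIMIT of them are ever inside the `async with sem` block at the same time.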

def upsert(df: pd.DataFrame) -> int:
    sql = text("""
        INSERT INTO stories (id, title, url, fetched_at, sentiment)
        VALUES (:id, :title, :url, :fetched_at, :sentiment)
        ON CONFLICT(id) DO UPDATE SET
            sentiment = excluded.sentiment,
            fetched_at = excluded.fetched_at
    """)
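    rows = df.to_dict(orient="records")
    if not rows:
        return 0  # nothing fetched, so there is nothing to write
    with engine.begin() as conn:
        conn.execute(sql, rows)  # a list of dicts runs as one batched insert
    return len(rows)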

async def main():
    init_db()
    df = await fetch_stories()
    df = await enrich(df)
    n = upsert(df)
    print(f"Inserted/updated {n} rows.")
    # Read back the most recently fetched negatives
    with engine.connect() as conn:
        neg = pd.read_sql(
            text("SELECT title FROM stories WHERE sentiment = 'negative' "
                 "ORDER BY fetched_at DESC LIMIT 5"),
            conn,
        )
    print("\nRecent negative headlines:")
    print(neg.to_string(index=False))

if __name__ == "__main__":
    asyncio.run(main())

Upsert with ON CONFLICT: the idempotency trick. You can rerun all day without creating duplicates.
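
The rerun behaviour can be checked in isolation. This sketch runs the same upsert statement twice through the standard-library sqlite3 module (used here instead of SQLAlchemy only to stay dependency-free) against an in-memory table. The rows are made up, and the helper name upsert_rows is hypothetical.

```python
import sqlite3

UPSERT_SQL = """
    INSERT INTO stories (id, title, url, fetched_at, sentiment)
    VALUES (:id, :title, :url, :fetched_at, :sentiment)
    ON CONFLICT(id) DO UPDATE SET
        sentiment = excluded.sentiment,
        fetched_at = excluded.fetched_at
"""

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stories (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        url TEXT,
        fetched_at TEXT NOT NULL,
        sentiment TEXT
    )
""")


def upsert_rows(rows: list[dict]) -> int:
    """Upsert rows in one transaction and return the table's row count."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, rows)
    return conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]


# Two made-up runs over the same two story ids.
run_1 = [
    {"id": 1, "title": "A", "url": None, "fetched_at": "2026-01-01T00:00:00+00:00", "sentiment": "neutral"},
    {"id": 2, "title": "B", "url": None, "fetched_at": "2026-01-01T00:00:00+00:00", "sentiment": "positive"},
]
run_2 = [
    {"id": 1, "title": "A", "url": None, "fetched_at": "2026-01-02T00:00:00+00:00", "sentiment": "negative"},
    {"id": 2, "title": "B", "url": None, "fetched_at": "2026-01-02T00:00:00+00:00", "sentiment": "positive"},
]

count_after_first = upsert_rows(run_1)
count_after_second = upsert_rows(run_2)
sentiment_of_1 = conn.execute("SELECT sentiment FROM stories WHERE id = 1").fetchone()[0]
print(count_after_first, count_after_second, sentiment_of_1)
```

The row count is the same after both runs, while sentiment and fetched_at take the values from the latest run. Note that this means reruns are duplicate-free rather than byte-for-byte identical, since the timestamp (and possibly the label) can change.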

Cost math

Haiku 4.5: $1/M input, $5/M output tokens (2026 pricing)
Each classify call: ~60 input, ~5 output tokens
20 stories ≈ 1300 tokens (1200 input + 100 output) ≈ $0.0017
Running daily for a year ≈ $0.62. Pocket-change budget.
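
Working the estimate through with input and output tokens priced separately gives a figure in the same ballpark. All constants below are the lesson's approximations, not measured values.

```python
# Prices in USD per million tokens, as quoted in the lesson.
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 5.00

# Approximate token counts per classify call, as quoted in the lesson.
INPUT_TOKENS_PER_CALL = 60
OUTPUT_TOKENS_PER_CALL = 5


def cost_per_run(n_stories: int) -> float:
    """Estimated USD cost of classifying n_stories headlines once."""
    input_cost = n_stories * INPUT_TOKENS_PER_CALL * INPUT_PRICE_PER_M / 1_000_000
    output_cost = n_stories * OUTPUT_TOKENS_PER_CALL * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


run_cost = cost_per_run(20)
yearly_cost = run_cost * 365
print(f"per run: ${run_cost:.4f}, per year: ${yearly_cost:.2f}")
```

With these inputs the estimate comes to about $0.0017 per run and about $0.62 per year. Output tokens are few but cost five times as much each, so they contribute roughly 30% of the total.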

Mini-exercise

Add a 'topic' column — ask Claude to pick one of: tech, politics, science, other
Add a scheduled run using GitHub Actions or cron
Export a weekly sentiment breakdown chart using pandas + matplotlib
Track cost: log tokens used per run into a costs table
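
As a starting point for the cost-tracking exercise, here is one possible shape for a costs table, sketched with the standard-library sqlite3 module. The table layout, the helper name log_run, and the token numbers are all hypothetical. In the real pipeline the counts would be summed from each response's usage.input_tokens and usage.output_tokens, which is where the Anthropic SDK reports them as far as I know.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory for the sketch; the pipeline would point at pipeline.db instead.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS costs (
        run_at TEXT NOT NULL,
        input_tokens INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL,
        usd REAL NOT NULL
    )
""")


def log_run(connection, input_tokens: int, output_tokens: int,
            in_price: float = 1.0, out_price: float = 5.0) -> float:
    """Insert one cost row (prices in USD per million tokens) and return the USD cost."""
    usd = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    run_at = datetime.now(timezone.utc).isoformat()
    with connection:  # commit the insert
        connection.execute(
            "INSERT INTO costs VALUES (?, ?, ?, ?)",
            (run_at, input_tokens, output_tokens, usd),
        )
    return usd


# Made-up totals for a single 20-story run.
logged_usd = log_run(conn, input_tokens=1200, output_tokens=100)
total_usd = conn.execute("SELECT SUM(usd) FROM costs").fetchone()[0]
print(logged_usd, total_usd)
```

Unlike the stories table, this one is append-only on purpose: each run adds a row, so a plain INSERT is the right tool here rather than an upsert.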

Sync pandas loop	Async concurrent LLM calls
20 stories × 1s = 20s	~4s with concurrency 5 (four waves of five)
Simple code	Needs asyncio.gather and a semaphore
Good for: prototyping	Good for: anything bigger than 10 rows
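
The timing difference in the comparison can be reproduced with simulated calls. In this sketch the hypothetical fake_call sleeps for a fixed 0.02 s instead of calling an API, so the numbers only illustrate the ratio, not real latency.

```python
import asyncio
import time

LATENCY = 0.02     # simulated seconds per call (arbitrary)
N_CALLS = 20
CONCURRENCY = 5


async def fake_call() -> None:
    await asyncio.sleep(LATENCY)  # stand-in for one LLM request


async def run_sequential(n: int) -> float:
    """Await the calls one after another and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        await fake_call()
    return time.perf_counter() - start


async def run_concurrent(n: int, limit: int) -> float:
    """Run the calls through gather with a semaphore cap and return elapsed seconds."""
    sem = asyncio.Semaphore(limit)

    async def guarded() -> None:
        async with sem:
            await fake_call()

    start = time.perf_counter()
    await asyncio.gather(*(guarded() for _ in range(n)))
    return time.perf_counter() - start


seq_time = asyncio.run(run_sequential(N_CALLS))
conc_time = asyncio.run(run_concurrent(N_CALLS, CONCURRENCY))
print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

With a cap of 5, 20 equal-latency calls run in four waves, so the ideal speedup is about 5x. Real requests vary in latency, so measured gains will differ.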

Big idea: a data pipeline is fetch → transform → load. LLM enrichment slots in as another transform. Make each step idempotent and you can cron it forever.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-prog-python-data-pipeline-creators

What is the core idea behind "Build It: A Daily Data Pipeline With LLM Enrichment"?
1. Pull data from an API, clean it with pandas, ask Claude to enrich each row, save to SQLite. The pattern powers most data-engineering AI work.
2. Add a docstring
3. Return the sum of items times (1 + tax_rate), rounded to 2 decimals
4. An argument is the value you pass when calling ("Maya", 5)
Which term best describes a foundational idea in "Build It: A Daily Data Pipeline With LLM Enrichment"?
1. pandas
2. ETL
3. SQLite
4. upsert
A learner studying Build It: A Daily Data Pipeline With LLM Enrichment would need to understand which concept?
1. ETL
2. SQLite
3. pandas
4. upsert
Which of these is directly relevant to Build It: A Daily Data Pipeline With LLM Enrichment?
1. ETL
2. pandas
3. upsert
4. SQLite
Which of the following is a key point about Build It: A Daily Data Pipeline With LLM Enrichment?
1. Haiku 4.5: $1/M input, $5/M output tokens (2026 pricing)
2. Each classify call: ~60 input, ~5 output tokens
3. 20 stories ≈ 1300 tokens ≈ $0.0017
4. Running daily for a year ≈ $0.62. Pocket-change budget.
Which of these does NOT belong in a discussion of Build It: A Daily Data Pipeline With LLM Enrichment?
1. 20 stories ≈ 1300 tokens ≈ $0.0017
2. Haiku 4.5: $1/M input, $5/M output tokens (2026 pricing)
3. Add a docstring
4. Each classify call: ~60 input, ~5 output tokens
Which statement is accurate regarding Build It: A Daily Data Pipeline With LLM Enrichment?
1. Add a scheduled run using GitHub Actions or cron
2. Export a weekly sentiment breakdown chart using pandas + matplotlib
3. Add a 'topic' column — ask Claude to pick one of: tech, politics, science, other
4. Track cost: log tokens used per run into a costs table
Which of these does NOT belong in a discussion of Build It: A Daily Data Pipeline With LLM Enrichment?
1. Add a scheduled run using GitHub Actions or cron
2. Add a 'topic' column — ask Claude to pick one of: tech, politics, science, other
3. Add a docstring
4. Export a weekly sentiment breakdown chart using pandas + matplotlib
What is the key insight about "Idempotency is the hallmark of good pipelines" in the context of Build It: A Daily Data Pipeline With LLM Enrichment?
1. A pipeline should produce the same result whether it runs once or ten times.
2. Add a docstring
3. Return the sum of items times (1 + tax_rate), rounded to 2 decimals
4. An argument is the value you pass when calling ("Maya", 5)
What is the recommended tip about "Always review AI output" in the context of Build It: A Daily Data Pipeline With LLM Enrichment?
1. Add a docstring
2. AI-generated code can hallucinate APIs, miss edge cases, or introduce subtle bugs.
3. Return the sum of items times (1 + tax_rate), rounded to 2 decimals
4. An argument is the value you pass when calling ("Maya", 5)
Which statement accurately describes an aspect of Build It: A Daily Data Pipeline With LLM Enrichment?
1. Add a docstring
2. Return the sum of items times (1 + tax_rate), rounded to 2 decimals
3. A script that: fetches the current top stories from an API, loads them into a DataFrame, asks Claude to classify each as positive/negative/n…
4. An argument is the value you pass when calling ("Maya", 5)
What does working with Build It: A Daily Data Pipeline With LLM Enrichment typically involve?
1. Add a docstring
2. Return the sum of items times (1 + tax_rate), rounded to 2 decimals
3. An argument is the value you pass when calling ("Maya", 5)
4. Big idea: a data pipeline is fetch → transform → load. LLM enrichment slots in as another transform.
Which best describes the scope of "Build It: A Daily Data Pipeline With LLM Enrichment"?
1. It focuses on Pull data from an API, clean it with pandas, ask Claude to enrich each row, save to SQLite. The patt
2. It is unrelated to ai-coding workflows
3. It applies only to the opposite beginner tier
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about Build It: A Daily Data Pipeline With LLM Enrichment?
1. Add a docstring
2. Cost math
3. Return the sum of items times (1 + tax_rate), rounded to 2 decimals
4. An argument is the value you pass when calling ("Maya", 5)
Which section heading best belongs in a lesson about Build It: A Daily Data Pipeline With LLM Enrichment?
1. Add a docstring
2. Return the sum of items times (1 + tax_rate), rounded to 2 decimals
3. Mini-exercise
4. An argument is the value you pass when calling ("Maya", 5)


Tendril · Creators · AI-Assisted Coding
