neural-forge.io

robots.txt and ai.txt: The Web's Consent Signals

A 30-year-old simple text file, robots.txt, is how the web has tried to regulate crawlers. The new ai.txt proposal aims to refine this for the AI era.

A File From 1994

Martijn Koster created the Robots Exclusion Protocol in 1994. The idea is elegantly simple: a website places a plaintext file called robots.txt at its root, and well-behaved crawlers read it and follow its rules. Compliance is voluntary and nothing enforces it, yet the convention has held the web together for three decades.

The basic syntax

A minimal robots.txt

```text
User-agent: *
Allow: /
Disallow: /private
Disallow: /admin

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
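To make the semantics of the minimal file concrete, here is a small sketch of how a compliant crawler might evaluate it. It assumes the longest-match precedence described in RFC 9309 (the most specific matching path wins, with Allow winning ties) and exact, case-insensitive matching of the user-agent token. The helper names `parse_groups` and `is_allowed` are hypothetical, and wildcards such as `*` and `$` inside paths are deliberately not handled.

```python
MINIMAL_ROBOTS = """\
User-agent: *
Allow: /
Disallow: /private
Disallow: /admin

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_groups(robots_txt):
    """Return {lowercased user-agent token: [(directive, path), ...]}."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((field, value))
        # other fields (for example Sitemap) are ignored in this sketch
    return groups


def is_allowed(groups, user_agent, path):
    """Longest matching prefix wins; Allow wins ties; default is allowed."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_allow = len(prefix) == best_len and directive == "allow"
            if longer or tie_allow:
                best_len, allowed = len(prefix), directive == "allow"
    return allowed


groups = parse_groups(MINIMAL_ROBOTS)
print(is_allowed(groups, "SomeBot", "/private/data"))    # False
print(is_allowed(groups, "SomeBot", "/blog"))            # True
print(is_allowed(groups, "Googlebot", "/private/data"))  # True
```

A crawler with its own group follows only that group, which is why Googlebot may fetch /private here even though the wildcard group forbids it. Not every parser uses longest-match precedence; Python's standard `urllib.robotparser`, for instance, applies rules in file order, so it can give a different answer for a file that lists `Allow: /` first.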

How AI crawlers handle it

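GPTBot (OpenAI, launched August 2023): OpenAI states that it respects robots.txt
ClaudeBot (Anthropic): Anthropic states that it respects robots.txt
Google-Extended (Google): a robots.txt control token rather than a separate crawler; it governs use of content for Gemini training independently of Search indexing
CCBot (Common Crawl): respects robots.txt
PerplexityBot (Perplexity): documented as respecting robots.txt, but some reports suggest inconsistent compliance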
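As an illustration of how a publisher might use these tokens, the sketch below generates a robots.txt that opts out of the AI agents listed above while leaving all other crawlers unrestricted, then checks the result with Python's standard `urllib.robotparser`. The `build_ai_optout` helper and the example URL are hypothetical, and the file only expresses a preference: whether a given crawler honors it is up to that crawler's operator.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the list above; an operator's documentation is the
# authoritative source for the current token names.
AI_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "PerplexityBot"]


def build_ai_optout(tokens):
    """Return robots.txt text that blocks each token and allows everyone else."""
    lines = []
    for token in tokens:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines)


robots_txt = build_ai_optout(AI_TOKENS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical page used only for the check.
url = "https://example.com/articles/1"
for agent in AI_TOKENS + ["Googlebot"]:
    print(agent, parser.can_fetch(agent, url))
```

Each AI token gets its own group with `Disallow: /`, so the check reports `False` for those agents and `True` for Googlebot, which falls through to the wildcard group.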

Enter ai.txt

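robots.txt controls which crawlers may fetch which paths, but it says nothing about what the fetched content may be used for, so search indexing and AI training can only be separated when an operator uses a distinct user-agent token for each. To address this, several proposals have emerged for a separate ai.txt file. Spawning published one specification, and related work is under discussion at the IETF. The core idea is that publishers should be able to say yes to search and no to AI training without blocking all crawlers.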

A proposed ai.txt format

```text
# Example ai.txt
User-Agent: *
Disallow: /

# Allow specific AI uses
Allow-AI: research-noncommercial
Allow-AI: translation
Disallow-AI: training
Disallow-AI: generation
```
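To show how a crawler might read per-use permissions from such a file, here is a minimal sketch. The `Allow-AI` and `Disallow-AI` directive names come from the illustrative example above, not from a ratified standard, and the `parse_ai_txt` and `use_permitted` helpers are hypothetical. Treating an unlisted use as not permitted is an assumption made here for caution; a real specification would have to define that default.

```python
EXAMPLE_AI_TXT = """\
# Example ai.txt
User-Agent: *
Disallow: /

# Allow specific AI uses
Allow-AI: research-noncommercial
Allow-AI: translation
Disallow-AI: training
Disallow-AI: generation
"""


def parse_ai_txt(text):
    """Return {use: True/False} from Allow-AI / Disallow-AI lines."""
    policy = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "allow-ai":
            policy[value.lower()] = True
        elif field == "disallow-ai":
            policy[value.lower()] = False
        # User-Agent and Disallow lines are ignored in this sketch
    return policy


def use_permitted(policy, use, default=False):
    """Look up one use; unlisted uses fall back to `default` (an assumption)."""
    return policy.get(use.lower(), default)


policy = parse_ai_txt(EXAMPLE_AI_TXT)
print(use_permitted(policy, "translation"))  # True
print(use_permitted(policy, "training"))     # False
print(use_permitted(policy, "summarizing"))  # False (unlisted, default applies)
```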

Alternative signals

HTML meta tags: <meta name="robots" content="noai">
HTTP headers: X-Robots-Tag: noai, noimageai
IPTC photo metadata: DataMining: prohibited
C2PA content credentials: cryptographically signed provenance
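The first two signals in this list are page-level and can be checked mechanically. The sketch below collects `noai` and `noimageai` directives from a robots meta tag and from an `X-Robots-Tag` header using only the standard library. Both directives are conventions rather than part of a formal standard, so recognition varies by crawler. The `ai_opt_out` helper is hypothetical, and the headers are assumed to arrive as a plain dict keyed exactly as `X-Robots-Tag`.

```python
from html.parser import HTMLParser

AI_DIRECTIVES = {"noai", "noimageai"}


def split_directives(value):
    """Split a comma-separated directive string into a lowercase set."""
    return {d.strip().lower() for d in value.split(",") if d.strip()}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= split_directives(attr.get("content", ""))


def ai_opt_out(html, headers):
    """Return the AI opt-out directives found in the page or its headers."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = headers.get("X-Robots-Tag", "")
    return (parser.directives | split_directives(header_value)) & AI_DIRECTIVES


page = '<html><head><meta name="robots" content="noai"></head></html>'
print(ai_opt_out(page, {"X-Robots-Tag": "noai, noimageai"}))
print(ai_opt_out("<html></html>", {}))  # set()
```

The IPTC and C2PA signals live inside the media file itself and need format-specific tooling, so they are left out of this sketch.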

The big idea: the web's consent infrastructure was built for a different era. Updating it for AI is an open project, and every site maintainer is a small participant in the standard we end up with.
