A simple, 30-year-old text file, robots.txt, is how the web has tried to regulate crawlers. Newer ai.txt proposals aim to update that system for the AI era.
Martijn Koster created the Robots Exclusion Protocol in 1994. The idea is elegantly simple: every website hosts a plaintext file called robots.txt at its root. Well-behaved crawlers read it and follow its rules. It has no teeth, but it has held the web together for three decades.
User-agent: *
Allow: /
Disallow: /private
Disallow: /admin
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml

A minimal robots.txt
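In practice, a well-behaved crawler checks this file before requesting any page. Python's standard library ships a parser for the protocol, so the check can be as small as the sketch below (example.com and the MyCrawler user-agent string are placeholders).

# Minimal sketch: a polite crawler consults robots.txt before fetching a page.
from urllib import robotparser
import urllib.request

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the file

url = "https://example.com/private/reports"
if rp.can_fetch("MyCrawler", url):
    page = urllib.request.urlopen(url).read()  # allowed: go ahead and fetch
else:
    print("robots.txt disallows", url, "- skipping")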
Because robots.txt conflates search indexing with AI training, several proposals have emerged for a separate ai.txt file. Spawning.ai published one standard; others are under discussion at the IETF. The core idea is that publishers should be able to say yes to search and no to AI training without blocking all crawlers.

# Example ai.txt
User-Agent: *
Disallow: /
# Allow specific AI uses
Allow-AI: research-noncommercial
Allow-AI: translation
Disallow-AI: training
Disallow-AI: generation

A proposed ai.txt format
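None of the ai.txt proposals has standard tooling yet, so a crawler that wanted to respect the file would roll its own reader. The sketch below is hypothetical: the read_ai_directives helper and the example.com URL are illustrative, and the directive names simply mirror the proposed format above.

# Hypothetical sketch only: no reference parser for ai.txt exists; the
# directive names below mirror the proposed example above.
import urllib.request

def read_ai_directives(url):
    """Collect the values of Allow-AI and Disallow-AI lines from an ai.txt file."""
    allowed, disallowed = set(), set()
    with urllib.request.urlopen(url) as resp:
        for raw in resp.read().decode("utf-8", errors="replace").splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
            if ":" not in line:
                continue
            field, value = (part.strip().lower() for part in line.split(":", 1))
            if field == "allow-ai":
                allowed.add(value)
            elif field == "disallow-ai":
                disallowed.add(value)
    return allowed, disallowed

allowed, disallowed = read_ai_directives("https://example.com/ai.txt")
if "training" in disallowed:
    print("Site opts out of AI training; exclude it from the training corpus.")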
The big idea: the web's consent infrastructure was built for a different era. Updating it for AI is an open project, and every site maintainer is a small participant in the standard we end up with.

Quiz: 15 questions. Take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-robots-txt-ai-txt
What year was the Robots Exclusion Protocol originally created?
A website operator wants to allow search engine indexing but prevent AI companies from training on their content. Which file combination would achieve this?
Which AI crawler has faced public reports of inconsistent adherence to robots.txt rules?
A developer argues that robots.txt should legally force AI companies to respect its rules. What limitation exists in the current system?
What technical mechanism would a photographer use to embed a machine-readable instruction that prevents AI training on their images?
Which organization published one proposed standard for ai.txt?
What does the HTTP header X-Robots-Tag allow website operators to do?
A content creator wants to ensure their written articles cannot be used for AI training while still appearing in Google search results. What should they do?
Which company created the crawler named GPTBot?
What is C2PA primarily designed to provide for digital content?
Google-Extended behaves differently from other Google crawlers in what way?
What is the fundamental limitation that led to proposals like ai.txt?
The EU AI Act (2024) takes a step toward enforcement by requiring what?
What happens when a well-behaved crawler encounters a website's robots.txt file?
Which crawler is associated with Anthropic?