A simple, 30-year-old text file, robots.txt, is how the web has tried to regulate crawlers. Newer ai.txt proposals aim to update that system for the AI era.
Martijn Koster created the Robots Exclusion Protocol in 1994. The idea is elegantly simple: every website hosts a plaintext file called robots.txt at its root. Well-behaved crawlers read it and follow its rules. It has no teeth, but it has held the web together for three decades.
User-agent: *
Allow: /
Disallow: /private
Disallow: /admin
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml

A minimal robots.txt
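In practice, a well-behaved crawler checks this file before requesting any page. Python's standard library ships a parser for the protocol, so the check can be as small as the sketch below (example.com and the MyCrawler user-agent string are placeholders).

# Minimal sketch: a polite crawler consults robots.txt before fetching a page.
from urllib import robotparser
import urllib.request

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the file

url = "https://example.com/private/reports"
if rp.can_fetch("MyCrawler", url):
    page = urllib.request.urlopen(url).read()  # allowed: go ahead and fetch
else:
    print("robots.txt disallows", url, "- skipping")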
Because robots.txt conflates search indexing with AI training, several proposals have emerged for a separate ai.txt file. Spawning.ai published one standard; others are under discussion at the IETF. The core idea is that publishers should be able to say yes to search and no to AI training without blocking all crawlers.

# Example ai.txt
User-Agent: *
Disallow: /
# Allow specific AI uses
Allow-AI: research-noncommercial
Allow-AI: translation
Disallow-AI: training
Disallow-AI: generation

A proposed ai.txt format
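None of the ai.txt proposals has standard tooling yet, so a crawler that wanted to respect the file would roll its own reader. The sketch below is hypothetical: the read_ai_directives helper and the example.com URL are illustrative, and the directive names simply mirror the proposed format above.

# Hypothetical sketch only: no reference parser for ai.txt exists; the
# directive names below mirror the proposed example above.
import urllib.request

def read_ai_directives(url):
    """Collect the values of Allow-AI and Disallow-AI lines from an ai.txt file."""
    allowed, disallowed = set(), set()
    with urllib.request.urlopen(url) as resp:
        for raw in resp.read().decode("utf-8", errors="replace").splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
            if ":" not in line:
                continue
            field, value = (part.strip().lower() for part in line.split(":", 1))
            if field == "allow-ai":
                allowed.add(value)
            elif field == "disallow-ai":
                disallowed.add(value)
    return allowed, disallowed

allowed, disallowed = read_ai_directives("https://example.com/ai.txt")
if "training" in disallowed:
    print("Site opts out of AI training; exclude it from the training corpus.")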
The big idea: the web's consent infrastructure was built for a different era. Updating it for AI is an open project, and every site maintainer is a small participant in the standard we end up with.

Quiz: 15 questions. Take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-robots-txt-ai-txt
What year was the Robots Exclusion Protocol originally created?
A website operator wants to allow search engine indexing but prevent AI companies from training on their content. Which file combination would achieve this?
Which AI crawler has faced public reports of inconsistent adherence to robots.txt rules?
A developer argues that robots.txt should legally force AI companies to respect its rules. What limitation exists in the current system?
What technical mechanism would a photographer use to embed a machine-readable instruction that prevents AI training on their images?
Which organization published one proposed standard for ai.txt?
What does the HTTP header X-Robots-Tag allow website operators to do?
A content creator wants to ensure their written articles cannot be used for AI training while still appearing in Google search results. What should they do?
Which company created the crawler named GPTBot?
What is C2PA primarily designed to provide for digital content?
Google-Extended behaves differently from other Google crawlers in what way?
What is the fundamental limitation that led to proposals like ai.txt?
The EU AI Act (2024) takes a step toward enforcement by requiring what?
What happens when a well-behaved crawler encounters a website's robots.txt file?
Which crawler is associated with Anthropic?