Test-driven development meets AI: paste a failing test, ask the agent to make it green, iterate. Learn the discipline that makes AI-generated code reliably correct, because correctness becomes executable.
A test is the most precise specification possible: "given this input, the function returns this output." There is no room for interpretation. AI thrives on room for interpretation; take it away and the model becomes dramatically more reliable.
1. RED: You write a failing test (or several).
2. PROMPT: Paste the test. Ask the AI to make it pass without breaking others.
3. AGENT: Edits source, runs the test suite, iterates until green.
4. GREEN: All tests pass. You read the diff.
5. REFACTOR: "Improve readability without changing behavior. Tests must stay green."
6. COMMIT: Test + implementation as one atomic change.

The tests are the spec, the agent is the implementer, you are the architect.

// Step 1 — write the failing tests yourself
import { describe, it, expect } from "vitest";
import { parseDuration } from "./parse-duration";

describe("parseDuration", () => {
  it("parses simple seconds", () => {
    expect(parseDuration("30s")).toBe(30);
  });

  it("parses minutes", () => {
    expect(parseDuration("5m")).toBe(300);
  });

  it("parses combined", () => {
    expect(parseDuration("1h30m")).toBe(5400);
  });

  it("throws on bad input", () => {
    expect(() => parseDuration("banana")).toThrow();
  });

  it("returns 0 for empty string", () => {
    expect(parseDuration("")).toBe(0);
  });
});

Five tests, five sentences of spec — including the edge cases AI usually skips.

# Step 2 — the prompt
"Make all tests in parse-duration.test.ts pass without breaking
any other test in the suite. Run `pnpm test` after each edit.
Show me the final implementation only."
# Claude Code, Cursor Agent, and Codex CLI will all do this end-to-end
# without further intervention.

The agent now has executable success criteria. Iteration converges fast.

# DANGEROUS PROMPT:
"Write the function and the tests for it."
# What you'll get: tests written to match the implementation,
# not the spec. Tests pass. Code may still be wrong.
# The tests are tautologies.
# SAFE PROMPT:
"Here are the tests. Here is the function signature.
Implement the function so all tests pass. Do not add or modify
any test."

If the AI controls both sides of the equation, both sides will agree even when wrong.

Mutation testing tools (Stryker for JS, mutmut for Python) make small, systematic changes (mutants) to your source and check whether any test fails. If your tests don't catch a mutation, your tests have holes — exactly the holes AI loves to hide bugs in. Run mutation tests once after a test-driven-prompting session and you'll see how good your spec really is.
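For this project, a minimal Stryker setup might look like the following. This is a sketch, not a verified config: the file name `stryker.config.json`, the `src/parse-duration.ts` path, and the use of the `@stryker-mutator/vitest-runner` plugin are assumptions — check the Stryker docs for your own project layout.

```json
{
  "mutate": ["src/parse-duration.ts"],
  "testRunner": "vitest",
  "reporters": ["clear-text", "progress"],
  "coverageAnalysis": "perTest"
}
```

Run it with `npx stryker run`; any mutant that survives points at behavior your five tests never pin down.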
The test is the only prompt the machine cannot misinterpret.
— A staff engineer who learned the hard way
The big idea: AI is unreliable in proportion to how much it can interpret your intent. Tests collapse interpretation to zero. Write the contract first, lock the tests, and let the agent grind to green. The result is the most reliable AI-generated code you will ever ship.
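To make the loop concrete, here is one implementation that would turn the five tests above green. It is a sketch of what an agent might converge on, not the only correct answer; the regex tokenization and the error message are assumptions, since the tests only require that bad input throws.

```typescript
// parse-duration.ts: one possible implementation of the five-test spec.
// Seconds are the return unit, as the tests encode.
const UNIT_SECONDS: Record<string, number> = { h: 3600, m: 60, s: 1 };

export function parseDuration(input: string): number {
  if (input === "") return 0; // spec: empty string is zero seconds

  // Tokenize into <digits><unit> pairs, e.g. "1h30m" -> ["1h", "30m"]
  const tokens = input.match(/\d+[hms]/g);

  // Reject anything the token pattern does not fully consume ("banana", "5x")
  if (tokens === null || tokens.join("") !== input) {
    throw new Error(`Invalid duration: "${input}"`);
  }

  // Sum each token's value, converted to seconds
  return tokens.reduce((total, token) => {
    const value = Number(token.slice(0, -1));
    const unit = token.slice(-1);
    return total + value * UNIT_SECONDS[unit];
  }, 0);
}
```

The `join("") !== input` check is the part agents most often get wrong on the first pass: without it, `"30s junk"` would silently parse as 30.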
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-coding-debug-test-driven-prompting-creators