The premise
You cannot test a stochastic agent the same way you test a deterministic function, but you can replay a recording.
What AI does well here
- Record real conversations and replay them in CI (see the sketch after this list)
- Diff tool-call sequences for regressions
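
Below is a minimal sketch of the record/replay mechanism. The names `call_llm`, `FIXTURE_DIR`, and the `REPLAY_MODE` environment variable are assumptions for illustration, not any particular library's API.

```python
# Record/replay sketch. In "record" mode, real LLM responses are written to
# disk keyed by a hash of the request messages; in "replay" mode (the default
# here, for CI), the same hash looks the response back up, so the test runs
# deterministically and offline.
import hashlib
import json
import os
import pathlib

FIXTURE_DIR = pathlib.Path("tests/fixtures")    # assumed layout
MODE = os.environ.get("REPLAY_MODE", "replay")  # "record" or "replay"

def message_hash(messages: list[dict]) -> str:
    """Stable fixture key: SHA-256 of the canonical JSON of the messages."""
    canonical = json.dumps(messages, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def llm_call(messages: list[dict]) -> dict:
    path = FIXTURE_DIR / f"{message_hash(messages)}.json"
    if MODE == "record":
        response = call_llm(messages)  # your real client call (assumed name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(response))
        return response
    return json.loads(path.read_text())  # replay: no network, no jitter
```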
What AI cannot do
- Catch novel failures the recording never saw
- Substitute for live evals on real users
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-agentic-agent-deterministic-replay-test-creators
In deterministic replay testing, what is the primary purpose of recording a conversation?
- To capture a specific execution trace that can be replayed in CI to verify consistent behavior
- To demonstrate the AI's capabilities to stakeholders
- To train the AI model on new examples
- To enable tests to run without network connectivity
What is a fixture in the context of deterministic replay testing?
- A type of assertion that compares AI outputs
- A recorded LLM response stored and retrieved by message hash during test execution
- A configuration file for CI pipelines
- A testing framework for web applications
What three components make up a complete test case in deterministic replay testing?
- User ID, session token, API key
- Input transcript, expected tool sequence, expected final state
- Model name, temperature setting, max tokens
- Input prompt, expected token count, response time
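
For reference, those three components could be modeled as a small data structure; the field names and example values here are illustrative, not from any specific framework.

```python
# One replay test case: the recorded input, the tools that should fire (in
# order), and what the world should look like afterward.
from dataclasses import dataclass, field

@dataclass
class ReplayTestCase:
    input_transcript: list[dict]        # recorded user/assistant turns
    expected_tool_sequence: list[str]   # tool names, in call order
    expected_final_state: dict = field(default_factory=dict)

case = ReplayTestCase(
    input_transcript=[{"role": "user", "content": "Refund order A-1001"}],
    expected_tool_sequence=["lookup_order", "issue_refund"],
    expected_final_state={"refund_issued": True},
)
```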
Why are recordings alone insufficient as a complete testing strategy for AI agents?
- Recordings cannot be version controlled
- Recordings consume too much storage space
- Recordings make tests run slower
- Recordings cannot detect when the agent behaves correctly for novel inputs it has never seen
What happens to replay test recordings when the system prompt is modified?
- They become more accurate because the AI has been improved
- They are deleted to save storage space
- They are automatically updated to match the new prompt
- They become stale and may produce false negatives when the agent's actual behavior differs from the recording
What mechanism should be implemented to handle stale recordings after system prompt changes?
- Ignoring prompt differences to maintain test stability
- Automatic flagging of tests where the recorded prompt differs from current code
- Manual review of all test recordings
- Deletion of all recordings older than 30 days
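
One way to implement that flagging, sketched with pytest: store a hash of the system prompt alongside each recording, and skip with an explicit message when the prompt in the current code no longer matches. `CURRENT_SYSTEM_PROMPT` and the fixture field name are assumptions.

```python
# Stale-recording guard: each fixture stores the hash of the system prompt it
# was recorded under; if the current prompt differs, skip with a loud
# "re-record" message instead of letting the test fail cryptically.
import hashlib
import json
import pytest

CURRENT_SYSTEM_PROMPT = "You are a helpful support agent..."  # from your code

def prompt_hash(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()

def load_recording(path: str) -> dict:
    with open(path) as f:
        recording = json.load(f)
    if recording["system_prompt_hash"] != prompt_hash(CURRENT_SYSTEM_PROMPT):
        pytest.skip(f"Stale recording {path}: system prompt changed; re-record.")
    return recording
```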
What does diffing tool-call sequences help detect in replay testing?
- Differences in response latency
- Regressions in agent behavior that cause different tool execution order or selection
- Memory leaks in the testing framework
- Changes in the AI model's vocabulary
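
As a concrete sketch, the standard library's `difflib` is enough to turn a sequence regression into a readable diff; the tool-call strings are illustrative.

```python
# Diff a recorded tool-call sequence against the current run so a regression
# shows up as a unified diff rather than a bare assertion failure.
import difflib

recorded = ["lookup_order(A-1001)", "issue_refund(A-1001, 42.0)"]
current = ["lookup_order(A-1001)", "escalate_to_human(A-1001)"]

print("\n".join(difflib.unified_diff(
    recorded, current, fromfile="recorded", tofile="current", lineterm="")))
# --- recorded
# +++ current
# @@ -1,2 +1,2 @@
#  lookup_order(A-1001)
# -issue_refund(A-1001, 42.0)
# +escalate_to_human(A-1001)
```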
In replay testing, how are actual LLM calls replaced during test execution?
- With random number generators
- With hardcoded string literals
- With mocked HTTP responses
- With recorded responses keyed by message hash
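
In Python tests this swap is commonly done by patching the client at its boundary; here is a sketch using `unittest.mock`. The patch target `agent.client.complete`, the `run_agent` entry point, and the reuse of the hash-keyed `llm_call` from the earlier sketch are all assumptions.

```python
# Route the agent's LLM calls through the hash-keyed fixture store at test
# time. "agent.client.complete" is a hypothetical import path; point the
# patch at wherever your agent actually calls the model.
from unittest.mock import patch

def test_agent_replays_offline():
    # side_effect delegates each intercepted call to the fixture lookup
    # (llm_call, as sketched under "What AI does well here").
    with patch("agent.client.complete", side_effect=llm_call):
        result = run_agent("Refund order A-1001")  # assumed entry point
        assert "refund" in result.lower()
```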
What is the relationship between replay testing and continuous integration (CI)?
- Replay tests cannot run in CI environments
- Replay tests run in CI to catch behavior changes without being affected by model output jitter
- Replay tests only run on developer workstations
- Replay tests require manual execution in CI
A developer modifies the tool definitions available to an agent. What is the most appropriate next step for the replay test suite?
- Review and likely re-record tests that depend on the modified tool definitions
- Delete all existing recordings and start fresh
- Increase the temperature setting to increase test coverage
- No changes needed; replay tests will adapt automatically
What problem does deterministic replay testing specifically solve for AI agents?
- It reduces the cost of API calls to zero
- It eliminates the need for any test automation
- It enables reliable, automated regression testing despite inherent randomness
- It makes stochastic agents produce deterministic outputs
What is 'model jitter' in the context of AI agent testing?
- Random variations in API response times
- Memory leaks causing test instability
- Inherent randomness in model outputs causing different results for identical inputs
- Physical movement of server hardware
What type of failures can replay testing definitively catch?
- Behavioral regressions in previously recorded scenarios
- Failures caused by inputs the agent has never encountered before
- Failures that occur only under specific network conditions
- Failures related to UI rendering in edge browsers
Why is it important that fixtures are keyed by message hash rather than just stored sequentially?
- Hash keys allow correct fixture retrieval regardless of test execution order or message position
- Hash keys use less storage space
- Hash keys are required by CI pipelines
- Sequential keys are not supported by modern testing frameworks
A team wants to ensure their replay tests remain valid over time. What practice should they adopt?
- Using the latest AI model available
- Automatically flagging tests when recorded prompts differ from current code
- Avoiding any changes to the agent after recording
- Running tests only once per year