Agent Safety: Sandboxes and Human-in-the-Loop

Giving an AI the keys to your computer is a big deal. Learn the two simplest ways to keep an agent safe: wall it off from things it shouldn't touch, and put a human in the decision path.

34 min · Reviewed 2026

Why safety is not optional

An agent with filesystem access can delete your thesis. An agent with email access can send 'you're fired' to your entire team. An agent with credit card access can order 400 rubber ducks. These are not hypothetical: every one has happened in production in the last 18 months. Safety isn't a feature; it's infrastructure.

Defense one: sandbox

A sandbox is a walled space where the agent can only see and touch specific things. Outside the wall, the agent has no power. Modern options in 2026:

Vercel Sandbox — Firecracker microVMs for running untrusted code.
Docker containers with bind-mounted folders only (classic and free).
Anthropic's Claude Code with scoped permissions (per-directory, per-tool).
Browser-only agents (Browser Use, Operator) — can't touch your filesystem.
Virtual machines (VirtualBox, UTM, Parallels) — full isolation if you're paranoid.
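As a rough illustration of the Docker option, the sketch below builds a `docker run` command line that exposes only chosen host folders and turns networking off. It relies on the standard Docker CLI flags `--rm`, `--network none`, and `-v host:container[:ro]`. The function name, the `/sandbox/...` mount points, and the image name are hypothetical choices made for this example, and the function only builds the argument list; it does not start a container.

```python
from pathlib import Path


def docker_sandbox_argv(image, command, read_only_dirs=(), writable_dirs=()):
    """Build a `docker run` argv that exposes only the listed host folders.

    Hypothetical helper: the /sandbox/ro and /sandbox/rw mount points are
    assumptions made for this sketch, not a Docker convention.
    """
    # --rm removes the container on exit; --network none disables networking.
    argv = ["docker", "run", "--rm", "--network", "none"]
    for host_dir in read_only_dirs:
        host = Path(host_dir).expanduser()
        # The ":ro" suffix makes the bind mount read-only inside the container.
        argv += ["-v", f"{host}:/sandbox/ro/{host.name}:ro"]
    for host_dir in writable_dirs:
        host = Path(host_dir).expanduser()
        argv += ["-v", f"{host}:/sandbox/rw/{host.name}"]
    return argv + [image, *command]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(docker_sandbox_argv(
        "python:3.12-slim",
        ["ls", "/sandbox"],
        read_only_dirs=["~/Downloads"],
        writable_dirs=["~/Downloads/archive"],
    )))
```

Anything not mounted is simply absent inside the container, which is the "wall" in practice. To actually launch it, the list could be passed to `subprocess.run`.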

{
  "agent": "cleanup-bot",
  "sandbox": {
    "filesystem": {
      "read": ["~/Downloads", "~/Desktop"],
      "write": ["~/Downloads/archive"],
      "deny": ["~/Documents", "~/Library", "~/.ssh"]
    },
    "network": "none",
    "shell": false
  }
}

An example permission config. Deny by default. Allow exactly what's needed.
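A config like this only helps if something enforces it. Here is a minimal sketch of a deny-by-default path check, assuming the filesystem rules have been loaded into a Python dict. The names `POLICY` and `is_allowed` are hypothetical, and a real sandbox would enforce this at the OS or container level rather than in application code.

```python
from pathlib import Path

# Mirrors the "filesystem" section of the example config (assumed already parsed).
POLICY = {
    "read": ["~/Downloads", "~/Desktop"],
    "write": ["~/Downloads/archive"],
    "deny": ["~/Documents", "~/Library", "~/.ssh"],
}


def _resolve(path):
    # Expand "~" and collapse ".." so tricks like "~/Downloads/../.ssh" are caught.
    return Path(path).expanduser().resolve()


def _is_inside(path, root):
    return path == root or root in path.parents


def is_allowed(path, mode, policy=POLICY):
    """Return True only if `path` is explicitly allowed for `mode` ("read" or "write")."""
    target = _resolve(path)
    # Deny entries always win over allow entries.
    if any(_is_inside(target, _resolve(d)) for d in policy.get("deny", [])):
        return False
    # Deny by default: a path that matches no allow entry is rejected.
    return any(_is_inside(target, _resolve(a)) for a in policy.get(mode, []))
```

With this policy, `is_allowed("~/Downloads/report.pdf", "read")` passes, while the same path with `"write"` does not, because only `~/Downloads/archive` is writable.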

Defense two: human-in-the-loop

For any action that's destructive (delete, send, pay, push), require a human to approve before the agent proceeds. Yes, it slows things down. Yes, it's worth it. Every serious platform — Claude Code, Devin, OpenClaw Mission Control — has an approval gate feature.

AGENT: I want to run: rm -rf ~/old_project
       Reason: You asked me to clean up old projects.
       Impact: Deletes 1.2 GB, 847 files.
       
AWAITING APPROVAL (y/n/details):

What an approval gate should look like: action, reason, impact, pause.
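The gate above can be sketched in a few lines of Python. This is an illustrative version, not how any particular platform implements it: `request_approval` and its `ask` parameter are hypothetical names, and `ask` defaults to the built-in `input` so the call blocks until a human responds.

```python
def request_approval(action, reason, impact, ask=input):
    """Show action, reason, and impact, then block until a human answers y or n."""
    prompt = (
        f"AGENT: I want to run: {action}\n"
        f"       Reason: {reason}\n"
        f"       Impact: {impact}\n\n"
        "AWAITING APPROVAL (y/n/details): "
    )
    while True:
        answer = ask(prompt).strip().lower()
        if answer == "y":
            return True
        if answer == "n":
            return False
        if answer == "details":
            print(f"Full command: {action!r}")
        # Any other input re-asks; nothing defaults to approval.


if __name__ == "__main__":
    if request_approval(
        "rm -rf ~/old_project",
        "You asked me to clean up old projects.",
        "Deletes 1.2 GB, 847 files.",
    ):
        print("Approved. The agent may proceed.")
    else:
        print("Rejected. Nothing was run.")
```

The important design choice is that the only path to `True` is an explicit `y`. An empty line, a typo, or a timeout in a fuller version should never count as consent.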

Tiered approval

Risk	Approval policy	Examples
Low (read-only)	Auto-approve.	List files, read a page, query an API.
Medium (reversible)	Batch-approve with review.	Create a branch, draft an email.
High (destructive)	Per-action human approval.	Delete files, send emails, make payments.
Critical (irreversible)	Two-person approval + logging.	Deploy prod, wire money, delete users.
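One way to turn the table into code is a lookup from risk tier to policy, plus a check that runs before every action. The sketch below is an assumption-laden example: the action names, the `Policy` fields, and the choice to treat unknown actions as critical are all hypothetical, not part of any specific framework.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "read-only"
    MEDIUM = "reversible"
    HIGH = "destructive"
    CRITICAL = "irreversible"


@dataclass(frozen=True)
class Policy:
    approvers_needed: int  # distinct humans who must sign off
    batchable: bool        # may approvals be granted for a batch of actions?
    must_log: bool         # is an audit log entry mandatory?


# One row per tier in the table above.
POLICIES = {
    Risk.LOW: Policy(approvers_needed=0, batchable=True, must_log=False),
    Risk.MEDIUM: Policy(approvers_needed=1, batchable=True, must_log=False),
    Risk.HIGH: Policy(approvers_needed=1, batchable=False, must_log=False),
    Risk.CRITICAL: Policy(approvers_needed=2, batchable=False, must_log=True),
}

# Hypothetical action names, classified by hand for this example.
ACTION_RISK = {
    "list_files": Risk.LOW,
    "draft_email": Risk.MEDIUM,
    "delete_files": Risk.HIGH,
    "wire_money": Risk.CRITICAL,
}


def may_proceed(action, approvers):
    """Return True if enough distinct humans approved this action."""
    # Assumption: an unclassified action is treated as the riskiest tier.
    risk = ACTION_RISK.get(action, Risk.CRITICAL)
    return len(set(approvers)) >= POLICIES[risk].approvers_needed
```

Using `set(approvers)` means the same person approving twice does not satisfy the two-person rule for critical actions.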

Good agent design is mostly about building the safe envelope, not the clever prompt. Get the envelope right and you can run bold experiments. Get it wrong and one bad run can ruin a week.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-agentic-safety-sandbox-builders

What is the core idea behind "Agent Safety: Sandboxes and Human-in-the-Loop"?
1. Giving an AI the keys to your computer is a big deal. Learn the two simplest ways to keep an agent safe: wall it off from things it shouldn't touch, and put a human in the decision path.
2. Recover attribution for untagged historical calls.
3. Including correlation IDs across distributed calls
4. Choosing tools that match keywords in user requests
Which term best describes a foundational idea in "Agent Safety: Sandboxes and Human-in-the-Loop"?
1. human-in-the-loop
2. sandbox
3. least privilege
4. approval gate
A learner studying Agent Safety: Sandboxes and Human-in-the-Loop would need to understand which concept?
1. sandbox
2. least privilege
3. human-in-the-loop
4. approval gate
Which of these is directly relevant to Agent Safety: Sandboxes and Human-in-the-Loop?
1. sandbox
2. human-in-the-loop
3. approval gate
4. least privilege
Which of the following is a key point about Agent Safety: Sandboxes and Human-in-the-Loop?
1. Vercel Sandbox — Firecracker microVMs for running untrusted code.
2. Docker containers with bind-mounted folders only (classic and free).
3. Anthropic's Claude Code with scoped permissions (per-directory, per-tool).
4. Browser-only agents (Browser Use, Operator) — can't touch your filesystem.
Which of these does NOT belong in a discussion of Agent Safety: Sandboxes and Human-in-the-Loop?
1. Vercel Sandbox — Firecracker microVMs for running untrusted code.
2. Anthropic's Claude Code with scoped permissions (per-directory, per-tool).
3. Recover attribution for untagged historical calls.
4. Docker containers with bind-mounted folders only (classic and free).
What is the key insight about "Assume the agent will mess up" in the context of Agent Safety: Sandboxes and Human-in-the-Loop?
1. Recover attribution for untagged historical calls.
2. Including correlation IDs across distributed calls
3. Not because AI is evil, but because AI is wrong some of the time.
4. Choosing tools that match keywords in user requests
What is the key insight about "The rule of three" in the context of Agent Safety: Sandboxes and Human-in-the-Loop?
1. Recover attribution for untagged historical calls.
2. Including correlation IDs across distributed calls
3. Choosing tools that match keywords in user requests
4. Three questions before giving an agent any permission: (1) What's the worst that could happen? (2) How would I notice? (…
What is the key insight about "Never give agents your master password" in the context of Agent Safety: Sandboxes and Human-in-the-Loop?
1. Use API keys with scoped permissions. Use app-specific passwords. Revoke when a task finishes.
2. Recover attribution for untagged historical calls.
3. Including correlation IDs across distributed calls
4. Choosing tools that match keywords in user requests
Which statement accurately describes an aspect of Agent Safety: Sandboxes and Human-in-the-Loop?
1. Recover attribution for untagged historical calls.
2. An agent with filesystem access can delete your thesis. An agent with email access can send 'you're fired' to your entire team.
3. Including correlation IDs across distributed calls
4. Choosing tools that match keywords in user requests
What does working with Agent Safety: Sandboxes and Human-in-the-Loop typically involve?
1. Recover attribution for untagged historical calls.
2. Including correlation IDs across distributed calls
3. A sandbox is a walled space where the agent can only see and touch specific things. Outside the wall, the agent has no power.
4. Choosing tools that match keywords in user requests
Which of the following is true about Agent Safety: Sandboxes and Human-in-the-Loop?
1. Recover attribution for untagged historical calls.
2. Including correlation IDs across distributed calls
3. Choosing tools that match keywords in user requests
4. For any action that's destructive (delete, send, pay, push), require a human to approve before the agent proceeds. Yes, it slows things down.
Which best describes the scope of "Agent Safety: Sandboxes and Human-in-the-Loop"?
1. It focuses on Giving an AI the keys to your computer is a big deal. Learn the two simplest ways to keep an agent s
2. It is unrelated to agentic workflows
3. It applies only to the opposite beginner tier
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about Agent Safety: Sandboxes and Human-in-the-Loop?
1. Recover attribution for untagged historical calls.
2. Defense one: sandbox
3. Including correlation IDs across distributed calls
4. Choosing tools that match keywords in user requests
Which section heading best belongs in a lesson about Agent Safety: Sandboxes and Human-in-the-Loop?
1. Recover attribution for untagged historical calls.
2. Including correlation IDs across distributed calls
3. Defense two: human-in-the-loop
4. Choosing tools that match keywords in user requests
