Computer Use lets Claude see your screen and use it — mouse, keyboard, apps. The capability is real, the gotchas are real. A hands-on look at what works in 2026.
Anthropic's Computer Use API (a generally available beta since late 2025) lets Claude see a screen, reason about it, and output actions: "click at (x,y)", "type this", "scroll down". Combined with a screenshot loop, Claude can use any desktop app: spreadsheets, design tools, browsers, legacy line-of-business software with no API.
```python
import base64
import io

import anthropic
import pyautogui

client = anthropic.Anthropic()

def screenshot_b64():
    """Capture the screen and return it as a base64-encoded PNG."""
    img = pyautogui.screenshot()
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

tools = [{
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
}]

messages = [{"role": "user", "content": "Open Calculator and compute 137*42."}]

while True:
    resp = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=tools,
        messages=messages,
        betas=["computer-use-2025-01-24"],
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":  # finished (end_turn) or stopped early
        break

    # Execute every tool call, then return all results in ONE user message.
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use" and block.name == "computer":
            act = block.input
            if act["action"] == "screenshot":
                result = [{"type": "image", "source": {"type": "base64",
                           "media_type": "image/png", "data": screenshot_b64()}}]
            elif act["action"] == "left_click":
                pyautogui.click(act["coordinate"][0], act["coordinate"][1])
                result = "clicked"
            elif act["action"] == "type":
                pyautogui.typewrite(act["text"])
                result = "typed"
            # ... key, scroll, mouse_move, etc.
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
    messages.append({"role": "user", "content": tool_results})
```

*A minimal Computer Use loop. Claude asks for a screenshot, you provide one, it chooses an action, you execute, repeat.*

Claude Sonnet's score on OSWorld, the industry's hardest desktop benchmark, jumped from under 15% in late 2024 to 72.5% by April 2026. The leap came largely from Anthropic's February 2026 acquisition of Vercept, a vision-perception team. Practically: Claude now locates buttons, text fields, menus, and state with human-level reliability on common apps.
| Task type | Reliability (April 2026) | Notes |
|---|---|---|
| Spreadsheet fills from structured input | High | Google Sheets, Excel — strong. |
| Form-filling in legacy apps | High | Including Citrix and VDI sessions. |
| Multi-app workflows (copy/paste) | Medium-high | Occasionally misses focus changes. |
| Creative design tools | Medium | Figma yes; Photoshop hit-or-miss. |
| Fast-moving web UIs (SPA) | Medium | Screenshots lag SPA state changes. |
| Captchas | Low | Still mostly blocked. |
| Games | Low | Not designed for timed reflexes. |
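The two medium-reliability rows above share a root cause: by the time Claude picks a coordinate, the screenshot it reasoned over may already be stale. A cheap mitigation is to wait for the screen to stop repainting before capturing. Here is a minimal sketch, reusing pyautogui from the loop above; the helper name and thresholds are my own, not part of any API:

```python
import time

import pyautogui

def wait_for_stable_screen(settle_s=0.3, max_wait_s=5.0):
    """Re-capture until two consecutive screenshots are identical,
    i.e. the UI has (probably) finished repainting. Illustrative only."""
    deadline = time.time() + max_wait_s
    prev = pyautogui.screenshot().tobytes()
    while time.time() < deadline:
        time.sleep(settle_s)
        cur = pyautogui.screenshot().tobytes()
        if cur == prev:   # two identical frames: the screen has settled
            return True
        prev = cur
    return False          # still animating; the caller sends the latest frame anyway
```

Calling this before `screenshot_b64()` adds a few hundred milliseconds per step, but it avoids handing the model a frame of a half-rendered SPA or a window that hasn't taken focus yet.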
Computer Use is the right tool when no API exists. For anything with a clean API, use the API — it's 100x faster and cheaper.
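For contrast, here is what the spreadsheet fill from the table above looks like when the data lives behind a plain HTTP API. The endpoint, path, and token below are hypothetical; the point is the shape of the interaction:

```python
import requests

# Hypothetical internal endpoint standing in for "anything with a clean API".
resp = requests.post(
    "https://internal.example.com/api/v1/sheets/q3-costs/rows",
    headers={"Authorization": "Bearer <token>"},
    json={"rows": [["2026-04-01", "AWS", 1742.50]]},
    timeout=10,
)
resp.raise_for_status()
# One round trip per batch of rows -- versus a screenshot, a model call,
# and a click or keystroke for every cell in the Computer Use loop.
```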