Computer Use API: Letting AI Click Through GUIs
Computer Use lets Claude see your screen and use it — mouse, keyboard, apps. The capability is real, the gotchas are real. A hands-on look at what works in 2026.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The capability
2. Computer Use
3. Screen understanding
4. GUI automation
Section 1
The capability
Anthropic's Computer Use API (in public beta since late 2025) lets Claude see a screen, reason about it, and output actions — 'click at (x,y)', 'type this', 'scroll down'. Combined with a screenshot loop, Claude can use any desktop app: spreadsheets, design tools, browsers, even legacy line-of-business software with no API.
The minimal loop
A minimal Computer Use loop. Claude asks for a screenshot, you provide one, it chooses an action, you execute, repeat.
import anthropic
import base64, io, pyautogui

client = anthropic.Anthropic()

def screenshot_b64():
    """Capture the screen and return it as a base64-encoded PNG."""
    img = pyautogui.screenshot()
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

# Tool definition: tells Claude the screen size it is operating on.
tools = [{
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
}]

messages = [{"role": "user", "content": "Open Calculator and compute 137*42."}]

while True:
    resp = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=tools,
        messages=messages,
        betas=["computer-use-2025-01-24"],
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason == "end_turn":
        break  # Claude considers the task done
    for block in resp.content:
        if block.type == "tool_use" and block.name == "computer":
            act = block.input
            if act["action"] == "screenshot":
                result = {"type": "image", "source": {"type": "base64",
                          "media_type": "image/png", "data": screenshot_b64()}}
            elif act["action"] == "left_click":
                pyautogui.click(act["coordinate"][0], act["coordinate"][1])
                result = "clicked"
            elif act["action"] == "type":
                pyautogui.typewrite(act["text"])
                result = "typed"
            # ... key, scroll, mouse_move, etc.
            messages.append({"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [result] if isinstance(result, dict) else result,
            }]})

The 2026 capability spike
Claude Sonnet's OSWorld score — the industry's hardest desktop benchmark — jumped from under 15% in late 2024 to 72.5% by April 2026. The leap came largely from Anthropic's February 2026 acquisition of Vercept, a vision-perception team. Practically: Claude now locates buttons, text fields, menus, and state with human-level reliability on common apps.
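Reliability in practice also depends on what you send. Large screenshots burn tokens and can hurt pointing accuracy, so a common pattern is to capture at full resolution, downscale before sending, and scale Claude's coordinates back up before acting. A minimal sketch of the coordinate mapping; the sizes here are illustrative assumptions, not official requirements:

```python
# If you declare a smaller display_width_px/height_px in the tool definition
# than your physical screen, Claude's coordinates refer to the downscaled
# image. Map them back to real pixels before clicking.
MODEL_W, MODEL_H = 1280, 800     # size declared to the model (assumed)
REAL_W, REAL_H = 2560, 1600      # physical display resolution (assumed)

def to_real_pixels(x: int, y: int) -> tuple[int, int]:
    # Scale a coordinate from model-image space to physical-screen space.
    return round(x * REAL_W / MODEL_W), round(y * REAL_H / MODEL_H)

print(to_real_pixels(640, 400))  # -> (1280, 800)
```

The screenshot itself would be resized the same way (e.g. with Pillow's `Image.resize`) before base64-encoding it for the tool result.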
Where it works well vs. not
Compare the options
| Task type | Reliability (April 2026) | Notes |
|---|---|---|
| Spreadsheet fills from structured input | High | Google Sheets, Excel — strong. |
| Form-filling in legacy apps | High | Including Citrix and VDI sessions. |
| Multi-app workflows (copy/paste) | Medium-high | Occasionally misses focus changes. |
| Creative design tools | Medium | Figma yes; Photoshop hit-or-miss. |
| Fast-moving web UIs (SPA) | Medium | Screenshots lag SPA state changes. |
| Captchas | Low | Still mostly blocked. |
| Games | Low | Not designed for timed reflexes. |
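For the medium-reliability rows (missed focus changes, lagging SPA state), a verify-after-action wrapper helps: perform the action, re-screenshot, and confirm the UI actually changed before handing the frame back to the model. A sketch with hypothetical `do_action`, `take_screenshot`, and `changed` callables:

```python
# Sketch of a verify-after-action wrapper. All three callables are
# hypothetical hooks you supply: do_action performs the click/keystroke,
# take_screenshot returns the current frame, changed compares two frames.
import time

def act_and_verify(do_action, take_screenshot, changed, retries=2, wait=0.5):
    """Run do_action, wait for the UI to settle, then check the new
    screenshot with `changed`. Retry a bounded number of times."""
    before = take_screenshot()
    for _ in range(retries + 1):
        do_action()
        time.sleep(wait)          # give the app time to repaint
        after = take_screenshot()
        if changed(before, after):
            return after          # success: hand the fresh frame to Claude
    raise RuntimeError("action did not take effect after retries")
```

`changed` can be as crude as a pixel diff of the region around the click target; the point is that the loop never assumes an action landed.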
Deployment patterns
- Containerized Linux desktop (reference impl): Xvfb + xdotool + Chromium. Clean, disposable.
- Dedicated VM: separate user account, snapshot before each run, restore after.
- Local macOS/Windows with pyautogui: fastest dev loop, riskiest (it IS your desktop).
- Cloud browser providers (Browserbase, Anchor Browser): browser-only, not full desktop.
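The containerized pattern above can be sketched in a few lines: start a virtual X server with Xvfb and point the automation stack at it via `DISPLAY`. This assumes Xvfb is installed in the container; the display number and flags are illustrative:

```python
# Sketch: launch a headless X display for pyautogui/Chromium to render into.
# Assumes the Xvfb binary is available; not an official reference setup.
import os, subprocess, time

def xvfb_command(display=99, width=1920, height=1080):
    """Command line for a virtual X server with one 24-bit screen."""
    return ["Xvfb", f":{display}", "-screen", "0", f"{width}x{height}x24"]

def start_virtual_display(display=99, width=1920, height=1080):
    proc = subprocess.Popen(xvfb_command(display, width, height),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    time.sleep(1)                          # crude wait for the server
    os.environ["DISPLAY"] = f":{display}"  # automation tools read this
    return proc

# Usage (inside the container):
#   start_virtual_display()
#   subprocess.Popen(["chromium", "--no-first-run"])  # app under automation
```

Because the whole desktop lives in memory, the container can be thrown away after each run, which is the main appeal of this pattern.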
Cost and latency
- Every screenshot is a vision input — expect 1500–4000 input tokens per turn.
- A 10-step task typically costs $0.10–$0.50 on Sonnet 4.6 ($3/$15 per M tokens, 2026).
- P50 latency ~3–6s per step. Much slower than an API. Batch decisions when possible.
- Caching the system prompt saves 50%+ on prefix-heavy loops.
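The per-task figures above are easy to sanity-check. A back-of-envelope estimator using the quoted Sonnet rates; the per-step token counts are illustrative midpoints, not measurements:

```python
# Cost estimate from the stated rates: $3 input / $15 output per M tokens.
IN_PER_M, OUT_PER_M = 3.00, 15.00

def task_cost(steps, in_tokens_per_step, out_tokens_per_step):
    """Dollar cost of a task given steps and assumed tokens per step."""
    tin = steps * in_tokens_per_step
    tout = steps * out_tokens_per_step
    return tin / 1e6 * IN_PER_M + tout / 1e6 * OUT_PER_M

# 10 steps at ~2,500 input / 300 output tokens each:
print(round(task_cost(10, 2500, 300), 2))  # -> 0.12, inside the quoted range
```

Screenshot size dominates the input side, which is why downscaling captures and caching the stable prompt prefix move the needle most.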
Computer Use is the right tool when no API exists. For anything with a clean API, use the API — it's 100x faster and cheaper.
