The Agentic Screenshot Loop: Debugging Browser Automation with Claude Code
Porting a Playwright automation to agent-browser taught me a tight debugging loop: annotate, screenshot, hand to Claude, fix, re-run.
I have an eight-step Playwright automation that fills out orders on a medical lab portal. Login, provider selection, patient info, test panels, ICD codes, insurance, submission. It works. As an exercise, I decided to port it to agent-browser — a CLI that interacts with pages through the accessibility tree instead of CSS selectors.
The port forced me into a debugging workflow I hadn’t used before: a tight loop between annotated screenshots and Claude Code. The workflow turned out to be more interesting than the port itself.
How Agent-Browser Works
Instead of Playwright’s page.getByRole('button', { name: 'NEXT' }).click(), agent-browser takes a snapshot of the accessibility tree and returns numbered refs:
- button "close" [ref=e1]
- radio "Bill to Insurance" [ref=e2]
- textbox "Insurance Company Name*" [ref=e8]
- radio "Self" [ref=e9]
- button "NEXT" [ref=e18]
You interact by ref: click @e18, fill @e8 "Aetna". Refs are scoped to each snapshot — they change whenever the page state changes. There’s no persistent handle to an element. Every interaction starts with a fresh snapshot.
I wrapped this in an AgentBrowserPage class to keep the step code readable:
await page.snapshot();
const nextRef = page.findRef('button', 'NEXT');
await page.clickRef(nextRef);
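For reference, here is a minimal sketch of what the lookup inside such a wrapper might look like. Everything here is hypothetical except the snapshot format, which is the one shown above: each line carries a role, a quoted label, and a [ref=eN] marker.

```javascript
// Parse snapshot lines like: - button "NEXT" [ref=e18]
// Return the ref of the first element whose role matches and whose
// label contains the search text. A sketch of the wrapper's internals,
// not agent-browser's own API; the real wrapper would fetch the
// snapshot text from the CLI first.
function findRefIn(snapshotText, role, labelSubstring) {
  const lineRe = /^\s*-\s*(\w+)\s+"([^"]*)"\s+\[ref=(e\d+)\]/;
  for (const line of snapshotText.split("\n")) {
    const m = line.match(lineRe);
    if (m && m[1] === role && m[2].includes(labelSubstring)) {
      return m[3]; // e.g. "e18"
    }
  }
  return null;
}
```

Because refs are invalidated by any page change, the wrapper re-snapshots before every interaction instead of caching refs.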
The first six steps ported cleanly. Step seven broke in a way that sent me down a rabbit hole.
The Screenshot Loop
Browser automation errors lie to you. The error said “can’t find Insurance Company Name field.” The actual problem was that the page hadn’t navigated — we were still on the test selection step, not the insurance step. The field wasn’t missing; we weren’t on the right page.
This is the fundamental challenge with browser automation debugging. The state is invisible unless you look at it. Log output tells you what the code did, not what the page looks like.
Agent-browser has a screenshot command that overlays red numbered boxes on every interactive element, matching the refs from the snapshot. This is the key to the whole workflow. When you take an annotated screenshot, you see exactly what the agent sees — every clickable element, every fillable field, labeled with its ref number.
The loop:
1. Add a screenshot call before the failing step
2. Run the automation
3. Open the screenshot — is the page what you expect?
4. Hand the screenshot to Claude Code — “here’s the page, here’s the code, find the mismatch”
5. Fix, re-run, screenshot again
Step 4 is where this becomes agentic. Claude Code is multimodal — it reads the annotated screenshot, cross-references the ref numbers against the code, and identifies mismatches directly. The conversation is short: “the screenshot shows two NEXT buttons, your code grabs the first match, that’s the pagination button not the modal button.” Fix. Re-run. New screenshot. Verify.
What the Port Taught Me
Porting working Playwright code to agent-browser isn’t a syntax translation. The two tools have fundamentally different interaction models, and the gaps show up in specific ways.
Ambiguity in the accessibility tree
Playwright’s getByTestId('next-button-test') is unambiguous — it targets one element. Agent-browser’s findRef('button', 'NEXT') returns the first button in the accessibility tree whose label includes “NEXT.” When there are two NEXT buttons on the page (one in a modal, one in a pagination bar), you get the wrong one.
The fix is to use more specific commands — find testid next-button-test click — or drop into page.evaluate to query the DOM directly. Agent-browser supports both, but you have to know when the accessibility tree isn’t specific enough. The annotated screenshot makes this obvious: if two elements have similar labels, you can see it.
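One cheap guard is to make ambiguity an error instead of a silent first-match. A hypothetical strict variant of the lookup, working over the same snapshot text format:

```javascript
// Collect ALL refs matching a role + label substring from snapshot
// lines like: - button "NEXT" [ref=e18]. Throws when the match is
// ambiguous, so two NEXT buttons fail loudly instead of silently
// clicking the wrong one. A sketch, not agent-browser's real API.
function findRefStrict(snapshotText, role, labelSubstring) {
  const lineRe = /^\s*-\s*(\w+)\s+"([^"]*)"\s+\[ref=(e\d+)\]/;
  const matches = [];
  for (const line of snapshotText.split("\n")) {
    const m = line.match(lineRe);
    if (m && m[1] === role && m[2].includes(labelSubstring)) {
      matches.push(m[3]);
    }
  }
  if (matches.length === 0) {
    throw new Error(`no ${role} matching "${labelSubstring}"`);
  }
  if (matches.length > 1) {
    throw new Error(`ambiguous "${labelSubstring}": ${matches.join(", ")}`);
  }
  return matches[0];
}
```

The error message lists the competing refs, which pairs well with the annotated screenshot: you can look up both numbers and see which element you actually wanted.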
Wait strategies don’t port automatically
The Playwright version had retry loops, .waitFor() calls, and visibility checks baked into every step. When porting, it’s tempting to just translate the clicks and fills and add a fixed delay. But browser automation is asynchronous — page transitions take variable time, and a 500ms delay that works locally will fail on a slow connection.
The pattern that works: poll for a known element on the target page.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (let attempt = 1; attempt <= 10; attempt++) {
  await delay(1000);
  const found = await page.evaluate(() =>
    document.body.innerText.includes('Bill to Insurance')
  );
  if (found) break;
}
This is what the Playwright version was already doing with its selector retry loops. The agent-browser version just needs to do it explicitly.
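The retry loop generalizes into a small helper. This is a sketch under one assumption: the check is any async predicate, so it works whether it wraps an agent-browser evaluate call or something else entirely.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll an async predicate until it returns true or attempts run out.
// Returns true on success, false on timeout, so the caller decides
// whether a missing page is fatal or just means "retry the step".
async function waitFor(check, { attempts = 10, intervalMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    if (await check()) return true;
    await delay(intervalMs);
  }
  return false;
}
```

A step then reads as a one-liner, e.g. waiting for the insurance page by polling for its “Bill to Insurance” text before touching any fields.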
The accessibility tree has gaps
Some elements don’t surface cleanly in the accessibility tree. A checkbox whose label is a 200-word legal paragraph won’t match a findRef('checkbox', 'Signed and submitted by') call, because the snapshot label is the full paragraph text, not the substring you’re searching for.
page.evaluate is the escape hatch. You drop into the browser context and query the DOM by name attribute, id, or whatever selector the element actually has:
await page.evaluate(() => {
  const cb = document.querySelector('input[name="providerAuthorization"]');
  // click() rather than setting .checked, so the page's change handlers fire
  if (cb && !cb.checked) cb.click();
});
This is the same thing Playwright does under the hood. Agent-browser just makes the boundary between accessibility tree and DOM more explicit.
Three Interaction Modes
The system that emerged uses three modes, chosen per-element based on what the annotated screenshot reveals:
- Snapshot refs for elements with unique, short labels — buttons, text fields, radio buttons
- Testid commands for elements with data-testid attributes — stable and unambiguous
- page.evaluate for everything the accessibility tree can’t cleanly surface — long labels, nameless elements, anything needing DOM traversal
Not elegant. But each mode covers the others’ weaknesses, and the screenshot makes it clear which mode to reach for. If the element shows up with a clean label in the ref list, use refs. If it has a testid, use that. If neither, evaluate.
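That decision process can be encoded as a fallback chain. The helper below is hypothetical glue code, not agent-browser API; each strategy is an async function that returns true if it handled the element and false to fall through to the next mode.

```javascript
// Try location strategies in order (refs, then testid, then evaluate).
// Each strategy returns true when it handled the element, false to
// fall through. Hypothetical wrapper code sketching the mode selection.
async function clickWithFallback(strategies) {
  for (const strategy of strategies) {
    if (await strategy()) return true;
  }
  throw new Error("element not reachable by refs, testid, or evaluate");
}
```

In practice each strategy closes over the element description: the refs strategy re-snapshots and searches labels, the testid strategy issues the testid-based command, and the evaluate strategy runs a DOM query in the browser context.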
Why This Is an Agentic Workflow
The screenshot loop works specifically because Claude Code can read images. Previous approaches to browser automation debugging relied on log parsing, DOM dumps, or manual inspection. The annotated screenshot collapses all of that into a single artifact that both the human and the AI agent can reason about simultaneously.
The conversation pattern is consistent:
- Me: “It’s failing at the insurance step. Here’s the screenshot.”
- Claude Code: reads screenshot, reads code, identifies that the page hasn’t navigated
- Me: “Fix it.”
- Claude Code: writes the fix, runs it, takes a new screenshot, verifies
Each iteration takes a couple of minutes. The screenshot eliminates the back-and-forth of “what’s on the page” / “can you check if X element exists” / “what does the DOM look like.” You both see the same thing. The debugging conversation is about why it’s wrong, not what is wrong.
If you’re doing any browser automation with AI agents, build annotated screenshots into your workflow. It changes the debugging loop from “describe the page to me” to “look at this and tell me what’s wrong.”