Engineering Deep Dive: The Proven Commands Ledger

Why exact executed Playwright commands survive context pruning better than summaries, and where Set-based dedup still matters.

The generator's hardest job is not clicking buttons. It is remembering exactly which buttons it clicked after the conversation has been pruned three times. If a model reaches final code synthesis with only a fuzzy memory like "I think the button said Sign In," it starts abbreviating locator names, rephrasing labels, and inventing steps it did not really execute. That is how you get pretty generated tests that fail immediately.

The original design note for this problem was simple: keep a Set<string> of exact Playwright commands that have already worked, because set membership is O(1) and deduplication should never become a second memory problem. That design is still conceptually right. The current implementation evolved into a richer LedgerEntry[] because we also need route grouping, assertion markers, success state, retries, and iteration numbers. But the dedup core is still set-based when it matters.

What Enters The Ledger

Not every tool result is useful memory. The generator only records tools whose output can be transcribed into real Playwright: clicks, typing, option selection, drag, navigation, waits, and the verification calls that confirm assertions. Every entry stores a step number, the MCP tool name, extracted code lines, the current route, whether the tool succeeded, whether the entry is an assertion, and the iteration it came from.

That route field is more important than it looks. The final session recording groups commands by page, which gives the model a much stronger scaffold for transcription than a flat bag of lines. When the page boundary changes, the rendered ledger inserts # Page: /route markers. That, in turn, makes it obvious where page.waitForURL() belongs in the final code.

Deduplication Has To Be Constant Time

Why insist on O(1) dedup? Because the ledger is hit on every successful tool result and again during pruning and persistence. If dedup were an O(n) scan over growing arrays, memory preservation itself would become a scaling problem during long generations. The implementation uses a set when deriving persisted proven commands from the final test code:

Topics: Engineering, Deep Dive, Architecture, Prompting.

Read the full article · Get Started Free