Engineering Deep Dive: How We Built Self-Healing Tests

A code-level walkthrough of the Validate.QA healing orchestrator: logic preflight, script-driven repair, MCP browser fallback, and rate-limit detection.

"Self-healing tests" is one of those phrases that sounds impressive and means almost nothing unless you describe the failure loop in detail. A lot of products use it to mean "we retried and the flake went away" or "we swapped a selector after a DOM diff." That is not what we built at Validate.QA. Our implementation is a hybrid orchestrator with a logic preflight, a cheap script-driven phase, and a browser-backed fallback that can walk the live UI through MCP when static reasoning is no longer enough.

The reason the system is split is economic, not aesthetic. Most broken tests do not need a full browser agent. They need a better locator, a page-ready wait, a corrected URL regex, or a popup dismissal that should have been wrapped in a try-catch. Those are cheap to infer from failing code, screenshots, and step traces. But a minority of failures are real UI mutations: a modal became a page, a sign-in form was redesigned, a navigation step now lands in a different scope. For those, you need live exploration and fresh proven commands from the browser itself.
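The cheap-first escalation described above can be sketched as a small loop. This is an illustrative sketch, not the Validate.QA implementation: `healWithEscalation`, `Phase`, `Budgets`, and the result fields are hypothetical names, and the phases are synchronous only to keep the example short (real phases would call an LLM or a browser and be asynchronous).

```typescript
// Hypothetical sketch: phase functions and budgets are injected, so the only
// thing this code demonstrates is the order of escalation.
type PhaseResult = { fixed: true; code: string } | { fixed: false };
type Phase = (failingCode: string, attempt: number) => PhaseResult;

interface Budgets {
  scriptAttempts: number;    // cheap repairs inferred from code, screenshots, traces
  browserIterations: number; // expensive live-UI exploration
}

interface EscalationResult {
  healedCode: string | null;
  healedBy: "script" | "browser" | null;
  scriptAttemptsUsed: number;
  browserIterationsUsed: number;
}

function healWithEscalation(
  failingCode: string,
  scriptPhase: Phase,
  browserPhase: Phase,
  budgets: Budgets
): EscalationResult {
  // Phase 1: try the cheap script-driven repair until its budget runs out.
  let scriptUsed = 0;
  for (let attempt = 1; attempt <= budgets.scriptAttempts; attempt++) {
    scriptUsed = attempt;
    const result = scriptPhase(failingCode, attempt);
    if (result.fixed) {
      return {
        healedCode: result.code,
        healedBy: "script",
        scriptAttemptsUsed: scriptUsed,
        browserIterationsUsed: 0,
      };
    }
  }

  // Phase 2: only now pay for the browser-backed fallback.
  let browserUsed = 0;
  for (let iteration = 1; iteration <= budgets.browserIterations; iteration++) {
    browserUsed = iteration;
    const result = browserPhase(failingCode, iteration);
    if (result.fixed) {
      return {
        healedCode: result.code,
        healedBy: "browser",
        scriptAttemptsUsed: scriptUsed,
        browserIterationsUsed: browserUsed,
      };
    }
  }

  // Neither phase produced a fix within budget.
  return {
    healedCode: null,
    healedBy: null,
    scriptAttemptsUsed: scriptUsed,
    browserIterationsUsed: browserUsed,
  };
}
```

With budgets of six script attempts and four browser iterations, a test that only needs a better locator never reaches the browser phase, which is the economic point of the split.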

The entry point is executeTestWithHealing() in mcp-healer.ts. The contract is simple: given the failing Playwright code, the test steps, and the last run context, return one of four outcomes. Either the test passes with healed code, the application is broken and the test should be quarantined as @bug-found, the environment is misconfigured and the run should be reported as ERROR, or the healer genuinely could not produce a safe fix.
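One way to make that contract explicit is a discriminated union, so callers must handle every outcome. Only executeTestWithHealing(), @bug-found, and ERROR come from the text above. The type name, the `kind` values, and every field below are assumptions made for illustration.

```typescript
// Hypothetical result type for executeTestWithHealing(); names are assumed.
type HealingOutcome =
  | { kind: "HEALED"; healedCode: string }
  | { kind: "BUG_FOUND"; tag: "@bug-found"; evidence: string }
  | { kind: "CONFIG_ERROR"; status: "ERROR"; reason: string }
  | { kind: "UNHEALABLE"; reason: string };

// Maps each outcome to a human-readable summary. The `never` assignment in
// the default branch makes the compiler reject an unhandled new variant.
function describeOutcome(outcome: HealingOutcome): string {
  switch (outcome.kind) {
    case "HEALED":
      return "test passes with healed code";
    case "BUG_FOUND":
      return `application is broken, quarantined as ${outcome.tag}`;
    case "CONFIG_ERROR":
      return `environment misconfigured, reported as ${outcome.status}: ${outcome.reason}`;
    case "UNHEALABLE":
      return `no safe fix produced: ${outcome.reason}`;
    default: {
      const unhandled: never = outcome;
      return unhandled;
    }
  }
}
```

Keeping "could not produce a safe fix" as its own variant, rather than folding it into a generic failure, lets the caller distinguish a healer limitation from an application bug.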

Step 0: Refuse To Heal Rate Limits

The most important architectural choice happens before the first AI call. We added a pure-logic rate-limit guard because an LLM is the wrong tool for identifying obvious 429s. The preflight runs in detectRateLimitBlock() and inspects the prior failure's error, logs, step-level API summaries, and tags. It short-circuits on three primary signals: HTTP 429, an exhausted ratelimit-remaining header, or a body phrase like "too many requests" or "try again in 30 seconds."
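A minimal sketch of such a guard follows. The function name detectRateLimitBlock() and the three signals come from the text above. The input shape (`PriorFailure`, `ApiSummary`), the verdict type, and the exact regular expressions are assumptions. Tags are left out because the article does not say how they are used.

```typescript
// Hypothetical input shapes; the real structures in mcp-healer.ts may differ.
interface ApiSummary {
  status: number;
  headers: { [name: string]: string };
  bodySnippet?: string;
}

interface PriorFailure {
  errorMessage: string;
  logs: string[];
  apiSummaries: ApiSummary[];
}

interface RateLimitVerdict {
  blocked: boolean;
  signal?: "HTTP_429" | "REMAINING_EXHAUSTED" | "BODY_PHRASE";
}

// Assumed phrase patterns, modeled on the examples quoted in the article.
const RATE_LIMIT_PHRASES: RegExp[] = [
  /too many requests/i,
  /try again in \d+\s*(second|minute)s?/i,
];

function detectRateLimitBlock(failure: PriorFailure): RateLimitVerdict {
  for (const api of failure.apiSummaries) {
    // Signal 1: the server said so explicitly.
    if (api.status === 429) {
      return { blocked: true, signal: "HTTP_429" };
    }
    // Signal 2: a remaining-quota header (with or without an x- prefix) at zero.
    for (const name of Object.keys(api.headers)) {
      if (/ratelimit-remaining$/i.test(name) && api.headers[name].trim() === "0") {
        return { blocked: true, signal: "REMAINING_EXHAUSTED" };
      }
    }
  }

  // Signal 3: a throttling phrase anywhere in the error, logs, or bodies.
  const text = [
    failure.errorMessage,
    ...failure.logs,
    ...failure.apiSummaries.map((api) => api.bodySnippet ?? ""),
  ].join("\n");
  if (RATE_LIMIT_PHRASES.some((pattern) => pattern.test(text))) {
    return { blocked: true, signal: "BODY_PHRASE" };
  }

  return { blocked: false };
}
```

Because the check is plain string and number comparison, it is deterministic, costs nothing per run, and can execute before any model is invoked.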

That sounds trivial, but it changes system behavior dramatically. Without this guard, a healer wastes six Phase 1 attempts and four browser iterations trying to "fix" perfectly correct Playwright that just happened to hit application throttling. With the guard, the run is converted to CONFIG_ISSUE: RATE_LIMIT_BLOCKED, posted as ERROR instead of FAILED, and excluded from quarantine logic. The server can then schedule a delayed retry and enforce a one-retry-per-hour cap to avoid storms.
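The conversion and the retry cap could look roughly like the sketch below. The CONFIG_ISSUE: RATE_LIMIT_BLOCKED reason, the ERROR status, the quarantine exclusion, and the one-retry-per-hour cap come from the text above. The class name, the per-test keying, the configurable delay, and counting the hour from the moment a retry is scheduled are all assumptions.

```typescript
const RETRY_WINDOW_MS = 60 * 60 * 1000; // one retry per hour

// Hypothetical record posted for a rate-limited run.
interface RateLimitedRun {
  status: "ERROR"; // not FAILED
  reason: "CONFIG_ISSUE: RATE_LIMIT_BLOCKED";
  quarantineEligible: false; // excluded from quarantine logic
  retryAt: number | null; // epoch ms, or null when the cap blocks a retry
}

class RateLimitRetryPolicy {
  // Assumption: the cap is tracked per test ID.
  private readonly lastScheduledAt = new Map<string, number>();
  private readonly retryDelayMs: number;

  constructor(retryDelayMs: number) {
    this.retryDelayMs = retryDelayMs;
  }

  // Converts a rate-limited run into an ERROR record and schedules at most
  // one delayed retry per test within the window.
  record(testId: string, nowMs: number): RateLimitedRun {
    const last = this.lastScheduledAt.get(testId);
    const withinCap = last !== undefined && nowMs - last < RETRY_WINDOW_MS;

    let retryAt: number | null = null;
    if (!withinCap) {
      this.lastScheduledAt.set(testId, nowMs);
      retryAt = nowMs + this.retryDelayMs;
    }

    return {
      status: "ERROR",
      reason: "CONFIG_ISSUE: RATE_LIMIT_BLOCKED",
      quarantineEligible: false,
      retryAt,
    };
  }
}
```

Passing the clock in as `nowMs` keeps the policy deterministic, so the cap can be unit-tested without waiting an hour.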

Topics: Engineering, Deep Dive, Architecture, Self-Healing.
