Engineering Deep Dive: Batch Heal Orchestrator Internals

Why triage, shared auth repair, isolated mini-loops, and suite review make batch healing dramatically cheaper than naive per-test repair.

Per-test healing is the obvious architecture and the wrong economic baseline. If 30 tests fail after the same auth-page redesign, healing them independently means 30 agents re-read the same helper file, rediscover the same missing button, and spend the same expensive tokens on the same root cause. That is why repo-mode healing in Validate.QA lives in heal-orchestrator.ts instead of reusing the single-test MCP healer.

The orchestrator treats a failing suite as a correlated batch. It runs one AI triage over all recent failures, applies mechanical decisions without AI where possible, fixes shared auth once if the evidence says auth is systemic, runs isolated 5-turn mini-loops only on the remaining tests, and then performs a 6-turn review before the suite verifier does a real run. That is where the 25x cost reduction comes from. Not from a cheaper model, but from avoiding redundant work.

Phase 1: Triage As A Planning Compression Layer

Triage is one audited AI call over the whole failing set. The input is already pre-grouped with up to three distinct error strings per test, fail counts, total recent runs, tags, and whether a committed file path exists. The model must return a single classify_tests tool call covering every failure. Decisions are constrained to fix, flaky, or skip, with an optional mechanical fix hint and a separate boolean for shared auth-helper failure.

The triage prompt is not just "use your judgment." It encodes concrete rules. Known flaky tests with low fail counts stay flaky. HTTP 500s become skip decisions that can later quarantine as app bugs. Timing failures with very low fail counts lean flaky. Persistent known flaky tests are forced back into fix territory. And if three or more @needs-auth tests fail with auth-shaped errors, the orchestrator treats auth as the likely shared root cause.

Phase 2: Mechanical Decisions With Zero Browser Cost

After triage, tests marked flaky or skip do not go through an agent loop at all. Flaky tests get tagged @flaky. Skipped tests get @heal-skip, and if the skip reason starts with APP_BUG: the test is quarantined immediately. This matters because doing nothing is a valid outcome when the suite is telling you about a product regression rather than a locator problem.

Topics: Engineering, Deep Dive, Architecture, Orchestration.

Read the full article · Get Started Free