We publish an accuracy number — and the method behind it
85% valid-test rate (as of June 2026). Self-reported on our internal benchmark suite. The full method is below so it can be reproduced and challenged — and we're formalizing an independent, public benchmark.
Of the tests Validate.QA generates from a URL, the share that are correct and meaningful — they run, pass for the right reason, and assert something that actually matters. It is deliberately not a vanity "tests created" count.
Valid-test rate: The percentage of generated tests that are correct and meaningful: they execute, pass for the right reason, and assert something real — not just "a page rendered." This is our headline number.
False-positive rate: How often a test fails when the app is actually fine — a drifted selector or a bad assertion. False positives erode trust faster than anything else, so we track this explicitly and drive it down.
Bug recall: Of the real defects present in a flow, how many the generated tests actually catch. Recall matters more than raw count — a thousand tests that miss the bug are worthless.
Flake rate: How often a test returns a different result on re-run with no change. Flake is the top reason teams abandon suites; assertions pinned to dynamic data are rejected before they ship.
A fixed set of real apps — We run against real web apps that exercise the things that actually break tools — authentication, multi-step forms, dashboards, and checkout — plus apps with deliberately planted, known bugs so we can measure recall, not just pass rate.
Tested exactly like a customer — Each app is driven the way you'd use it: paste the URL, supply test credentials for gated areas, and let the agent explore. No hand-authoring, no per-app tuning — the benchmark measures the product, not a demo.
Every test is labeled — Each generated test is scored as valid (passes for the right reason with a real assertion), false-positive (fails on a healthy app), or invalid (meaningless or duplicated). Assertions are verified against the live DOM during generation, before a test is trusted.
Recall on seeded bugs — On the apps with known defects, we measure how many the generated suite catches — and confirm each real bug is quarantined and flagged as a found bug, never quietly "healed" into passing.
Re-run for flake — Suites are re-run to measure flake. Assertions tied to ephemeral data (timestamps, generated IDs) are rejected by a stability classifier and replaced with durable checks before shipping.
Re-measured every release — The numbers are re-run each release rather than quoted once. Because the output is native Playwright you own, you can reproduce them yourself by reading and running the tests.
The denominator is meaningful tests — We don't inflate the number with trivial checks. "A page loaded" assertions don't count toward valid tests — only checks that prove the app actually did the right thing.
We never rewrite a real bug to pass — When the app is genuinely broken, the test is quarantined and tagged as a found bug — not silently rewritten to go green. Accuracy can't be gamed by hiding defects, because hiding a defect lowers bug recall.
You can audit the number yourself — Every generated test is native Playwright in your own repo. Nothing is locked in a dashboard — so our accuracy claim is one you can verify by reading and running the code.
These figures are self-reported as of the date shown, measured on our internal benchmark suite. An independent, reproducible benchmark is in progress.
Accuracy varies by app. Standard CRUD apps score higher than heavily stateful flows or unusual auth; a single headline number hides that spread, which is why the method above matters.
It is not a full penetration test — the security layer is an automated audit — and it does not replace human exploratory testing of judgment-heavy cases.
Numbers change as the product and the benchmark evolve. We'd rather publish a method you can hold us to than a static marketing figure.
See the methodology · Get Started Free