Engineering Deep Dive: Voice Intent Correlation Within 3 Seconds

The timing, batching, and scoring pipeline that aligns Whisper transcript segments with DOM actions on a shared test timeline.

Browser actions tell you what happened. Voice narration tells you why it mattered. If you only record clicks and inputs, a generated test can reproduce mechanics but misses the user's intent: "this should show an error," "the total should update," "I expect to land on dashboard." If you only keep the transcript, you know the expectation but not the exact UI action that triggered it. The quality jump comes from correlating both streams on a shared session timeline.

The correlation system in Validate.QA is intentionally simple and fast. Audio is streamed in 15-second WebM Opus chunks, transcribed with Whisper at segment granularity, classified in batches of 20 transcript segments, and then matched to DOM actions inside a plus-or-minus 3-second temporal window. The output is not "a transcript attached to a session." It is an enriched timeline where specific expectations and commands are anchored to specific browser events.
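
As a rough illustration of the matching step, the sketch below pairs each transcript segment with the nearest DOM action inside the plus-or-minus 3-second window. The type names, the correlate function, anchoring on the segment's start time, and the nearest-in-time rule are all assumptions made for this example; the actual scoring in Validate.QA may weigh other signals.

```typescript
// Hypothetical shapes; field names are assumptions, not the product's schema.
interface TimedSegment {
  startMs: number; // session-relative start of the spoken segment
  text: string;
}

interface DomAction {
  id: string;
  type: string; // e.g. "click", "input"
  timestampMs: number; // session-relative time of the browser event
}

interface Correlation {
  segment: TimedSegment;
  actionId: string | null; // null when nothing falls inside the window
  deltaMs: number | null;
}

const WINDOW_MS = 3000; // the plus-or-minus 3-second window from the text

function correlate(
  segments: TimedSegment[],
  actions: DomAction[],
  windowMs: number = WINDOW_MS,
): Correlation[] {
  return segments.map((segment) => {
    let best: DomAction | null = null;
    let bestDelta = Number.POSITIVE_INFINITY;
    for (const action of actions) {
      const delta = Math.abs(action.timestampMs - segment.startMs);
      // Keep the closest action that is still inside the window.
      if (delta <= windowMs && delta < bestDelta) {
        best = action;
        bestDelta = delta;
      }
    }
    return best === null
      ? { segment, actionId: null, deltaMs: null }
      : { segment, actionId: best.id, deltaMs: bestDelta };
  });
}
```

With a click at 8,000 ms and an input at 12,500 ms, a segment starting at 10,000 ms attaches to the click (2,000 ms away), while a segment starting at 30,000 ms stays unmatched.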

Chunk Streaming Sets Up Everything Else

The voice pipeline starts in the Chrome extension's offscreen document, because a Manifest V3 background service worker has no DOM and cannot run MediaRecorder itself. Audio is captured via MediaRecorder, split into 15-second chunks, and each chunk is uploaded immediately to POST /api/sessions/:id/audio. Each chunk includes chunkOffsetMs, the elapsed time since session start, so the server can project per-chunk Whisper timestamps back onto a single session-relative clock.
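
The projection onto the session clock can be sketched as a small pure function. Whisper reports segment start and end in seconds relative to the start of the audio it was given, so adding chunkOffsetMs moves each segment onto the shared timeline. The interface and function names here are illustrative assumptions, and the sketch assumes chunkOffsetMs marks the start of the chunk.

```typescript
// Segment timing as returned per chunk: seconds from the start of that chunk.
interface ChunkSegment {
  start: number;
  end: number;
  text: string;
}

// The same segment expressed in milliseconds since session start.
interface SessionSegment {
  startMs: number;
  endMs: number;
  text: string;
}

// Hypothetical helper: shifts chunk-relative times by the chunk's offset.
function toSessionTimeline(
  chunkOffsetMs: number,
  segments: ChunkSegment[],
): SessionSegment[] {
  return segments.map((s) => ({
    startMs: chunkOffsetMs + Math.round(s.start * 1000),
    endMs: chunkOffsetMs + Math.round(s.end * 1000),
    text: s.text.trim(),
  }));
}
```

For the third 15-second chunk (chunkOffsetMs of 30,000), a segment that Whisper places at 2.5 s to 4.0 s lands at 32,500 ms to 34,000 ms on the session clock.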

The upload route responds immediately after storing chunk metadata, then runs transcription in the background. That detail exists for a practical reason: Chrome can tear down the offscreen document as soon as capture ends, so waiting for Whisper synchronously risks losing in-flight uploads. The server marks chunk state as PENDING, flips it to IN_PROGRESS, transcribes, stores segments, then marks the chunk completed.
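
A minimal sketch of that respond-first, transcribe-later flow follows, using an in-memory Map in place of real storage. Everything named here is an assumption for illustration: the acceptChunk and processChunk functions, the 202 status code, the COMPLETED spelling, and the FAILED state, which the text does not mention. The returned done promise exists only so the background work can be observed in a test.

```typescript
// Hypothetical in-memory model; a real server would persist chunk rows.
type ChunkState = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED";

interface StoredSegment {
  startMs: number;
  endMs: number;
  text: string;
}

interface ChunkRecord {
  id: string;
  chunkOffsetMs: number;
  state: ChunkState;
  segments: StoredSegment[];
}

// Stand-in for the Whisper call; injected so the flow is testable.
type Transcriber = (chunk: ChunkRecord) => Promise<StoredSegment[]>;

async function processChunk(
  record: ChunkRecord,
  transcribe: Transcriber,
): Promise<void> {
  record.state = "IN_PROGRESS";
  try {
    record.segments = await transcribe(record);
    record.state = "COMPLETED";
  } catch {
    record.state = "FAILED"; // assumed failure state, not named in the text
  }
}

function acceptChunk(
  store: Map<string, ChunkRecord>,
  id: string,
  chunkOffsetMs: number,
  transcribe: Transcriber,
): { status: number; done: Promise<void> } {
  // Store metadata first so the upload is durable before any slow work.
  const record: ChunkRecord = { id, chunkOffsetMs, state: "PENDING", segments: [] };
  store.set(id, record);
  // Defer transcription so the caller gets its response right away.
  const done = Promise.resolve().then(() => processChunk(record, transcribe));
  return { status: 202, done };
}
```

The important property is ordering: the chunk is recorded as PENDING and the response is produced before transcription begins, so a client that disappears right after uploading does not take the work with it.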

Whisper Is Used For Segments, Not For Meaning

Whisper's job is transcription plus timing, not interpretation. The server uses whisper-1 with the verbose_json response format and segment-level timestamps. It also filters obvious hallucinations, such as a stray "thanks for watching" or filler transcribed from near-silence, using known text patterns, no_speech_prob, and average log probability. The result is a list of timestamped transcript segments, each with a clamped confidence score.
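
The filtering step might look something like the sketch below. The avg_logprob and no_speech_prob fields are part of Whisper's verbose_json segment output; the thresholds, the pattern list, the function names, and the choice to derive confidence as exp(avg_logprob) clamped to the range 0 to 1 are assumptions for illustration, not the values Validate.QA uses.

```typescript
// Subset of the fields on a verbose_json segment.
interface WhisperSegment {
  start: number; // seconds
  end: number; // seconds
  text: string;
  avg_logprob: number;
  no_speech_prob: number;
}

interface CleanSegment {
  start: number;
  end: number;
  text: string;
  confidence: number; // always within [0, 1]
}

// Assumed values; tune against real recordings.
const KNOWN_HALLUCINATIONS: RegExp[] = [/thanks for watching/i, /^[\s.]*$/];
const MAX_NO_SPEECH_PROB = 0.6;
const MIN_AVG_LOGPROB = -1.0;

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Assumed mapping: average token log probability back to a probability.
function confidenceOf(seg: WhisperSegment): number {
  return clamp01(Math.exp(seg.avg_logprob));
}

function isLikelyHallucination(seg: WhisperSegment): boolean {
  return (
    KNOWN_HALLUCINATIONS.some((pattern) => pattern.test(seg.text)) ||
    seg.no_speech_prob > MAX_NO_SPEECH_PROB ||
    seg.avg_logprob < MIN_AVG_LOGPROB
  );
}

function cleanSegments(segments: WhisperSegment[]): CleanSegment[] {
  return segments
    .filter((seg) => !isLikelyHallucination(seg))
    .map((seg) => ({
      start: seg.start,
      end: seg.end,
      text: seg.text.trim(),
      confidence: confidenceOf(seg),
    }));
}
```

Combining a text pattern check with the two probability signals means a segment is dropped if any one of them flags it, which favors losing a little real speech over attaching a hallucinated sentence to a browser action.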
