VAD tells you whether the current audio frame contains speech. Used everywhere: skip transmitting silence (bandwidth), trigger STT (wake word), detect end-of-utterance (turn-taking in voice agents). The 2026 baseline is neural — much better than energy threshold.
Energy-threshold VAD
Compare frame energy to background noise floor. Cheap, but fooled by stationary noise (fans, traffic). Acceptable for bandwidth savings; not for triggering ASR.
Neural VAD (Silero, py-webrtcvad)
Small CNN/RNN classifies frames. ~5-10ms per frame, near-zero false positives on stationary noise. Default in voice agents (LiveKit, Whisper streaming).
End-of-utterance detection
Silence duration after detected speech. Tune by language and intent: English commands ~500ms, conversational ~800-1200ms. Too short = cuts user off; too long = laggy agent. Often the highest-leverage UX tuning in a voice agent.