Voice agents are the highest-stakes UX for ADKs: every millisecond of latency matters, and every misrecognition is embarrassing. The pipeline is conceptually simple (STT → LLM → TTS), but production-ready voice agents have 8-12 components and a tight latency budget.
Latency budget
Target: under 500ms from the moment the user stops speaking to the moment the agent starts speaking. Budget: VAD (50ms) + STT finalize (100ms) + LLM TTFT (150ms) + TTS first audio (100ms) + transport (50ms) = 450ms, which leaves about 50ms of headroom. Anything over 500ms feels laggy.
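The budget arithmetic can be checked mechanically. The sketch below uses the per-stage figures listed above; the stage keys and the helper names `headroom_ms` and `over_budget` are illustrative, not part of any particular ADK.

```python
# Per-stage budget in milliseconds, taken from the figures in the text.
BUDGET_MS = {
    "vad": 50,
    "stt_finalize": 100,
    "llm_ttft": 150,
    "tts_first_audio": 100,
    "transport": 50,
}
TARGET_MS = 500


def headroom_ms(budget=BUDGET_MS, target=TARGET_MS):
    """Return the milliseconds left over after summing every stage."""
    return target - sum(budget.values())


def over_budget(measured, budget=BUDGET_MS):
    """Return {stage: overshoot_ms} for stages measured above their budget."""
    return {
        stage: measured[stage] - limit
        for stage, limit in budget.items()
        if measured.get(stage, 0) > limit
    }
```

Feeding measured per-stage latencies from a trace into `over_budget` points at which stage to optimize first, rather than only showing that the total is too high.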
Streaming everywhere
Don't wait for the full STT result; start the LLM on partial transcripts (with a debounce, so it is not restarted on every word). Don't wait for the full LLM response; stream partial text into TTS. Handle interruptions the same way: when VAD detects the user speaking, cancel any in-flight LLM and TTS work.
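A minimal sketch of debounced partials plus cancel-on-interruption, using only `asyncio`. `TurnController`, `respond`, and the 150ms default debounce are assumptions made for illustration; in a real agent `respond` would be whatever coroutine streams the LLM output into TTS.

```python
import asyncio


class TurnController:
    """Debounces partial transcripts and cancels in-flight work on interruption."""

    def __init__(self, respond, debounce_s=0.15):
        self._respond = respond        # hypothetical coroutine: LLM + TTS streaming
        self._debounce_s = debounce_s  # assumed default, tune per deployment
        self._task = None

    def on_partial(self, text):
        """A new partial arrived: drop the pending run and restart the timer."""
        self.cancel()
        self._task = asyncio.create_task(self._run(text))

    def on_user_speech(self):
        """VAD heard the user while the agent was working or speaking: stop."""
        self.cancel()

    def cancel(self):
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _run(self, text):
        await asyncio.sleep(self._debounce_s)  # wait for the transcript to settle
        await self._respond(text)


async def demo():
    started = []

    async def fake_respond(text):
        started.append(text)
        await asyncio.sleep(1.0)  # stands in for a long LLM/TTS stream

    ctl = TurnController(fake_respond, debounce_s=0.02)
    ctl.on_partial("book")          # superseded before the debounce elapses
    ctl.on_partial("book a table")  # this one survives the debounce
    await asyncio.sleep(0.2)
    ctl.on_user_speech()            # interruption cancels the running response
    await asyncio.sleep(0)          # let the cancellation propagate
    return started
```

Only the last partial reaches `fake_respond`, and the interruption ends it early instead of letting the agent talk over the user.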
End-of-utterance detection
Detect the end of an utterance with a silence threshold after detected speech; 500-800ms is typical. Too short and the agent cuts the user off; too long and the response feels laggy. This is the single highest-leverage UX tuning knob. Adapt it per language and conversation type.
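The silence-threshold rule can be sketched as a small state machine fed with one VAD decision per audio frame. The `Endpointer` class, the 20ms frame size, and the 600ms default are illustrative assumptions; the threshold is the value to tune within the 500-800ms range.

```python
class Endpointer:
    """Declares end of utterance after `silence_ms` of silence following speech."""

    def __init__(self, silence_ms=600, frame_ms=20):
        self.silence_ms = silence_ms  # tunable threshold (assumed default)
        self.frame_ms = frame_ms      # duration of one VAD frame (assumed)
        self._heard_speech = False
        self._silence_run_ms = 0

    def push(self, is_speech):
        """Feed one VAD decision per frame; return True when the utterance ends."""
        if is_speech:
            self._heard_speech = True
            self._silence_run_ms = 0  # any speech resets the silence timer
            return False
        if not self._heard_speech:
            return False              # leading silence never triggers an endpoint
        self._silence_run_ms += self.frame_ms
        if self._silence_run_ms >= self.silence_ms:
            # Reset so the same instance can detect the next utterance.
            self._heard_speech = False
            self._silence_run_ms = 0
            return True
        return False
```

A short mid-sentence pause (for example 200ms) resets on the next speech frame and does not end the turn; only an uninterrupted run of silence reaching the threshold does.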
State and tool use
Voice agents call tools (book an appointment, check a status). Tool descriptions matter doubly: the model has to pick the right tool, and pick it fast, because tool selection sits inside the latency budget. Pre-warm the state of common tools to avoid a cold start in the loop.
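One way to pre-warm tool state is to run every tool's setup concurrently at session start, before the first user turn. This is a sketch under assumptions: `WarmTools` and the idea of a per-tool "loader" coroutine are hypothetical, standing in for things like opening a connection or fetching an auth token.

```python
import asyncio


class WarmTools:
    """Holds per-tool state that is loaded before the conversation starts."""

    def __init__(self, loaders):
        # loaders: tool name -> zero-argument coroutine function building its state
        self._loaders = loaders
        self._state = {}

    async def prewarm(self):
        """Run all loaders concurrently, ideally while the greeting plays."""
        names = list(self._loaders)
        results = await asyncio.gather(*(self._loaders[n]() for n in names))
        self._state.update(zip(names, results))

    async def get(self, name):
        """Return warmed state, falling back to a cold load if it was missed."""
        if name not in self._state:
            self._state[name] = await self._loaders[name]()
        return self._state[name]
```

With this shape, a tool call inside the turn loop pays only for the call itself; the setup cost is moved to a moment when the user is not waiting.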
Failure modes
Common failures: STT misrecognition (especially names and addresses), TTS mispronunciation, the LLM hallucinating tool arguments, and network latency spikes. Each needs a recovery: clarify, spell out, verify, retry. Real voice agents have these built in; without them they feel brittle.
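Making the failure-to-recovery pairing explicit in code keeps every failure from falling through to a generic error path. The one-to-one pairing below reads the two lists above in order, which is an interpretation; the `Failure` enum, the action strings, and `recovery_for` are illustrative names.

```python
from enum import Enum


class Failure(Enum):
    STT_MISRECOGNITION = "stt_misrecognition"
    TTS_MISPRONUNCIATION = "tts_mispronunciation"
    HALLUCINATED_TOOL_ARGS = "hallucinated_tool_args"
    NETWORK_SPIKE = "network_spike"


# Assumed pairing: the failures and recoveries from the text, matched in order.
RECOVERY = {
    Failure.STT_MISRECOGNITION: "clarify",      # ask the user to repeat or confirm
    Failure.TTS_MISPRONUNCIATION: "spell_out",  # spell the word letter by letter
    Failure.HALLUCINATED_TOOL_ARGS: "verify",   # read arguments back before acting
    Failure.NETWORK_SPIKE: "retry",             # retry the failed call
}


def recovery_for(failure):
    """Return the recovery action for a failure; raises KeyError if unmapped."""
    return RECOVERY[failure]
```

Raising on an unmapped failure is deliberate: a new failure type added to the enum without a recovery shows up in testing rather than in a live call.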