A WebRTC audio frame passes through 8-12 processing stages between mic and speaker. Each stage adds latency, changes quality, or improves robustness. Knowing the stages lets you debug 'why does my voice sound robotic' without guessing.
Capture and pre-processing
Mic → APM (Audio Processing Module): echo cancellation, noise suppression, automatic gain control (AGC), voice activity detection. Adds ~10ms. Defaults and tuning differ per browser and library, which makes this stage a perennial source of 'why does my voice sound weird' bugs.
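In a browser, the page-visible APM switches are the standard `echoCancellation`, `noiseSuppression`, and `autoGainControl` track constraints. The sketch below builds a constraints object for isolating one stage at a time when chasing an audio artifact. The helper `buildAudioConstraints` and the `ApmOptions` type are hypothetical names introduced here for illustration, and browsers may treat these constraints as hints rather than guarantees.

```typescript
// Hypothetical option bag; the three field names match the standard
// MediaTrackConstraints properties of the same name.
interface ApmOptions {
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

// Same shape as the argument navigator.mediaDevices.getUserMedia() accepts,
// declared locally so the sketch does not depend on DOM typings.
interface AudioOnlyConstraints {
  audio: ApmOptions;
  video: false;
}

// Start with every stage enabled, then apply the caller's overrides.
function buildAudioConstraints(overrides: Partial<ApmOptions> = {}): AudioOnlyConstraints {
  const defaults: ApmOptions = {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  };
  return { audio: { ...defaults, ...overrides }, video: false };
}

// Debugging idea: disable one stage, listen again, repeat with the next.
const withoutNoiseSuppression = buildAudioConstraints({ noiseSuppression: false });

// In a browser (not executed here):
//   const stream = await navigator.mediaDevices.getUserMedia(withoutNoiseSuppression);
//   console.log(stream.getAudioTracks()[0].getSettings()); // what was actually applied
```

If the artifact disappears with one stage disabled, that stage is the first suspect. Checking `getSettings()` matters because the requested value and the applied value can differ.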
Encoding and transport
APM output → Opus encoder (10-20ms frames). Encoded packets → RTP → SRTP encryption, sent over the path established by ICE (using STUN/TURN for NAT traversal). Together these add a 20-50ms baseline.
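The frame duration drives both delay and overhead: the encoder cannot emit a packet until a full frame of audio has been captured, and shorter frames mean more packets per second, each carrying fixed headers. The following sketch works through that arithmetic. The function `opusPacketStats` is a hypothetical helper, and the header sizes are assumptions (plain 12-byte RTP header with no extensions, an 80-bit SRTP authentication tag, UDP over IPv4); real sessions often carry header extensions and may use a different SRTP profile.

```typescript
interface PacketStats {
  samplesPerFrame: number;  // audio samples covered by one packet
  packetsPerSecond: number;
  payloadBytes: number;     // encoded Opus bytes per packet
  wireBytes: number;        // payload plus assumed headers
  wireKbps: number;         // bitrate actually seen on the network
}

// Assumed per-packet overhead in bytes (illustrative, not measured).
const RTP_HEADER_BYTES = 12;    // fixed RTP header, no extensions or CSRCs
const SRTP_AUTH_TAG_BYTES = 10; // assuming an 80-bit authentication tag
const UDP_HEADER_BYTES = 8;
const IPV4_HEADER_BYTES = 20;

function opusPacketStats(
  frameMs: number,
  bitrateBps: number,
  sampleRateHz: number = 48000, // Opus over RTP uses a 48 kHz clock
): PacketStats {
  const samplesPerFrame = (sampleRateHz * frameMs) / 1000;
  const packetsPerSecond = 1000 / frameMs;
  const payloadBytes = Math.ceil((bitrateBps * frameMs) / 8000);
  const wireBytes =
    payloadBytes + RTP_HEADER_BYTES + SRTP_AUTH_TAG_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES;
  const wireKbps = (wireBytes * 8 * packetsPerSecond) / 1000;
  return { samplesPerFrame, packetsPerSecond, payloadBytes, wireBytes, wireKbps };
}

// Compare the two frame sizes mentioned above at an assumed 32 kbps target.
const at20ms = opusPacketStats(20, 32000);
const at10ms = opusPacketStats(10, 32000);
console.log(at20ms.wireKbps, at10ms.wireKbps);
```

Under these assumptions, halving the frame size from 20ms to 10ms saves 10ms of buffering delay but raises the on-wire bitrate, because the same fixed headers are sent twice as often.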
Decoding and playback
RTP packets → jitter buffer (10-200ms) → Opus decoder → mixer (if multi-party) → output device. Total mouth-to-ear latency on a healthy connection: 100-300ms.
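The jitter buffer's depth follows how unevenly packets arrive. As a minimal sketch of that idea, the code below computes the running interarrival jitter estimate defined in RFC 3550 and maps it to a buffer target clamped to the 10-200ms range quoted above. The mapping (`targetBufferMs` with a multiplier of 3) is a hypothetical policy chosen for illustration; production jitter buffers use more elaborate statistics and this is not a description of any particular implementation.

```typescript
interface PacketTiming {
  sendMs: number; // sender timestamp, converted to milliseconds
  recvMs: number; // local arrival time in milliseconds
}

// RFC 3550 interarrival jitter: for consecutive packets compute
// D = (recv_j - recv_i) - (send_j - send_i), then J += (|D| - J) / 16.
function interarrivalJitter(packets: PacketTiming[]): number {
  let jitter = 0;
  let prev: PacketTiming | undefined;
  for (const cur of packets) {
    if (prev !== undefined) {
      const d = (cur.recvMs - prev.recvMs) - (cur.sendMs - prev.sendMs);
      jitter += (Math.abs(d) - jitter) / 16;
    }
    prev = cur;
  }
  return jitter;
}

// Hypothetical policy: hold a few multiples of the jitter estimate,
// clamped to the minimum and maximum buffer depth.
function targetBufferMs(
  jitterMs: number,
  multiplier: number = 3,
  minMs: number = 10,
  maxMs: number = 200,
): number {
  return Math.min(maxMs, Math.max(minMs, multiplier * jitterMs));
}

// Evenly paced packets produce zero jitter, so the buffer sits at its floor.
const steady: PacketTiming[] = [
  { sendMs: 0, recvMs: 50 },
  { sendMs: 20, recvMs: 70 },
  { sendMs: 40, recvMs: 90 },
];
console.log(targetBufferMs(interarrivalJitter(steady)));
```

The trade-off is direct: a deeper buffer hides more network variation but adds its full depth to the end-to-end delay, which is why this stage has the widest range in the pipeline.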
In short: APM → Opus encode → SRTP → jitter buffer → Opus decode → output. Each stage consumes part of the latency budget; the total is typically around 150ms.
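To see where the budget goes, the per-stage figures quoted in this article can simply be added up. The sketch below does that with a hypothetical `StageBudget` type and helper functions. It only includes stages for which the article gives a number, so decode, mixing, and output-device buffering are left out and the computed range is a partial budget, not a prediction.

```typescript
interface StageBudget {
  name: string;
  minMs: number;
  maxMs: number;
}

// Figures taken from the stage descriptions in this article.
const stages: StageBudget[] = [
  { name: "APM", minMs: 10, maxMs: 10 },
  { name: "Opus encode + RTP/SRTP transport", minMs: 20, maxMs: 50 },
  { name: "Jitter buffer", minMs: 10, maxMs: 200 },
];

// Sum the best-case and worst-case contribution of every stage.
function totalBudget(list: StageBudget[]): { minMs: number; maxMs: number } {
  return list.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

// The stage with the largest worst case is the first place to look
// when measured latency is too high.
function largestContributor(list: StageBudget[]): StageBudget | undefined {
  let worst: StageBudget | undefined;
  for (const s of list) {
    if (worst === undefined || s.maxMs > worst.maxMs) {
      worst = s;
    }
  }
  return worst;
}

const total = totalBudget(stages);
console.log(`${total.minMs}-${total.maxMs}ms`, largestContributor(stages)?.name);
```

With these inputs the jitter buffer dominates the worst case, which matches the practical advice to inspect buffer depth first when a call feels laggy.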