A WebRTC video frame goes from camera sensor → roughly 30 stages → display on the far end, ideally in under 200ms. Most of those stages are invisible. Understanding the pipeline is what lets you debug 'why is video frozen' or 'why is latency 800ms instead of 200ms'.
Capture + encode
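The camera sensor delivers one frame every ~33ms at 30fps. A hardware encoder (~10-30ms) compresses it into an H.264, VP8, VP9, or AV1 frame. The RTP packetizer splits that frame into ~1200-byte payloads, small enough to fit a typical 1500-byte MTU, and prepends an RTP header carrying a sequence number and timestamp to each one.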
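
The packetizer step can be sketched as below. This is a minimal illustration, not what any particular WebRTC stack does: it builds the 12-byte fixed RTP header and slices the frame naively, while real payload formats add codec-specific headers and fragmentation rules. The function name `packetize`, the SSRC value, and payload type 96 are assumptions chosen for the example.

```python
import struct

RTP_VERSION = 2
MAX_PAYLOAD = 1200        # bytes of payload per packet, as in the text above
VIDEO_CLOCK_HZ = 90_000   # RTP clock rate used for video


def packetize(frame: bytes, seq: int, timestamp: int, ssrc: int,
              payload_type: int = 96) -> list[bytes]:
    """Split one encoded frame into RTP packets with 12-byte fixed headers."""
    chunks = [frame[i:i + MAX_PAYLOAD] for i in range(0, len(frame), MAX_PAYLOAD)]
    packets = []
    for index, chunk in enumerate(chunks):
        # The marker bit is set on the last packet of the frame.
        marker = 1 if index == len(chunks) - 1 else 0
        byte0 = RTP_VERSION << 6                       # no padding, extension, or CSRCs
        byte1 = (marker << 7) | (payload_type & 0x7F)
        header = struct.pack(
            "!BBHII",
            byte0,
            byte1,
            (seq + index) & 0xFFFF,    # 16-bit sequence number wraps around
            timestamp & 0xFFFFFFFF,    # every packet of a frame shares one timestamp
            ssrc,
        )
        packets.append(header + chunk)
    return packets


# Hypothetical input: a 3000-byte stand-in for one encoded frame.
frame = bytes(3000)
packets = packetize(frame, seq=65535, timestamp=VIDEO_CLOCK_HZ // 30, ssrc=0x1234ABCD)
```

A 3000-byte frame becomes three packets: two full 1200-byte payloads and one 600-byte remainder, with only the last carrying the marker bit. The receiver uses the shared timestamp to group packets into a frame and the sequence numbers to detect loss and reordering.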
Network transport
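SRTP encrypts each RTP packet with keys negotiated over a DTLS handshake (DTLS-SRTP). A UDP socket sends the packets to the peer, either directly or relayed through a TURN server. The network adds 20-200ms of round-trip time, so 10-100ms one way. Per-packet jitter is handled by the jitter buffer on the receive side.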
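
To make "per-packet jitter" concrete, here is a sketch of the running interarrival jitter estimate defined in RFC 3550: each new packet's change in transit time is folded into a smoothed average with a gain of 1/16. The function name `estimate_jitter` and the sample timestamps are assumptions; the RFC works in RTP timestamp units, while this sketch uses milliseconds for readability.

```python
def estimate_jitter(send_ms: list[float], recv_ms: list[float]) -> float:
    """Running interarrival jitter, J += (|D| - J) / 16, in milliseconds."""
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_ms, recv_ms):
        transit = received - sent          # includes any fixed clock offset
        if prev_transit is not None:
            # D is the change in transit time between consecutive packets,
            # so a constant clock offset between sender and receiver cancels out.
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


# Hypothetical trace: three packets sent 33ms apart, the third delayed by 10ms.
steady = estimate_jitter([0, 33, 66], [50, 83, 116])
bumpy = estimate_jitter([0, 33, 66], [50, 83, 126])
```

Because only differences in transit time matter, the estimate needs no synchronized clocks. The 1/16 gain means a single late packet nudges the estimate rather than spiking it.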
Jitter buffer
The receive side holds incoming packets briefly (~30-100ms) to absorb jitter and put out-of-order packets back in sequence. The buffer size is adaptive: it grows on bad networks and shrinks on good ones. The cost is added latency equal to the buffer depth.
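
The hold-and-reorder behavior can be sketched with a min-heap keyed on sequence number. This is a toy model under stated assumptions, not the algorithm a production WebRTC stack uses: the class `JitterBuffer`, its method names, and the "4x measured jitter" sizing rule are all hypothetical, and sequence-number wraparound is ignored.

```python
import heapq


class JitterBuffer:
    """Toy jitter buffer: holds packets for depth_ms, releases them in order."""

    def __init__(self, depth_ms: float = 50.0):
        self.depth_ms = depth_ms
        self._heap = []            # entries are (seq, arrival_ms, payload)
        self._last_released = -1   # assumes sequence numbers never wrap

    def push(self, seq: int, arrival_ms: float, payload: bytes) -> bool:
        # A packet older than one already released is too late to be useful.
        if seq <= self._last_released:
            return False
        heapq.heappush(self._heap, (seq, arrival_ms, payload))
        return True

    def pop_ready(self, now_ms: float) -> list:
        """Release, lowest sequence number first, packets held for depth_ms."""
        ready = []
        while self._heap and now_ms - self._heap[0][1] >= self.depth_ms:
            seq, _, payload = heapq.heappop(self._heap)
            self._last_released = seq
            ready.append((seq, payload))
        return ready

    def adapt(self, jitter_ms: float) -> None:
        # Hypothetical sizing rule: aim for 4x the measured jitter,
        # clamped to the 30-100ms range quoted in the text.
        self.depth_ms = min(100.0, max(30.0, 4.0 * jitter_ms))


buf = JitterBuffer(depth_ms=50.0)
buf.push(2, 0.0, b"b")    # arrives first, but is second in sequence
buf.push(1, 10.0, b"a")   # arrives late and out of order
buf.push(3, 20.0, b"c")
early = buf.pop_ready(40.0)
in_order = buf.pop_ready(60.0)
```

Nothing is released at 40ms because the lowest-numbered packet has not yet been held for the full depth. At 60ms packets 1 and 2 come out in sequence order. That waiting is exactly the latency cost described above.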
Decode + render
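The RTP depacketizer reassembles packets into complete frames. A hardware decoder (~5-20ms) produces raw YUV. The renderer color-converts to RGB and displays. Frames are presented on the display's vsync, so each one can wait up to one refresh interval (about 16.7ms at 60Hz) before it appears.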
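
The color-conversion step looks like this for a single pixel. It is a sketch that assumes full-range BT.601 coefficients; real video is often limited-range or BT.709, and a real renderer runs the conversion on the GPU over whole planes rather than per pixel in Python. The function name `yuv_to_rgb` is an assumption.

```python
def yuv_to_rgb(y: int, cb: int, cr: int) -> tuple[int, int, int]:
    """Convert one full-range BT.601 YCbCr pixel (0-255 each) to 8-bit RGB."""

    def clamp(value: float) -> int:
        # Rounding can push results slightly outside 0-255, so clip them.
        return max(0, min(255, round(value)))

    # Chroma is stored with an offset of 128, so 128 means "no color".
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)


gray = yuv_to_rgb(128, 128, 128)
white = yuv_to_rgb(255, 128, 128)
reddish = yuv_to_rgb(76, 85, 255)
```

With both chroma values at 128 the pixel carries no color, so luma maps straight to equal R, G, and B. Picking the wrong matrix or range is a common cause of washed-out or tinted video.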
Where latency hides
| Stage | Typical latency |
|---|---|
| Capture | 33ms |
| Encode | 10-30ms |
| Network RTT/2 | 10-100ms |
| Jitter buffer | 30-100ms |
| Decode | 5-20ms |
| Render | 16-33ms |
| Total | ~105-315ms |
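
Adding up the per-stage ranges in the table gives the outer bounds of the budget, with every stage at its best or its worst at once. The sketch below does that sum and picks out the stage with the widest range, which is usually the first place to look when latency balloons. The names `STAGES_MS` and `budget` are assumptions for the example; the numbers are copied from the table.

```python
# (best case, worst case) in milliseconds, copied from the table above.
STAGES_MS = {
    "Capture": (33, 33),
    "Encode": (10, 30),
    "Network RTT/2": (10, 100),
    "Jitter buffer": (30, 100),
    "Decode": (5, 20),
    "Render": (16, 33),
}


def budget(stages: dict) -> tuple[int, int]:
    """Return (best, worst) end-to-end latency by summing each column."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = budget(STAGES_MS)

# The stage whose latency varies the most between best and worst case.
widest = max(STAGES_MS, key=lambda name: STAGES_MS[name][1] - STAGES_MS[name][0])

print(f"best {best_ms}ms, worst {worst_ms}ms, widest range: {widest}")
```

The sums come to 104ms and 316ms. The network and the jitter buffer together account for most of the spread, so an 800ms glass-to-glass measurement almost always points at one of those two rather than at capture, encode, or decode.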