A WebRTC video frame travels from the camera sensor through roughly 30 stages to the display on the far end, ideally in under 200ms. Most of those stages are invisible. Understanding the pipeline is what lets you debug 'why is video frozen' or 'why is latency 800ms instead of 200ms'.

Capture + encode

The camera sensor delivers one frame every ~33ms at 30fps. A hardware encoder (~10-30ms) compresses it into an H.264, VP8, VP9, or AV1 frame. The RTP packetizer splits the encoded frame into ~1200-byte payloads and adds an RTP header to each, carrying a sequence number and a timestamp.
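
The packetization step can be sketched as follows. This is a minimal illustration, not what a real WebRTC stack does: it chunks the frame generically, whereas real packetizers follow codec-specific payload formats. The function name `packetize`, the payload type 96, and the SSRC value are assumptions chosen for the example; the 12-byte header layout and the 90 kHz video clock come from the RTP specification.

```python
import struct

RTP_VERSION = 2
MAX_PAYLOAD = 1200          # bytes of payload per packet, as in the text
VIDEO_CLOCK_HZ = 90_000     # RTP clock rate used for video


def packetize(frame: bytes, first_seq: int, frame_index: int, fps: int = 30,
              payload_type: int = 96, ssrc: int = 0x1234ABCD) -> list[bytes]:
    """Split one encoded frame into RTP packets (generic chunking, not codec-aware)."""
    timestamp = (frame_index * VIDEO_CLOCK_HZ // fps) & 0xFFFFFFFF
    chunks = [frame[i:i + MAX_PAYLOAD] for i in range(0, len(frame), MAX_PAYLOAD)]
    packets = []
    for n, chunk in enumerate(chunks):
        # The marker bit flags the last packet of the frame.
        marker = 1 if n == len(chunks) - 1 else 0
        header = struct.pack(
            "!BBHII",
            RTP_VERSION << 6,                 # V=2, no padding, no extension, no CSRCs
            (marker << 7) | payload_type,
            (first_seq + n) & 0xFFFF,         # 16-bit sequence number wraps around
            timestamp,                        # every packet of a frame shares one timestamp
            ssrc,
        )
        packets.append(header + chunk)
    return packets
```

A 3000-byte frame yields three packets (two full 1200-byte payloads and one 600-byte remainder), each with 12 bytes of header. The receiver uses the sequence numbers to detect loss and reordering, and the shared timestamp to group packets back into a frame.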

Network transport

SRTP encrypts each RTP packet, using keys negotiated over DTLS (DTLS-SRTP). A UDP socket sends the packets to the peer, either directly or via a TURN server. The network adds 10-200ms of RTT. Per-packet jitter is absorbed by the jitter buffer on the receive side.
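
To make "jitter" concrete, here is a small sketch of the interarrival jitter estimator defined in RFC 3550: a running average of how much the transit time changes from one packet to the next, smoothed with a gain of 1/16. The function names are made up for this example, and times are in milliseconds for readability (the RFC works in RTP timestamp units).

```python
def update_jitter(jitter: float, prev_transit: float, transit: float) -> float:
    """One step of the RFC 3550 estimator: J += (|D| - J) / 16."""
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0


def estimate_jitter(send_times_ms, recv_times_ms) -> float:
    """Run the estimator over paired send/receive times for consecutive packets."""
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times_ms, recv_times_ms):
        transit = received - sent       # clock offset cancels out in the difference
        if prev_transit is not None:
            jitter = update_jitter(jitter, prev_transit, transit)
        prev_transit = transit
    return jitter
```

If every packet takes the same time to cross the network, the estimate stays at zero regardless of how large that delay is. Only variation in delay counts as jitter.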

Jitter buffer

The receive side queues incoming packets briefly (~30-100ms) to absorb jitter and reorder out-of-order packets. The buffer is adaptive: it grows on bad networks and shrinks on good ones. The cost is added latency equal to the buffer depth.
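
The behavior described above can be sketched as a toy buffer that holds each packet for a target depth and releases packets in sequence order. This is an illustration under simplifying assumptions, not how any particular WebRTC implementation works: it ignores sequence-number wraparound and lost packets, and the `JitterBuffer` class, its method names, and the `factor` of 3 are all hypothetical.

```python
import heapq


class JitterBuffer:
    """Toy jitter buffer: holds packets for depth_ms, releases them in sequence order."""

    def __init__(self, depth_ms: float = 60.0, min_ms: float = 30.0, max_ms: float = 100.0):
        self.depth_ms = depth_ms
        self.min_ms, self.max_ms = min_ms, max_ms
        self._heap = []   # entries are (seq, arrival_ms, payload), ordered by seq

    def push(self, seq: int, arrival_ms: float, payload: bytes) -> None:
        heapq.heappush(self._heap, (seq, arrival_ms, payload))

    def pop_ready(self, now_ms: float) -> list:
        """Release, in sequence order, packets that have waited at least depth_ms."""
        out = []
        while self._heap and now_ms - self._heap[0][1] >= self.depth_ms:
            seq, _, payload = heapq.heappop(self._heap)
            out.append((seq, payload))
        return out

    def adapt(self, jitter_ms: float, factor: float = 3.0) -> None:
        """Grow on jittery networks, shrink on calm ones, clamped to [min_ms, max_ms]."""
        self.depth_ms = max(self.min_ms, min(self.max_ms, factor * jitter_ms))
```

Because the lowest sequence number always sits at the head of the queue, a packet that arrived early but out of order waits behind its predecessor. That wait is the latency the buffer trades for smooth playback.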

Decode + render

The RTP depacketizer reassembles packets into frames. A hardware decoder (~5-20ms) produces raw YUV. The renderer color-converts to RGB and displays the result. Presentation is synchronized to the display refresh, so the output frame rate tracks the input rate.
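
The color-conversion step is simple arithmetic per pixel. The sketch below assumes limited-range BT.601 coefficients, which is only one of several possibilities: a given stream may signal BT.709 or full-range values instead, and real renderers usually do this on the GPU. The function names are made up for the example.

```python
def clamp8(x: float) -> int:
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(x))))


def yuv_to_rgb(y: int, u: int, v: int) -> tuple[int, int, int]:
    """Convert one limited-range BT.601 YUV sample to 8-bit RGB (assumed matrix)."""
    c, d, e = y - 16, u - 128, v - 128      # remove the luma and chroma offsets
    r = 1.164 * c + 1.596 * e
    g = 1.164 * c - 0.392 * d - 0.813 * e
    b = 1.164 * c + 2.017 * d
    return clamp8(r), clamp8(g), clamp8(b)
```

In limited range, luma 16 is black and 235 is white, so `yuv_to_rgb(16, 128, 128)` gives `(0, 0, 0)` and `yuv_to_rgb(235, 128, 128)` gives `(255, 255, 255)`.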

Where latency hides

| Stage | Typical latency |
|---|---|
| Capture | 33ms |
| Encode | 10-30ms |
| Network (RTT/2) | 10-100ms |
| Jitter buffer | 30-100ms |
| Decode | 5-20ms |
| Render | 16-33ms |
| Total | ~105-315ms |

The pipeline runs capture → encode → transport → jitter buffer → decode → render. Each stage has measurable latency; the jitter buffer is the biggest knob.
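
As a quick sanity check, the per-stage ranges can be added up, and a set of measured values can be ranked to see which stage dominates. This is a small sketch: the ranges are the ones listed per stage above, while the function names and the measured values in the usage note are hypothetical.

```python
# Per-stage latency ranges in milliseconds as (low, high), from the per-stage figures.
BUDGET_MS = {
    "capture": (33, 33),
    "encode": (10, 30),
    "network_one_way": (10, 100),
    "jitter_buffer": (30, 100),
    "decode": (5, 20),
    "render": (16, 33),
}


def total_range(budget):
    """Best-case and worst-case end-to-end latency: the sums of the lows and the highs."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def rank_stages(measured_ms):
    """Sort measured per-stage latencies from largest to smallest."""
    return sorted(measured_ms.items(), key=lambda item: item[1], reverse=True)


print(total_range(BUDGET_MS))   # (104, 316)
```

Given made-up measurements such as `{"encode": 20, "jitter_buffer": 600, "network_one_way": 40}`, `rank_stages` puts `jitter_buffer` first, which is the kind of answer you want when a call shows 800ms instead of 200ms.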