Network packets don't arrive on a perfectly steady cadence: some come early, some late, and occasionally one goes missing. A jitter buffer holds incoming audio packets briefly and releases them at the player's clock rate, smoothing out the irregularity. The art is buffering just enough: too little and playback glitches, too much and latency grows.
Fixed vs adaptive buffers
Fixed buffer: always holds N packets (e.g., 60 ms of audio). Simple, with deterministic latency, but it copes poorly with changing network conditions. Adaptive buffer: tracks recent jitter statistics, growing on bad networks and shrinking on good ones. WebRTC's NetEQ is a widely used reference implementation.
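A fixed buffer can be sketched in a few lines. This is a minimal illustration, not any particular library's design: the class name `FixedJitterBuffer`, the 20 ms frame size, and the idea of anchoring the playout clock to the first packet's arrival are all assumptions made for the example.

```python
class FixedJitterBuffer:
    """Hypothetical fixed-delay buffer: every packet plays delay_ms after its nominal slot."""

    def __init__(self, delay_ms=60, frame_ms=20):
        self.delay_ms = delay_ms    # constant buffering delay
        self.frame_ms = frame_ms    # assumed audio duration per packet
        self._packets = {}          # seq -> payload
        self._base = None           # (first seq, first arrival time)

    def push(self, seq, arrival_ms, payload):
        # The first packet anchors the playout clock (assumption of this sketch).
        if self._base is None:
            self._base = (seq, arrival_ms)
        self._packets[seq] = payload

    def play_time_ms(self, seq):
        # Playout advances one frame per sequence number, at the player's clock rate.
        # Only valid after at least one push().
        base_seq, base_arrival = self._base
        return base_arrival + self.delay_ms + (seq - base_seq) * self.frame_ms

    def pop(self, seq):
        """Return the payload for seq, or None if it has not arrived by playout time."""
        return self._packets.pop(seq, None)
```

The player calls `pop(seq)` when the clock reaches `play_time_ms(seq)`; a `None` result means the packet is missing and must be concealed.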
The math
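Target buffer depth = mean + k × std-dev of the last N packet inter-arrival times. If those times were normally distributed, k=2 would cover about 97.7% of packets and k=3 about 99.9% (one-sided, since only late arrivals matter); real network jitter tends to be heavier-tailed, so treat these figures as optimistic. Update the mean and std-dev about every 100 ms: too often and you chase noise, too rarely and you lag behind regime changes.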
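The formula above can be sketched as a small sliding-window estimator. The class name `TargetDepthEstimator`, the 50-packet window, and the use of the population standard deviation are assumptions for illustration; the source does not specify them.

```python
from collections import deque
import statistics


class TargetDepthEstimator:
    """Hypothetical estimator: target = mean + k * std-dev of recent inter-arrival gaps."""

    def __init__(self, window=50, k=2.0):
        self.k = k
        self._gaps = deque(maxlen=window)   # last-N inter-arrival times in ms
        self._last_arrival_ms = None

    def on_packet(self, arrival_ms):
        # Record the gap since the previous packet; the first packet has no gap.
        if self._last_arrival_ms is not None:
            self._gaps.append(arrival_ms - self._last_arrival_ms)
        self._last_arrival_ms = arrival_ms

    def target_depth_ms(self):
        """Return the target depth in ms, or None until two gaps have been seen."""
        if len(self._gaps) < 2:
            return None
        mean = statistics.fmean(self._gaps)
        std = statistics.pstdev(self._gaps)
        return mean + self.k * std
```

Calling `on_packet` for every arrival but reading `target_depth_ms()` only on a timer (for example every 100 ms) keeps the statistics current without retuning the buffer on every packet.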
Late vs missing packet decision
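```python
LATE_THRESHOLD_MS = 20  # assumed tolerance; tune for your codec and frame size

def schedule_packet(seq, arrival_ms, play_ms):
    """Decide what to do with packet `seq` given its arrival and playout times."""
    if arrival_ms > play_ms:
        if arrival_ms - play_ms < LATE_THRESHOLD_MS:
            return 'PLAY_LATE'  # better late than never
        return 'DISCARD'  # too late, would play out of order
    return 'BUFFER'  # arrived in time
```
Stretching vs concealment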
Adaptive buffers can time-stretch audio (playing at, say, 0.99x to let the buffer grow or 1.01x to drain it) so that depth changes without dropouts. WSOLA (Waveform Similarity Overlap-Add) preserves pitch while changing speed, and it produces fewer audible artifacts than packet duplication or silence padding.
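Choosing the playback rate can be sketched as a simple controller with a deadband. This is an illustrative assumption, not how NetEQ or any specific stack decides: the function name `playout_rate`, the 10 ms deadband, and the fixed 1% adjustment are hypothetical, and the actual stretching (e.g., WSOLA) is left to the audio pipeline.

```python
def playout_rate(depth_ms, target_ms, deadband_ms=10.0, max_adjust=0.01):
    """Pick a playback speed that nudges the buffer depth toward the target."""
    error_ms = depth_ms - target_ms
    if error_ms > deadband_ms:
        return 1.0 + max_adjust   # buffer too deep: play faster to drain it
    if error_ms < -deadband_ms:
        return 1.0 - max_adjust   # buffer too shallow: play slower to refill it
    return 1.0                    # close enough: play at normal speed
```

The deadband keeps the rate from flipping back and forth when the depth hovers near the target.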
Trade-off curve
| Buffer size | Latency | Glitch rate |
|---|---|---|
| 20 ms | Low | High (~1%) |
| 60 ms | Medium | Low (~0.1%) |
| 200 ms | High (noticeable) | Very low (~0.01%) |
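The shape of this trade-off can be explored with a toy simulation. The sketch below assumes an exponential delay-variation model with a 10 ms mean, which is purely an illustrative choice; its output will not reproduce the figures in the table, and `late_fraction` is a hypothetical helper.

```python
import random


def late_fraction(buffer_ms, delays_ms):
    """Fraction of packets whose delay variation exceeds the buffer depth."""
    late = sum(1 for d in delays_ms if d > buffer_ms)
    return late / len(delays_ms)


rng = random.Random(0)  # seeded so runs are repeatable
# Assumed jitter model: exponentially distributed delay variation, 10 ms mean.
delays = [rng.expovariate(1 / 10.0) for _ in range(100_000)]

for buffer_ms in (20, 60, 200):
    print(buffer_ms, late_fraction(buffer_ms, delays))
```

Swapping in delay samples captured from a real network gives a more honest picture than any synthetic model.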