STT in 2026 means real-time accuracy that was research-grade five years ago. The choice between open (Whisper) and managed (Deepgram, AssemblyAI) is now mostly about latency, cost at your scale, and language coverage.
Whisper-class models
OpenAI Whisper, Distil-Whisper, faster-whisper. Open weights, self-hostable. Excellent multilingual. Large-v3 ~5% WER on English. Faster-whisper on GPU does real-time at ~0.05x RT factor.
Streaming-first managed services
Deepgram Nova: lowest streaming latency (~200ms). AssemblyAI: best diarization. Google STT: best language coverage. Pay-per-minute or per-character. Right for low-effort production deployment.
Speaker diarization
Who said what. Pyannote (open), AssemblyAI (managed). 95%+ accuracy on clean 2-speaker audio. Drops fast with crosstalk, accents, far-field mics. Plan post-processing or human review for sensitive use cases.