Flink jobs are stateful: windows, joins, and aggregations all keep state per key. The state backend decides where that state lives. The in-memory (HashMap) backend is fast but bounded by heap size; RocksDB scales past RAM but adds latency to every access. Picking the wrong one leads to out-of-memory failures or lost throughput.
HashMap (in-memory) backend
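State is held as Java objects on the JVM heap, in hash-map structures. Reads and writes touch those objects directly, with no serialization on the access path, so each access is very cheap. The limit is the heap: all state on a TaskManager has to fit in memory, and a large heap full of state objects adds garbage-collection pressure.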
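
Whether state fits on the heap can be checked with back-of-envelope arithmetic before committing to this backend. The sketch below is only an illustration: the 100-byte per-entry overhead, the 50% heap budget, and the example job sizes are assumptions, not Flink constants, and should be replaced with numbers measured on a real job.

```python
GIB = 1024 ** 3


def estimate_heap_state_bytes(num_keys, payload_bytes_per_key,
                              overhead_bytes_per_entry=100):
    """Rough heap footprint of keyed state held as Java objects.

    overhead_bytes_per_entry is an assumed allowance for object headers,
    map entries and boxing; calibrate it against a heap dump.
    """
    return num_keys * (payload_bytes_per_key + overhead_bytes_per_entry)


def fits_in_heap(state_bytes, heap_bytes, max_state_fraction=0.5):
    """True if state stays under an assumed safe share of the heap."""
    return state_bytes <= heap_bytes * max_state_fraction


# Hypothetical numbers: keys held by one TaskManager with a 16 GiB heap.
small_job = estimate_heap_state_bytes(5_000_000, 200)
large_job = estimate_heap_state_bytes(50_000_000, 200)

print(small_job, fits_in_heap(small_job, 16 * GIB))
print(large_job, fits_in_heap(large_job, 16 * GIB))
```

Under these assumptions the 5-million-key job fits comfortably and the 50-million-key job does not, which is the point at which RocksDB becomes the safer choice.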
RocksDB backend
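State is stored in an embedded RocksDB instance on local disk, ideally SSD. Every read and write is serialized and may hit disk, so access costs on the order of ~100µs (vs nanoseconds for heap objects). State size is bounded by local disk rather than RAM, so it scales to TBs. Checkpoints can be incremental: with incremental checkpointing enabled (state.backend.incremental: true in Flink 1.x configs; it is not on by default), each checkpoint uploads only the SSTable files created since the previous one to the object store. That keeps checkpoints fast when most of the state has not changed.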
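
The reason incremental checkpoints transfer so little is that SSTable files are immutable: once a file has been uploaded, later checkpoints can reference it instead of sending it again. The sketch below models that idea as a set difference. It is a conceptual illustration, not Flink's implementation, and the file names are made up.

```python
def files_to_upload(current_sst_files, previously_uploaded):
    """SST files that still need uploading for this checkpoint.

    Files already uploaded for an earlier checkpoint are skipped,
    because an SST file never changes after it is written.
    """
    return sorted(set(current_sst_files) - set(previously_uploaded))


# Hypothetical file names, for illustration only.
checkpoint_1 = ["000010.sst", "000011.sst", "000012.sst"]

# Before checkpoint 2, compaction merged 000010 and 000011 into 000013,
# and a memtable flush produced 000014.
checkpoint_2 = ["000012.sst", "000013.sst", "000014.sst"]

first_upload = files_to_upload(checkpoint_1, [])
second_upload = files_to_upload(checkpoint_2, checkpoint_1)

print(first_upload)   # ['000010.sst', '000011.sst', '000012.sst']
print(second_upload)  # ['000013.sst', '000014.sst']
```

Note that compaction rewrites files, so a checkpoint taken right after heavy compaction can upload more than the volume of newly written state would suggest.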
Switching at scale
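Many production jobs start on HashMap and migrate to RocksDB when state grows. The migration is operationally non-trivial. Checkpoints are backend-specific, so the switch has to go through a savepoint in the canonical format (supported since Flink 1.13), and the state serializers must remain compatible. In practice that means a stop-with-savepoint, a configuration change, a restore, and a fresh round of performance checks. Easier: start on RocksDB from day one, even for small jobs.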
Tuning RocksDB
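The default RocksDB configuration is conservative. Three settings are worth a look for Flink. Block cache: by default Flink manages RocksDB memory itself (state.backend.rocksdb.memory.managed: true), and in that mode state.backend.rocksdb.block.cache-size is ignored because the cache is sized from the slot's managed memory. To keep hot keys in memory, either give RocksDB more managed memory (taskmanager.memory.managed.fraction or taskmanager.memory.managed.size), or turn managed memory off and set state.backend.rocksdb.block.cache-size to 1GB+ yourself. Background threads: state.backend.rocksdb.thread.num caps concurrent flush and compaction jobs; raise it toward the number of available cores if compaction falls behind. Compression: lz4 for speed or zstd for size. The exact option key depends on the Flink version (recent releases expose it per level as state.backend.rocksdb.compression.per.level), so check the configuration reference for yours.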
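
When Flink manages RocksDB memory, the cache size follows from the managed memory budget rather than from a direct setting, so it helps to work out what a given fraction actually yields per slot. The sketch below is a simplified model under stated assumptions: managed memory is split evenly across slots, and write_buffer_ratio (standing in for state.backend.rocksdb.memory.write-buffer-ratio, assumed here to be 0.5) is the share reserved for write buffers. The real accounting has more detail, and the TaskManager sizes are hypothetical.

```python
def rocksdb_budget_per_slot(flink_memory_mib, managed_fraction, slots,
                            write_buffer_ratio=0.5):
    """Approximate RocksDB memory available to one task slot, in MiB.

    Simplified model: managed memory is divided evenly across slots,
    and whatever is not reserved for write buffers is treated as cache.
    """
    managed_mib = flink_memory_mib * managed_fraction
    per_slot_mib = managed_mib / slots
    write_buffer_mib = per_slot_mib * write_buffer_ratio
    return {
        "per_slot_mib": per_slot_mib,
        "write_buffer_mib": write_buffer_mib,
        "cache_mib": per_slot_mib - write_buffer_mib,
    }


# Hypothetical TaskManager: 8 GiB of Flink memory, 4 slots.
default_budget = rocksdb_budget_per_slot(8192, 0.4, 4)
raised_budget = rocksdb_budget_per_slot(8192, 0.6, 4)

for name, budget in [("fraction 0.4", default_budget),
                     ("fraction 0.6", raised_budget)]:
    print(name, {k: round(v, 1) for k, v in budget.items()})
```

In this model, reaching a 1 GiB cache per slot takes either a larger TaskManager, fewer slots per TaskManager, or a higher managed fraction; raising the fraction takes memory away from the heap, so it is a trade rather than a free win.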
When state backend isn't the bottleneck
If your throughput is limited by deserialization, network shuffle, or sink writes, the state backend choice matters little. Profile first. Switching from HashMap to RocksDB will not fix a CPU-bound job, and it can make one worse, because RocksDB adds serialization work to every state access.