Flink's checkpointing is the foundation of its exactly-once guarantees. Misconfigured checkpoints inflate overhead and latency (too frequent), force long replays after a failure (too infrequent), or pile up and time out under backpressure (too slow). Knowing the dials matters.
Frequency vs latency
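For low-latency streaming, a checkpoint interval of 1-10 s is a common starting point. Work buffers up between checkpoints: incremental checkpoints upload whatever state changed since the previous one, and transactional (exactly-once) sinks commit their output only when a checkpoint completes, so the interval also sets a floor on end-to-end latency for those sinks. Longer intervals mean more data to replay on failure but lower steady-state overhead.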
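The trade-off can be put into rough numbers. The sketch below is plain Java, not a Flink API: the class name, method names, and the input rate and checkpoint duration are all assumptions chosen for illustration. It assumes the interval is measured from trigger to trigger and that a failure can strike just before the next checkpoint completes.

```java
import java.time.Duration;

/** Back-of-the-envelope model of the checkpoint-interval trade-off (hypothetical helper, not Flink code). */
final class CheckpointBudget {

    /**
     * Worst case: the job fails just before the next checkpoint completes, so everything
     * since the last completed checkpoint was triggered (one interval plus one checkpoint
     * duration of input) has to be replayed.
     */
    static long worstCaseReplayRecords(Duration interval, Duration checkpointDuration, long recordsPerSecond) {
        long exposedMillis = interval.toMillis() + checkpointDuration.toMillis();
        return exposedMillis * recordsPerSecond / 1000;
    }

    /** Share of wall-clock time during which a checkpoint is in flight, capped at 1.0. */
    static double checkpointingFraction(Duration interval, Duration checkpointDuration) {
        double ratio = (double) checkpointDuration.toMillis() / interval.toMillis();
        return Math.min(1.0, ratio);
    }

    public static void main(String[] args) {
        // Assumed workload: 50,000 records/s, each checkpoint takes 2 s to complete.
        Duration checkpointDuration = Duration.ofSeconds(2);
        long recordsPerSecond = 50_000;
        for (long seconds : new long[] {1, 10, 60}) {
            Duration interval = Duration.ofSeconds(seconds);
            System.out.printf("interval=%ds replay<=%d records, checkpointing %.0f%% of the time%n",
                    seconds,
                    worstCaseReplayRecords(interval, checkpointDuration, recordsPerSecond),
                    100 * checkpointingFraction(interval, checkpointDuration));
        }
    }
}
```

Under these assumed numbers, a 10 s interval leaves at most 600,000 records to replay while a checkpoint is in flight 20% of the time. In a real job the interval is passed in milliseconds to `StreamExecutionEnvironment.enableCheckpointing(long)`, and `CheckpointConfig.setMinPauseBetweenCheckpoints(long)` keeps slow checkpoints from running back to back.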
Aligned vs unaligned
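Aligned: an operator waits for the barrier on every input before it snapshots its state. The snapshot is clean (it contains no in-flight records), but under backpressure barriers queue behind buffered data and the checkpoint stalls. Unaligned: barriers overtake buffered records, so checkpoints complete quickly under load, at the cost of a larger checkpoint because the overtaken in-flight records are saved with it.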
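A toy model may make the difference concrete. The sketch below is plain Java, not Flink code: `BarrierDemo`, its methods, and the `BARRIER` marker are hypothetical names, and the model covers only queued input records for a single operator (real unaligned checkpoints also persist output buffers). The same queued records show up as waiting time in the aligned case and as extra checkpoint data in the unaligned case.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one checkpoint barrier reaching a multi-input operator (hypothetical, not Flink code). */
final class BarrierDemo {
    static final String BARRIER = "BARRIER";

    /** Records queued ahead of the barrier on a single input channel. */
    static List<String> recordsBeforeBarrier(List<String> channel) {
        int barrierIndex = channel.indexOf(BARRIER);
        if (barrierIndex < 0) {
            throw new IllegalArgumentException("channel has no barrier");
        }
        return new ArrayList<>(channel.subList(0, barrierIndex));
    }

    /**
     * Aligned: the operator may snapshot only after consuming every record that sits
     * ahead of the barrier on every channel. Returns how many records that is.
     */
    static int alignedRecordsToProcessFirst(List<List<String>> channels) {
        int count = 0;
        for (List<String> channel : channels) {
            count += recordsBeforeBarrier(channel).size();
        }
        return count;
    }

    /**
     * Unaligned: the barrier jumps the queue and the operator snapshots at once;
     * the overtaken records are stored in the checkpoint as in-flight data.
     */
    static List<String> unalignedInFlightData(List<List<String>> channels) {
        List<String> inFlight = new ArrayList<>();
        for (List<String> channel : channels) {
            inFlight.addAll(recordsBeforeBarrier(channel));
        }
        return inFlight;
    }

    public static void main(String[] args) {
        // Channel B is backed up: three records are queued ahead of its barrier.
        List<List<String>> channels = List.of(
                List.of("a1", BARRIER, "a2"),
                List.of("b1", "b2", "b3", BARRIER));
        System.out.println("aligned: process " + alignedRecordsToProcessFirst(channels) + " records before snapshotting");
        System.out.println("unaligned: snapshot now, persist in-flight " + unalignedInFlightData(channels));
    }
}
```

In an actual job the switch is `CheckpointConfig.enableUnalignedCheckpoints()`; whether the extra checkpoint size is worth it depends on how often the pipeline runs under backpressure.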
State backend choice
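RocksDB: disk-backed, suited to large state that does not fit in memory; slower per access because state is serialized on every read and write. In-memory (heap): fast, suited to small state that fits in the TaskManager heap. Durability is a separate setting from the backend: either backend writes its checkpoints to distributed storage (S3, HDFS) so that state can be restored after a failure. Pick the backend by state size; switching later typically means taking a savepoint, stopping the job, and restarting on the new backend, which requires application downtime.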
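The choice can be pictured as a handful of configuration entries. The sketch below is plain Java that only assembles key/value pairs: `StateConfigSketch` and `stateSettings` are hypothetical, the bucket path is a placeholder, and the key names are the ones used in Flink 1.13-era configuration files as far as I know. Later releases renamed some of them, so check the documentation for the version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Assembles state-related settings as key/value pairs (hypothetical helper, not a Flink API). */
final class StateConfigSketch {

    /**
     * Picks the backend by whether state fits in the heap, and always sets a durable
     * checkpoint directory, since durability does not depend on the backend.
     * Key names are assumed from Flink 1.13-era configuration and may differ in other versions.
     */
    static Map<String, String> stateSettings(boolean stateFitsInHeap, String checkpointDir) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("state.backend", stateFitsInHeap ? "hashmap" : "rocksdb");
        conf.put("state.checkpoints.dir", checkpointDir);
        if (!stateFitsInHeap) {
            // Incremental checkpoints upload only changed RocksDB files.
            conf.put("state.backend.incremental", "true");
        }
        return conf;
    }

    public static void main(String[] args) {
        // Placeholder bucket; substitute the real checkpoint location.
        Map<String, String> conf = stateSettings(false, "s3://example-bucket/checkpoints");
        conf.forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```

The printed lines have the `key: value` shape of a Flink configuration file. The helper reduces "pick by state size" to a single boolean on purpose; a real decision would also weigh access latency and available memory.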