Any real system needs to rebalance. Adding capacity. Splitting hotspots. Retiring failed shards. Done poorly = downtime. Done well = invisible.

Triggeradd shard / hotspotPlanwhich ranges moveCopybackground transferCutoverflip metadataDual-writewrite to old + newBackfillcatch up new shardRead newswitchCleanup phase: verify integrity → drop old data → free capacity
Shard rebalance: plan → dual-write → backfill → cutover → cleanup
Advertisement

Dual-write window

Start writing to both old + new shard. Reads still from old. Zero data loss.

Dual-write window

Start writing to both old + new shard. Reads still from old. Zero data loss.

Advertisement

Backfill in background

Copy old data → new shard while dual-writes continue. Rate-limit to avoid impact.

Verify + flip

Compare row counts, sample data. When confident, atomically flip metadata so reads go to new shard.

Read-new + write-both

Interim: reads from new. Writes still to both (in case rollback needed). Monitor.

Cleanup

Once confident, stop writing to old. Delete rows moved. Reclaim capacity.

Dual-write → backfill → verify → cutover → cleanup. Zero downtime, roll-back-able.