The Problem with Modulo Hashing
Imagine a cluster with 4 nodes. Key 'A' hashes to 10. 10 % 4 = 2, so it goes to Node 2. If we add a 5th node, 10 % 5 = 0. The key moves to Node 0. In fact, nearly N/(N+1) of keys will move.
Advertisement
Enter Consistent Hashing
Consistent Hashing solves this by decoupling the data partition from the number of physical nodes. It maps both data and nodes onto a circular "ring" (conceptually 0 to 2^32 or similar).
How it Works
- The Token Ring: The output range of the hash function is treated as a fixed circular space.
- Node Placement: Each node is assigned a random value (token) on this ring.
- Data Placement: A key is hashed to a value on the ring. We walk clockwise to find the first node with a token greater than the key's hash. That node owns the key.
Advertisement
Virtual Nodes (VNodes)
Basic consistent hashing can still lead to "hot spots" if nodes aren't evenly spaced or if one node is more powerful. Virtual Nodes fix this by assigning multiple tokens to a single physical node. This spreads the load more evenly and speeds up rebalancing.
Interactive Visualization: Add/Remove nodes to see how keys redistribute.
Benefits
- Minimal Data Movement: Adding/removing a node only affects its immediate neighbors (or VNode ranges).
- Scalability: You can scale horizontally with predictable performance impact.
- Heterogeneity: Powerful nodes can hold more VNodes.