
Virtual Replicas in Consistent Hashing

Hi - I don’t understand the virtual replica concept in Consistent Hashing. I have pasted the relevant portion from that topic below. Could you please explain this?

"For load balancing, as we discussed in the beginning, the real data is essentially randomly distributed and thus may not be uniform. It may make the keys on caches unbalanced.

To handle this issue, we add “virtual replicas” for caches. Instead of mapping each cache to a single point on the ring, we map it to multiple points on the ring, i.e. replicas. This way, each cache is associated with multiple portions of the ring."


Hi Ravi,

I’m Arqam from Educative. Happy to hear from you.

We will get back to you as soon as possible.

Regards,

Arqam

When are you getting back to him?

Are you a bot? No reply, just a spam message.

Consistent Hashing uses ‘replication’ of data to distribute traffic load and to handle node-failure scenarios. There are many ways to implement replication in Consistent Hashing. One way is to keep a copy of the data on the ‘next’ servers on the ring (or on any other server). In such cases, the key or the hash is mutated in a predictable manner to find these secondary servers (or virtual replicas). There are many clever ways to derive multiple nodes from one key.
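Here is a minimal sketch of that key-mutation idea in Python. The server names, the MD5 hash, and the `#copy{i}` mutation scheme are all assumptions for illustration, not the course’s implementation:

```python
import bisect
import hashlib

def h(x: str) -> int:
    # Any uniformly distributed hash works; MD5 is just an assumption here.
    return int(hashlib.md5(x.encode()).hexdigest(), 16)

# One point per server, kept sorted so we can bisect around the ring.
ring = sorted((h(s), s) for s in ["server-a", "server-b", "server-c"])

def owner(key: str) -> str:
    # The server owning a key is the clockwise successor of h(key);
    # the modulo handles wrap-around past the last point.
    i = bisect.bisect(ring, (h(key),)) % len(ring)
    return ring[i][1]

def replica_owners(key: str, copies: int = 2):
    # Mutate the key in a predictable way so each copy (usually)
    # hashes to a different position, and hence a different server.
    return [owner(f"{key}#copy{i}") for i in range(copies)]

print(replica_owners("user:42"))
```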

Take a look at the post, especially the ‘Replication’ part: https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8

Hope this helps.

To create virtual replicas, we generally use more than one hash function to place the nodes around the ring. Let’s start from the beginning: say we have a hash function h(x), where x is the IP of the node. We get the node’s location on the ring as h(x) % m, where m is the upper limit on the hash output. This creates the ring, and requests can then be mapped to their clockwise successors.
Now, the load can easily get skewed because we might have only a few servers, which is why we introduce virtual replicas.
A virtual replica simply means that instead of using h(x) alone, we use a couple of hash functions, say h(x), h1(x), and h2(x). These three hash functions give you three distinct points on the ring, so one node is now placed thrice. Your ring ends up with many more points, which in turn reduces load skew.
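As a rough Python sketch of this: here the extra hash functions h1, h2 are derived by salting a single base hash, a common stand-in for a true family of independent functions, and m is implicitly 2**128, the MD5 output range (all assumptions for illustration):

```python
import bisect
import hashlib

VNODES = 3  # points per physical node: h(x), h1(x), h2(x) in the text

def h_i(x: str, i: int) -> int:
    # Derive the i-th hash function from one base hash by salting the
    # input; an assumption standing in for independent h, h1, h2, ...
    return int(hashlib.md5(f"{x}#{i}".encode()).hexdigest(), 16)

def build_ring(nodes):
    # Place every node at VNODES distinct points, then sort to form the ring.
    ring = [(h_i(node, i), node) for node in nodes for i in range(VNODES)]
    ring.sort()
    return ring

def lookup(ring, key: str) -> str:
    # A key is served by the clockwise successor of its hash point.
    i = bisect.bisect(ring, (h_i(key, 0),)) % len(ring)
    return ring[i][1]

ring = build_ring(["10.0.0.1", "10.0.0.2"])
print(lookup(ring, "some-request"))
```

With more points per node, the arc lengths between points even out, so each physical node serves a more balanced share of the keys.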
There is more logic to this than just hashing. For example, if you are using a DHT for storage (e.g., Cassandra), you might want to replicate data across successors, as in the sketch below.
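A sketch of that successor-style replication, reusing the sorted (point, node) ring from the snippet above (again, just an illustration; Cassandra’s actual placement strategies are more involved):

```python
import bisect
import hashlib

def h_i(x: str, i: int) -> int:
    return int(hashlib.md5(f"{x}#{i}".encode()).hexdigest(), 16)

def successor_replicas(ring, key: str, n: int = 2):
    # Walk clockwise from the key's point and collect the next n
    # distinct physical nodes; with virtual replicas, consecutive
    # points can belong to the same node, so repeats are skipped.
    start = bisect.bisect(ring, (h_i(key, 0),))
    nodes = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes
```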
Hope I was able to clarify.


I agree that the original article is a bit confusing, primarily because of the word ‘cache’. It would probably be better to avoid the “caching servers” analogy and call them ‘servers’ or ‘nodes’. After all, the Consistent Hashing strategy is much bigger than just an approach to caching.