Question Regarding : Partition based on the hash of the term

Saurabh · July 27, 2020, 2:29pm

In Partition based on the hash of the term approach, let’s say 2 requests for CAP and CAPITOL are coming, as the hashed value of both the terms will be different so the request can go to 2 different servers. This means trie starting with CAP needs to be stored in both the servers.

Basically, we will need to store complete trie in all the servers, and search throughput will be equally divided based on hash key.

Pleasse correct me if I am wrong.

Design_Gurus · June 11, 2019, 8:18pm

Hi @Saurabh,

Thanks for posting the question.

First thing first, you are right ‘CAP’ and ‘CAPITOL’ will have different ‘hash’ so they will be stored on two different servers. Now, if a query comes for ‘CA’ (or ‘CAP’), as mentioned in the chapter, the query has to go to all the servers, and then we have to aggregate the results:

To find typeahead suggestions for a term we have to ask all the servers and then aggregate the results.

So, we will not be storing the whole Trie on all the servers - actually, we can’t store the whole Trie on a server that is why we are partitioning!

As you can understand querying all servers has disadvantages and can have failures issues like some server is down, slow or hot. We need to present these pros/cons of each approach in the interview. For each ‘con’ try presenting a solution, and develop on it.

Hope this clarifies your question.

Saurabh · June 12, 2019, 3:41am

Thanks for the explanation. I was confused about how are we partitioning the trie and storing across all servers.
It is clear now.

007_win · July 8, 2019, 5:35am

Regarding twitter design, https://www.educative.io/collection/page/5668639101419520/5649050225344512/5738600293466112, while discussing about “Sharding based on words”, in the last line it says that Consistent Hashing can be used to overcome the issue of “Search Hot words”(people are searching the words frequently) and “Status Hot Words” (People are using the word in the status frequently). My question is how can Consistent Hashing technique help here?
In consistent hashing if we hash the word to get an index on the ring since hash function does not get changed if multiple people search the word will end up same server until and unless we are adding any other attribute to hash function. So, how the “Search Hot words” problem gets solved?
Similarly, in key-value pair where key is word and value are statusIDs. if any key gets popular in statuses will result more statusIDs in values. Again how consistent hashing will help here?

Design_Gurus · July 9, 2019, 11:06pm

Hi @007_win,

Consistent Hashing can use ‘replication’ of data to distribute read loads. There are many ways to implement replication in Consistent Hashing. One way will be to keep a copy of the data on the ‘next’ servers on the ring or to any random server. In such cases, the key or the hash is mutated in a predictable manner to find these secondary servers. There exist many intelligent ways to select multiple nodes from one key.

Take a look at the post, especially the ‘Replication’ part: https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8

Hope this helps.

Neil · March 3, 2021, 4:30am

To quey “CA” on all server and then aggregate them.

Does this mean we have a smaller trie on each server?

Design_Gurus · March 9, 2021, 9:36pm

Yes, each server will store a part of the tree.

rahul9 · March 12, 2021, 6:39pm

@Design_Gurus ,in almost all design chapters you said we will use aggregation like get tweet by use, but here aggregation become a con?Why so?

Athena · May 26, 2021, 2:02am

@Design_Gurus , What if for read we create a hash table by an in-memory storage like redis, so that the key will be prefix and the value will be top K suggestions. In this way for read we don’t need to hit the trie servers and by hashing the prefix we can directly read top K suggestions (no need for aggregator for read). Only at write we update tries at some intervals and then update the hash table in redis.