educative.io

Educative

Question about hash partition in designing typeahead suggestion topics

c. Partition based on the hash of the term: Each term will be passed to a hash function, which will generate a server number and we will store the term on that server. This will make our term distribution random and hence minimize hotspots. To find typeahead suggestions for a term we have to ask all the servers and then aggregate the results. We have to use consistent hashing for fault tolerance and load distribution.

Q: since the hash function generate only one server number and we store the term on that server, what does it mean by that “we have to ask all the servers for typeahead suggestions”?

3 Likes

Thanks for reaching out to us. The author of the course will be answering your query soon.

The last couple of sentences describe, how we will “query” for the typeahead suggestion.

While writing, we pass the term to the hash function to get a server and store it there.

While reading, we have to query all servers because we don’t know what servers will have terms for a certain prefix. In other terms, we can easily find where a term, say “design”, will be stored but we can’t find all servers storing terms that start with “de”.

Hope this answered your question.

2 Likes

I think this point could be made less confusing if we differentiate a “term” from a “prefix”, i.e. to find typeahead suggestions for a “prefix” we have to ask all the servers.

Hi @Design_Gurus,

This partitioning scheme seems good from a general server load and avoiding hotspots perspective.
But, won’t it affect the efficiency of the individual Tries themselves?

For eg. suppose we have 2 Tries, and they are partitioned according to the scheme 2
(a) storing “cap, captain, capital, caps, cat, curiosity”
(b) storing “sweet, sweat, swear, sugar, syrup, serene”

These are efficient in themselves for performing searches, because they can be compressed. Searches like “cap” are lightning fast on the first Trie. While the second Trie is extremely good at performing searches like “swe”.

If we use the randomization by hash scheme we could end up with something like
(a) “cap, sweet, cat, sugar, capital, sweat”
(b) “syrup, serene, captain, caps, swear, curiosity”

We can’t optimize any of these Tries for individual searches (they will be less compressed plus more tree walkdown time in terms of number of nodes. As avg depth increases, so does the time complexity).
Combine that with the time delay required to aggregate the results, and we are looking at a nightmare here.

What do you think could be done to improve this?

@Yash_Nasery
I think you are right about the shortcoming of this partitioning approach.
But I think what we are expected to do at this point is to disccuss the pros and cons with the client rather than just presenting the best approach . so this approach would be a good to have discussion which will lead us to a better approach - " b. Partition based on the maximum capacity of the server" in my opinion.

@himani_agarwal, yes that’s the same thing I thought too.

about this one umm, if you tell the “client” (quotes because there might not be any - maybe we just have end users), these 2 approaches, it might just confuse them as to which one to use (again that’s just my opinion).
Maybe a scheme which combines the two approaches or something might work better. What do you think?

@Yash_Nasery I think Partition based on the maximum capacity of the server will solve our problem since the only problem with it highload/hot partition can be resolved with consistent hashing. System like cassandra are already providing effectively.

About the hybrid approach, I myself am not able to think of how we could use both word and tweetid to counter the hotspot or latency problem… but I am very intrested to know how can it be done.

hotspot cannot be solve @himani_agarwal. See if a word like “CAP” is becoming hot, then since its hash is same, always request is going to same server.So max capacity based partitioning (b) is better here because we know which servers to hit instead of hitting all servers.
Let us say cap becomes hit and we are getting million of requests, if we hit one server we can get all data from there but if we distribute, we need to make parallel calls to each different server,still we need to make 1 million parallel calls,although data returned by each is less.But that is a far use case.SO i will prefer max server capacity based sharding

1 Like

In the second approach , what about the case in which the partition becomes full and then a new search query come how are we going to proceed since partition is full. do we assign another partition to it and add entry in LB like.

e.g “CAP” all terms will go to Partition 1. Now if Partition become full. we assign this query to partition2 and add entry in LB like “CAP” : partition1 , partition 2. ?

But this is only if we don’t store the top K data within each node. If we do store the top K Data within each node, then we don’t need to query all servers. This is my understanding