I see a couple of problems in the last
-
You said we shard by word hash, that this can create a hot-word problem, and that we use consistent hashing to deal with it.
The problem is that consistent hashing does not help with hot words: if a word is hot, all of its traffic still goes to a single shard even with consistent hashing, and you cannot avoid that.
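To make this point concrete, here is a minimal sketch of a consistent-hash ring. The shard names, the `HashRing` class, the `stable_hash` helper, and the query stream are all made up for illustration, not taken from any real system. It only shows that a given word always maps to the same shard, so a hot word's whole load lands there.

```python
import bisect
import hashlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, shards, vnodes=100):
        # Each shard is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._points)
        return self._ring[idx][1]


ring = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])

# Hypothetical query stream: 1,000 lookups of one hot word plus 100 distinct rare words.
queries = ["cricket"] * 1000 + [f"rare-word-{i}" for i in range(100)]
load = Counter(ring.shard_for(word) for word in queries)

hot_shard = ring.shard_for("cricket")
print(hot_shard, dict(load))
```

Whichever shard `cricket` hashes to receives at least 1,000 of the 1,100 lookups. Virtual nodes even out how many keys each shard owns, but they cannot split the traffic of a single key.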
Even if you shard by tweet id, a word can still be hot, and sharding by id does not solve that either.
So this is confusing.
What I think is this: hashing with the word as the key is a problem because, if we add or remove servers dynamically, the data becomes unbalanced, and that is why we use consistent hashing. The other main reason is that some words are used far more than others, and if we store everything for such a word on one server, that server gets an unbalanced load. Consistent hashing cannot handle that, and that is why we shard by tweet id: if a word appears in 100 tweets, each tweet id is unique, so those tweets are likely to be spread across different shards.
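Here is a rough sketch of the tweet-id idea, assuming a hypothetical setup with 4 shards and simple modulo placement (a real system might hash the id first). Each shard keeps its own inverted index, so the postings for one hot word are spread across shards. The trade-off is that a search has to ask every shard and merge the results.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count


def shard_for_tweet(tweet_id: int) -> int:
    # Simple modulo placement, assumed here only for illustration.
    return tweet_id % NUM_SHARDS


# One inverted index (word -> set of tweet ids) per shard.
shards = [defaultdict(set) for _ in range(NUM_SHARDS)]


def index_tweet(tweet_id: int, text: str) -> None:
    index = shards[shard_for_tweet(tweet_id)]
    for word in text.lower().split():
        index[word].add(tweet_id)


def search(word: str) -> set:
    # Scatter-gather: any shard may hold tweets containing the word.
    result = set()
    for index in shards:
        result |= index.get(word, set())
    return result


# 100 hypothetical tweets that all contain the same hot word.
for tweet_id in range(100):
    index_tweet(tweet_id, f"cricket update number {tweet_id}")

per_shard = [len(index.get("cricket", set())) for index in shards]
print(per_shard)               # [25, 25, 25, 25]
print(len(search("cricket")))  # 100
```

With ids 0 to 99 and 4 shards, each shard ends up with 25 of the 100 tweets for the hot word, instead of one shard holding all 100.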