Why is sharding by tweet-id better than sharding by word?

Kartikay · November 8, 2020, 10:58am

The author argues that if we shard by word, some “hot” words could make their shards much larger than the others. However, the index could always cap the number of tweets stored for a particular word (e.g. keep them sorted by tweet popularity, and do not return lower-priority tweets in the search).
In my opinion, the disadvantage of sharding by tweet-id is that we optimize tweet storage (we know exactly which shard a tweet should be indexed in) at the cost of tweet search (we have to query all shards). I would have thought that optimizing for search is better in this case.
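To make the trade-off concrete, here is a minimal sketch of both schemes over an in-memory index. Everything in it is an assumption for illustration: the shard count `NUM_SHARDS`, the use of `zlib.crc32` as the word hash, and all function names are hypothetical and not taken from the author's design.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical cluster size, chosen only for the example


def new_shards():
    # Each shard is its own inverted index: word -> set of tweet ids.
    return [defaultdict(set) for _ in range(NUM_SHARDS)]


def word_shard(word):
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(word.encode("utf-8")) % NUM_SHARDS


def index_by_tweet_id(shards, tweet_id, text):
    # Write path touches exactly one shard, picked from the tweet id.
    shard_no = tweet_id % NUM_SHARDS
    for word in set(text.lower().split()):
        shards[shard_no][word].add(tweet_id)
    return {shard_no}


def index_by_word(shards, tweet_id, text):
    # Write path may touch one shard per distinct word in the tweet.
    touched = set()
    for word in set(text.lower().split()):
        shard_no = word_shard(word)
        shards[shard_no][word].add(tweet_id)
        touched.add(shard_no)
    return touched


def search_by_tweet_id(shards, word):
    # Read path must ask every shard and merge the answers.
    results = set()
    for shard in shards:
        results |= shard.get(word, set())
    return results, len(shards)


def search_by_word(shards, word):
    # Read path asks only the shard that owns the word.
    return set(shards[word_shard(word)].get(word, set())), 1
```

Both searches return the same tweet ids; the second element of each result is the number of shards queried, which is where the two schemes differ. Note that a "hot" word still lands entirely on one shard under `index_by_word`, which is the skew the author is worried about.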
Any opinions?

rahul9 · February 13, 2021, 7:48pm

Agree. We should shard by the word itself. A word can potentially have thousands of tweet ids, and they take the same space whether we store them in different shards or in the same shard. Since a word can obviously have a lot of tweet ids corresponding to it, I think we should only store the popular ones (e.g. those with some likes, retweets, etc.). Tweets from other users that use the same word but have few or no likes need not be kept once we cross a limit of, say, 2000 tweet ids per word. Storing every tweet just does not make sense.
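A rough sketch of such a capped per-word list, assuming popularity is a single number (for example likes plus retweets) and using a min-heap so the least popular entry is the one evicted. The class name `CappedPostingList` and its methods are made up for illustration; the 2000 default is just the limit suggested above.

```python
import heapq


class CappedPostingList:
    """Keeps at most `limit` tweet ids for one word, preferring popular tweets."""

    def __init__(self, limit=2000):
        self.limit = limit
        self._heap = []  # min-heap of (popularity, tweet_id)

    def add(self, tweet_id, popularity):
        entry = (popularity, tweet_id)
        if len(self._heap) < self.limit:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # List is full: evict the least popular entry to make room.
            heapq.heapreplace(self._heap, entry)
        # Otherwise the tweet is less popular than everything stored; drop it.

    def top(self):
        # Tweet ids ordered from most to least popular.
        return [tweet_id for _, tweet_id in sorted(self._heap, reverse=True)]
```

One caveat with this sketch: it scores a tweet once at insert time, so a tweet that becomes popular later would need to be re-added with its new score.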

AmSh · February 18, 2021, 3:11am

Yes … search should be optimized ahead of writes, since the write ratio here is lower than the read ratio. You can expect more read requests than write requests.