Why does sharding by TweetID solve the problem of hot users?

Sherry_Liu · December 10, 2021, 4:14am

In the article, it mentions:

Sharding by UserID:
“What if a user becomes hot? There could be a lot of queries on the server holding the user. This high load will affect the performance of our service.”
Sharding by TweetID:
“This approach solves the problem of hot users, Sharding based on TweetID”

I do not understand why sharding by TweetID solves the problem of hot users. If I understand it correctly, when a user becomes hot, a lot of queries from this user’s followers will hit the one DB server holding that user, but that is still only part of the whole traffic. When sharding by TweetID, we need to query all DB servers for every request, so every single DB server has to serve the whole traffic. “Part of the whole traffic” vs “the whole traffic”. So sharding by TweetID does not reduce the amount of traffic on a single server; it makes things worse. Please correct me if I am wrong.

Course: https://www.educative.io/collection/5668639101419520/5649050225344512
Lesson: https://www.educative.io/collection/page/5668639101419520/5649050225344512/5741031244955648

Sherry_Liu · December 14, 2021, 1:08am

Can someone help? I appreciate it. @Design_Gurus

Adam_Almos_Homolya · December 18, 2021, 11:13am

My interpretation is: if you shard by UserID, all tweets of a particular user are handled by a single shard. If that user is Lady Gaga, that server will be overloaded because her tweets are extremely popular. If you shard by TweetID instead, Lady Gaga’s tweets will likely be spread evenly across shards, so every server gets a fair share of the traffic. As the article says, the downside of this approach is having to query all shards when you want to fetch the latest tweets. IMO the approach suggested here is rather amateur; in practice we solve this with a more sophisticated sharding mechanism that dynamically splits key ranges based on the load across servers.
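
To make the trade-off concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration and not taken from the article: the four shards, the modulo placement in `shard_for`, the made-up read counts, and user 7 standing in for the hot user. It counts how many reads land on each shard under both schemes, and how many shards must be queried to collect the hot user’s tweets.

```python
from typing import Callable, List, NamedTuple

NUM_SHARDS = 4  # hypothetical cluster size


class Tweet(NamedTuple):
    tweet_id: int
    user_id: int
    reads: int  # made-up number of times this tweet is read


def shard_for(key: int) -> int:
    # Simple modulo placement; a real system would usually hash the key first.
    return key % NUM_SHARDS


def load_per_shard(tweets: List[Tweet], key_of: Callable[[Tweet], int]) -> List[int]:
    # Sum the reads that land on each shard for a given choice of shard key.
    load = [0] * NUM_SHARDS
    for tweet in tweets:
        load[shard_for(key_of(tweet))] += tweet.reads
    return load


HOT_USER = 7  # hypothetical hot user

# Eight very popular tweets from the hot user.
tweets = [Tweet(tweet_id, HOT_USER, 1000) for tweet_id in range(100, 108)]
# Eight ordinary tweets spread over users 0..3.
tweets += [Tweet(tweet_id, (tweet_id - 108) % 4, 10) for tweet_id in range(108, 116)]

by_user = load_per_shard(tweets, lambda t: t.user_id)
by_tweet = load_per_shard(tweets, lambda t: t.tweet_id)

# Shards that must be queried to collect all of the hot user's tweets.
fanout_by_user = {shard_for(t.user_id) for t in tweets if t.user_id == HOT_USER}
fanout_by_tweet = {shard_for(t.tweet_id) for t in tweets if t.user_id == HOT_USER}

print("reads per shard, sharded by UserID: ", by_user)
print("reads per shard, sharded by TweetID:", by_tweet)
print("shards queried for the hot user:", len(fanout_by_user), "vs", len(fanout_by_tweet))
```

The total number of reads is the same in both runs. Sharding by UserID piles almost all of them onto one shard, while sharding by TweetID spreads them out at the cost of touching every shard to assemble one user’s tweets, which is the fan-out the original question is worried about.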