educative.io

Not able to undertand , please explain Sharding Question -

How does creating a tweet id by using epoch time and auto sequence number will make read fast?
In this case also, we have to query all the servers but how the queries will be fast?

Thanks

6 Likes

same question here

1 Like

same question here

  1. While reading, we donā€™t need to filter on creation-time as our primary key has epoch time included in it.

why having the epoch time in the primary key avoid the needs for filtering by creation time? when we have a created_at field in each database i can see how we would ask for the most recents tweets, but having the epoch time in the key only helps our write performance and read load distribution in my view. i canā€™t filter using the epoch time in the ID when doing a read no?

Am i missing something? @Design_Gurus

What it means is that since we donā€™t have to to filter on ā€œanotherā€ column (other than the primary key) therefore the reads will be much faster. We will still have to filter on the primary key (which has the epoch time) but it will be a lot faster than filtering on the ā€˜creation timeā€™.

Imagine the scenario where we are filtering on the ā€˜creation timeā€™, this means we will have a secondary index on ā€˜creation timeā€™. Now compare it with ā€˜filtering only on the primary indexā€™, which one will be faster? "Range queries are always faster on the primary index. This is due to the fact that, the underlying data is organized based on the primary index (and not on secondary indexes).

This post has some good details on primary and secondary indexes: https://www.quora.com/What-is-difference-between-primary-index-and-secondary-index-exactly-And-whats-advantage-of-one-over-another/answer/Siddharth-Teotia

6 Likes

thank you @Design_Gurus

@Design_Gurus But then we still need secondary index on user_id, right?
Because I guess our query would be a range query but then we need to filter by user_id(list of users being followed by the user for which we are creating the newsfeed).

3 Likes

When user check others tweets, our sever should search by user first, which give us lots of tweets ids, then we search these tweets. So we donā€™t need user_id.

I really doubt if that would be a good solution every time for every user we want to build the timeline.
@Design_Gurus We should be using user_id for the secondary index to retrieve the tweet ids only for a particular user. Can you please check again this design?

@Design_Gurus,

  1. So to get top tweets,we still need to sort the data after getting from all servers.
  2. if i need to 100 recent tweets, can i just return last 100 rows from database?Will it always work?

@Design_Gurus could you please clarify this point? Iā€™m also stuck on this: weā€™d be filtering on user_id as well in each shard right?

In all the sharding approaches discussed, since our end goal is to find tweets of a specific user, therefore, we have to filter on UserID. Our sharding approach lets us either jump to the exact server where the data is stored or we have to fan out to all servers and then combine the results. Take the examples of the following two approaches discussed in the chapter:

  1. Sharding based on UserID: In this case, we will filter on UserID and sort on CreationTime to get the top tweets of each user. In this approach, we can directly jump to the server where the required userā€™s data is stored. We will still have a primary key for each Tweet to uniquely identify it, and then we will have secondary indexes on UserID and CreationTime. We didnā€™t discuss it in the chapter but we can adopt a similar epoch-time approach (discussed under 'Sharding by TweetID and Tweet creation time), this will save us the secondary index on creation time as we will be sorting on the primary key (i.e., epoch time).

  2. Combine sharding by TweetID and Tweet creation time: Under this approach, we will query all servers, where each server will filter on UserID, and sort on the primary key (i.e., epoch time). Later, we will combine (and sort again) results on aggregator servers.

1 Like

Thanks for the description @Design_Gurus. I guess I didnā€™t fully appreciate how the primary key itself is by definition an index. Option 2 you mention cleverly uses this property of the primary key instead of adding more write latency through an extra secondary index. Makes sense.