Not able to undertand , please explain Sharding Question -

Himanshu_Sangal · July 27, 2020, 2:28pm

How does creating a tweet id by using epoch time and auto sequence number will make read fast?
In this case also, we have to query all the servers but how the queries will be fast?

Thanks

Franklin · May 2, 2019, 1:34am

same question here

Alfredo_Motta · July 12, 2019, 6:30am

same question here

While reading, we don’t need to filter on creation-time as our primary key has epoch time included in it.

why having the epoch time in the primary key avoid the needs for filtering by creation time? when we have a created_at field in each database i can see how we would ask for the most recents tweets, but having the epoch time in the key only helps our write performance and read load distribution in my view. i can’t filter using the epoch time in the ID when doing a read no?

Am i missing something? @Design_Gurus

Design_Gurus · July 12, 2019, 9:36pm

What it means is that since we don’t have to to filter on “another” column (other than the primary key) therefore the reads will be much faster. We will still have to filter on the primary key (which has the epoch time) but it will be a lot faster than filtering on the ‘creation time’.

Imagine the scenario where we are filtering on the ‘creation time’, this means we will have a secondary index on ‘creation time’. Now compare it with ‘filtering only on the primary index’, which one will be faster? "Range queries are always faster on the primary index. This is due to the fact that, the underlying data is organized based on the primary index (and not on secondary indexes).

This post has some good details on primary and secondary indexes: https://www.quora.com/What-is-difference-between-primary-index-and-secondary-index-exactly-And-whats-advantage-of-one-over-another/answer/Siddharth-Teotia

Alfredo_Motta · July 18, 2019, 6:10am

thank you @Design_Gurus

lovekesh_bansal · May 12, 2020, 12:43pm

@Design_Gurus But then we still need secondary index on user_id, right?
Because I guess our query would be a range query but then we need to filter by user_id(list of users being followed by the user for which we are creating the newsfeed).

Toffee · May 16, 2020, 12:19pm

When user check others tweets, our sever should search by user first, which give us lots of tweets ids, then we search these tweets. So we don’t need user_id.

lovekesh_bansal · May 28, 2020, 3:27am

I really doubt if that would be a good solution every time for every user we want to build the timeline.
@Design_Gurus We should be using user_id for the secondary index to retrieve the tweet ids only for a particular user. Can you please check again this design?

rahul9 · November 25, 2020, 10:38pm

@Design_Gurus,

So to get top tweets,we still need to sort the data after getting from all servers.
if i need to 100 recent tweets, can i just return last 100 rows from database?Will it always work?

jimothy · September 19, 2021, 9:44pm

@Design_Gurus could you please clarify this point? I’m also stuck on this: we’d be filtering on user_id as well in each shard right?

Design_Gurus · September 20, 2021, 12:16am

In all the sharding approaches discussed, since our end goal is to find tweets of a specific user, therefore, we have to filter on UserID. Our sharding approach lets us either jump to the exact server where the data is stored or we have to fan out to all servers and then combine the results. Take the examples of the following two approaches discussed in the chapter:

Sharding based on UserID: In this case, we will filter on UserID and sort on CreationTime to get the top tweets of each user. In this approach, we can directly jump to the server where the required user’s data is stored. We will still have a primary key for each Tweet to uniquely identify it, and then we will have secondary indexes on UserID and CreationTime. We didn’t discuss it in the chapter but we can adopt a similar epoch-time approach (discussed under 'Sharding by TweetID and Tweet creation time), this will save us the secondary index on creation time as we will be sorting on the primary key (i.e., epoch time).
Combine sharding by TweetID and Tweet creation time: Under this approach, we will query all servers, where each server will filter on UserID, and sort on the primary key (i.e., epoch time). Later, we will combine (and sort again) results on aggregator servers.

jimothy · September 22, 2021, 3:56am

Thanks for the description @Design_Gurus. I guess I didn’t fully appreciate how the primary key itself is by definition an index. Option 2 you mention cleverly uses this property of the primary key instead of adding more write latency through an extra secondary index. Makes sense.