How does creating a tweet ID from epoch time plus an auto-incrementing sequence number make reads fast?
In this case too, we have to query all the servers, so how will the queries be fast?
Thanks
same question here
- While reading, we don't need to filter on creation time as our primary key has the epoch time included in it.
Why does having the epoch time in the primary key avoid the need to filter by creation time? Since we have a created_at field in each database, I can see how we would ask for the most recent tweets, but having the epoch time in the key only helps our write performance and read-load distribution, in my view. I can't filter using the epoch time in the ID when doing a read, can I?
Am I missing something? @Design_Gurus
What it means is that since we don't have to filter on another column (other than the primary key), the reads will be much faster. We will still have to filter on the primary key (which includes the epoch time), but that will be a lot faster than filtering on the creation time.
Imagine the scenario where we are filtering on the creation time; this means we will have a secondary index on creation time. Now compare that with filtering only on the primary index: which one will be faster? Range queries are always faster on the primary index, because the underlying data is organized based on the primary index (and not on secondary indexes).
This post has some good details on primary and secondary indexes: https://www.quora.com/What-is-difference-between-primary-index-and-secondary-index-exactly-And-whats-advantage-of-one-over-another/answer/Siddharth-Teotia
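To make the idea concrete, here is a minimal sketch (not from the chapter; the bit split and function names are assumptions) of an epoch-time-prefixed tweet ID. Because the creation time lives in the high bits of the primary key, a "tweets in this time window" query becomes a simple range scan on the primary index, with no secondary index on a separate creation-time column:

```python
# Hypothetical sketch: tweet ID = epoch time (high bits) + sequence (low bits).
# Real systems (e.g., Snowflake-style IDs) choose their own bit layout.

SEQUENCE_BITS = 20  # assumed width for the auto-incrementing sequence


def make_tweet_id(epoch_seconds: int, sequence: int) -> int:
    """Pack epoch time into the high bits, sequence number into the low bits."""
    return (epoch_seconds << SEQUENCE_BITS) | (sequence & ((1 << SEQUENCE_BITS) - 1))


def id_range_for_window(start_epoch: int, end_epoch: int) -> tuple[int, int]:
    """Primary-key range covering every tweet created in [start_epoch, end_epoch)."""
    return make_tweet_id(start_epoch, 0), make_tweet_id(end_epoch, 0)


# The read then looks like:
#   SELECT * FROM tweets WHERE tweet_id >= low AND tweet_id < high
# which is a range scan on the primary index -- no creation_time filter needed.
low, high = id_range_for_window(1_700_000_000, 1_700_003_600)
```

Note that IDs sort by creation time first: any tweet from a later second has a larger ID than every tweet from an earlier second, regardless of sequence number.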
@Design_Gurus But then we still need a secondary index on user_id, right?
Because I guess our query would be a range query, but then we need to filter by user_id (the list of users being followed by the user for whom we are creating the newsfeed).
When a user checks others' tweets, our server should search by user first, which gives us lots of tweet IDs; then we fetch those tweets. So we don't need user_id.
I really doubt that would be a good solution every time we want to build the timeline, for every user.
@Design_Gurus We should be using user_id for the secondary index to retrieve the tweet IDs only for a particular user. Can you please check this design again?
@Design_Gurus could you please clarify this point? I'm also stuck on this: we'd be filtering on user_id as well in each shard, right?
In all the sharding approaches discussed, our end goal is to find the tweets of a specific user, so we have to filter on UserID. Our sharding approach lets us either jump to the exact server where the data is stored, or fan out to all servers and then combine the results. Take the following two approaches discussed in the chapter:
Sharding based on UserID: In this case, we will filter on UserID and sort on CreationTime to get the top tweets of each user. In this approach, we can directly jump to the server where the required user's data is stored. We will still have a primary key for each Tweet to uniquely identify it, and then we will have secondary indexes on UserID and CreationTime. We didn't discuss it in the chapter, but we can adopt a similar epoch-time approach (discussed under 'Sharding by TweetID and Tweet creation time'); this will save us the secondary index on creation time, as we will be sorting on the primary key (i.e., epoch time).
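The "jump directly to the server" part of UserID-based sharding can be sketched in a few lines (shard count and function name are assumptions, not from the chapter):

```python
# Minimal sketch of UserID-based shard routing: one modulo computation
# tells us which server holds a user's tweets, so a single shard is
# queried instead of fanning out to all of them.

NUM_SHARDS = 16  # assumed cluster size


def shard_for_user(user_id: int) -> int:
    """Map a UserID to the shard that stores all of that user's tweets."""
    return user_id % NUM_SHARDS
```

The trade-off, as the chapter notes, is that a hot or high-volume user concentrates all of their read and write load on a single shard.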
Combine sharding by TweetID and Tweet creation time: Under this approach, we will query all servers; each server will filter on UserID and sort on the primary key (i.e., epoch time). Then we will combine (and sort again) the results on aggregator servers.
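The second approach is a scatter-gather. Here is a hedged sketch (data shapes and function names are assumed for illustration): each shard filters its local tweets on the followed users' IDs and returns them sorted by the epoch-time-based primary key, and the aggregator merges the already-sorted per-shard results:

```python
# Hypothetical scatter-gather sketch. Tweets are dicts with "user_id" and
# "tweet_id" keys, where tweet_id is the epoch-time-prefixed primary key,
# so sorting by tweet_id descending yields newest-first order.

import heapq


def query_shard(shard_tweets, followed_user_ids, limit):
    """Per-shard step: filter on UserID, sort on the primary key (newest first)."""
    matching = [t for t in shard_tweets if t["user_id"] in followed_user_ids]
    return sorted(matching, key=lambda t: t["tweet_id"], reverse=True)[:limit]


def aggregate(shards, followed_user_ids, limit=10):
    """Aggregator step: merge the per-shard results, which are already sorted."""
    per_shard = [query_shard(s, followed_user_ids, limit) for s in shards]
    merged = heapq.merge(*per_shard, key=lambda t: t["tweet_id"], reverse=True)
    return list(merged)[:limit]
```

Because each shard returns a sorted list, the aggregator can do a cheap k-way merge rather than a full re-sort of everything it receives.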
Thanks for the description @Design_Gurus. I guess I didn't fully appreciate how the primary key itself is by definition an index. Option 2 you mention cleverly uses this property of the primary key instead of adding more write latency through an extra secondary index. Makes sense.