I’m really confused. And it seems like a lot of other people are too. How does sharding on creation time (epoch seconds) create hot spots? Aren’t we hashing the ID and thus putting tweets on random shards? Even if we didn’t hash the ID, it would still have to get modulo’d and that would be evenly distributed—one for each shard round-robin style. The only way there would be hot spots is if each shard was range based and filled up sequentially, one after the other. Shard gets full, then a brand new empty shard comes in and starts filling up with the next chronological batch of tweets.
Second question: if we used the combo ID made up of epoch and sequence number, how do we query tweets based on time?
- While reading, we don’t need to filter on creation-time as our primary key has epoch time included in it.
Using your example, we have an epoch timestamp of 1483228800 milliseconds that gets encoded into the first 31 bits of our 48 bit ID. We use 000001 as the sequence number so altogether, the final ID is represented like 1483228800000001. This gets bit shifted and looks nothing like the original. Since the ID gets hashed and all the tweets go to random shards, we have to query every single shard if we want a specific time or time range. How do we then query the ID based on the first section only that contains the encoded timestamp while ignoring the last half that has the sequence number? Wouldn’t the database have to filter by creation time separately? The recommended DB solution was Cassandra which can’t even do that without a full scan.