Sharding by PhotoIDs

SubbuY · February 21, 2021, 10:26pm

In the Partitioning based on PhotoID section, if we can generate unique PhotoIDs first and then find a shard number through “PhotoID % 10”, the above problems will have been solved.

First, this system is read-heavy. With sharding based on PhotoID % 10, a user's photos are spread across different shards. This means that to display photos to a user who wants to view his/her own photos, or to the followers, there has to be an intermediate scatter/gather step before the result is updated in the cache. Isn’t partitioning the photos by UserID a better approach? That way the read is much faster. Users with far more photos than average would not be part of this sharding scheme.
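To make the concern concrete, here is a rough sketch (not the course's implementation) that counts how many shards one read has to touch under each scheme. The shard count of 10 comes from the "PhotoID % 10" example; the user ID, the photo IDs, and the function names are made up for illustration.

```python
NUM_SHARDS = 10  # from the "PhotoID % 10" example


def shard_by_photo_id(photo_id: int) -> int:
    """Shard chosen from the photo's own ID."""
    return photo_id % NUM_SHARDS


def shard_by_user_id(user_id: int) -> int:
    """Shard chosen from the owner's ID, so one user's photos stay together."""
    return user_id % NUM_SHARDS


# Hypothetical data: user 42 owns 25 photos with consecutive IDs.
user_id = 42
photo_ids = list(range(1000, 1025))

# Shards that must be queried to list all photos of this user.
shards_photo = {shard_by_photo_id(p) for p in photo_ids}
shards_user = {shard_by_user_id(user_id)}

print(len(shards_photo))  # 10 -> scatter/gather over every shard
print(len(shards_user))   # 1  -> single-shard read
```

With PhotoID sharding the read fans out to all 10 shards, while with UserID sharding it hits one shard, at the cost of uneven shard sizes when some users have many more photos than others.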

Harvinder_Bholowasia · March 25, 2021, 8:13am

Completely agree, creating the newsfeed with PhotoID partitions will be slow. You need to search by user during feed generation… sorted by timestamps.

ys11 · July 3, 2021, 4:47pm

You are right. By sharding based on PhotoID, we have to query all the shards to get the followers’ metadata, which is very inefficient.

Pankaj_S · October 13, 2021, 6:01pm

When designing Twitter, aggregation was done to collect a user's tweets when sharding by TweetID, so I'm not sure why it would not be required when sharding by PhotoID here as well. It seems that the solutions were given by different contributors. The only difference I can think of is that we are talking about metadata here, but if it's metadata, it's small and UserID sharding is the better approach. Need some clarification.

Akash_Jain1 · October 18, 2021, 11:28am

Two approaches to look at it IMHO:
1. First approach: Maintain separate denormalized/materialized views for different query patterns. Here, we can have one denormalized view with userId as the partitioning key and another with photoId as the partitioning key. The first view will be used for query patterns pivoted on the user, the second for those pivoted on photoId itself. Please note it's perfectly acceptable in NoSQL DBs to duplicate data to optimize read queries.

2. Second approach: Choose the partitioning key intelligently, keeping a balance between two aspects: the size of a shard vs. the number of shards to read for a read query.
Choose userId as the partition key combined with hash of the photoId % 4; this ensures each user's data is distributed roughly across 4 partitions and also limits the reads to at most 4 partitions!
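
Both approaches can be sketched in a few lines. This is only an illustration under my own assumptions: the in-memory dicts stand in for two NoSQL tables, MD5 is just one possible stable hash, and all names (`save_photo`, `partition_key`, `read_keys`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

SUB_PARTITIONS = 4  # the "% 4" from the second approach

# Approach 1: two denormalized views of the same photo metadata.
photos_by_user: Dict[int, List[dict]] = defaultdict(list)  # keyed by userId
photo_by_id: Dict[int, dict] = {}                          # keyed by photoId


def save_photo(user_id: int, photo_id: int, created_at: int) -> None:
    """Write the same row to both views (data is duplicated on purpose)."""
    row = {"user_id": user_id, "photo_id": photo_id, "created_at": created_at}
    photos_by_user[user_id].append(row)
    photo_by_id[photo_id] = row


# Approach 2: composite partition key (userId, hash(photoId) % 4).
def stable_hash(value: str) -> int:
    """Hash that gives the same result on every run and every machine."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def partition_key(user_id: int, photo_id: int) -> Tuple[int, int]:
    """Partition that stores this photo's metadata."""
    return (user_id, stable_hash(str(photo_id)) % SUB_PARTITIONS)


def read_keys(user_id: int) -> List[Tuple[int, int]]:
    """All partitions to query for one user's photos: at most 4."""
    return [(user_id, bucket) for bucket in range(SUB_PARTITIONS)]


save_photo(user_id=7, photo_id=1001, created_at=1)
save_photo(user_id=7, photo_id=1002, created_at=2)
print(len(photos_by_user[7]))             # 2 rows in the user view
print(photo_by_id[1002]["user_id"])       # 7, looked up from the photo view
print(partition_key(7, 1001) in read_keys(7))  # True
```

The trade-off in the second approach is the bucket count: a larger value spreads a heavy user's data over more partitions but makes every read for that user touch more of them.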