The Title Says How to Solve Top-K Problems - But The Key Step Is NOT Well Explained

Peter_Lee · April 23, 2023, 7:40am

One challenge I have been facing when reading educative dot io courses is that, sometimes, the most important step (from my perspective) is missing. For example,

In this course, we are trying to “Using sharded counters for the Top K problem”. I was able to understand how we increment/decrement (write) the counter to Cassandra. But how are we gonna get the local Top-K trends? Is Cassandra able to quickly find the Top-K by just one query? (I really doubt, because there are so many counters). If Cassandra can do it, then what’s the data modeling? i.e., what’s the row key, and what’s the column key? how does the query work?

The following is all that mentions Top-k, which is not useful at all:

When users generate a timeline, read requests are forwarded to the nearest servers, and then the persisted values in the store can be used to respond. This storage also helps to show the region-wise Top K trends. The list of local Top K trends is sent to the application server, and then the application server sorts all the lists to make a list of global Top K trends. Eventually, the application server sends all counters’ details to the cache.

Course: Grokking Modern System Design Interview for Engineers & Managers - Learn Interactively
Lesson: Detailed Design of Sharded Counters - Grokking Modern System Design Interview for Engineers & Managers

Peter_Lee · April 23, 2023, 7:38am

My point here is that, if you search “Top K System design” on YouTube, it’s actually a very challenging one. I highly doubt if one Cassandra query would do it.

Course: Grokking Modern System Design Interview for Engineers & Managers - Learn Interactively
Lesson: Detailed Design of Sharded Counters - Grokking Modern System Design Interview for Engineers & Managers

Ibrahim_Nadir · August 3, 2023, 5:59am

Dear Peter,

You are right to point out that finding top K is a challenging problem in a system where things are always changing. In such a system, one thing is clear no matter how hard we try, it is virtually impossible to show users absolutely current value of counters (or current top K). Additionally, other tradeoffs are also involved, such as user-visible latency if we try to collect all the sharded counters on each read query. So we use offline aggregators and excessive caching to get a good performance. Usually, the nature of top K lists is such that there is quite a bit of difference between consecutive entries of top K, and list might update in the order of hours or day that our caching-based mechanism can easily match. Similarly, we didn’t mean to say that there will be just one instance of Cassandra (that will be a single point of failure!), though the overall number of instances for Cassandra won’t be huge because of controlled write and read traffic due to caches. Application servers can get the most recent values of sharded counters from the aggregators. We can render an initial page to the user in our latency budget and then can update the top K list as more data comes in.

We believe the answers to your questions are in the specific lesson you mentioned. We will suggest a re-read.

Thanks.