Unclear on why bucket pagination is better than skip and limit for large datasets

Amit_L_Wadhwani · July 13, 2023, 3:36am

Hi i’m reading the section on pagination in designing a search service api. And i’m a little confused why the bucket pagination is better for large data sets than the skip and limit. I also think i’m unclear on what exactly we’re gaining from bucket pagination.

My understanding is that for skip and limit you skip 1 record at a time and the page size can be variable. That seems the most flexible. With the bucket pagination it seems like we have to predefine a page size and we can skip 1 page at a time. I don’t get the advantage. we can do that in skip and limit as well i’m thinking but maybe i’m confused. Can someone help?

Course: https://www.educative.io/courses/grokking-the-api-design-interview
Lesson: https://www.educative.io/courses/grokking-the-api-design-interview/JQjM6Z9Dk1y

lfrah_Dar · July 18, 2023, 9:38am

Hi @Amit_L_Wadhwani,

Your understanding is correct, but let me explain more clearly the difference between these two methods and how they work on the backend.

Both methods send the page number to the front end. However, the main difference is how they handle the data on the backend.

Skip and limit pagination involve skipping one record at a time, and page sizes may vary. While this approach provides flexibility, it can become inefficient when dealing with large datasets. For example, if your search query has 2000 records and each page displays 100 results, retrieving page 4 would require the backend to iterate through 300 unnecessary results. This slows down the process.

Bucket paging, on the other hand, divides data into predefined buckets or segments, each containing a fixed number of results, say 20. Bucket pagination allows you to skip entire buckets at once instead of individual records. This approach reduces unnecessary iterations and improves efficiency, especially when dealing with large datasets.

To illustrate this, you can think of skip and limit pagination as simple search, and bucket pagination as analogous to binary search. Skipping records in skip and limit pagination is similar to searching one by one, while skipping buckets in bucket pagination is like using binary search method, which can navigate data faster and more efficiently.

In summary, bucket paging has advantages in terms of efficiency and reducing iterations for large datasets. It allows backends to skip entire buckets instead of individual records, improving performance.

I hope this explanation address your confusion. Let me know if you have any further questions!

Xuan · April 10, 2024, 1:19pm

@lfrah_Dar I still cannot get it. In the book, it mentions

The bucket pattern divides search results into groups of fixed size.

Which “backend” component in the given design will implement the bucket pattern?
Do we need any change to the database schema to use bucket pattern?
What are the tradeoffs of bucket pattern, such as data redundancy or else?

What happens to the buckets when we have different sorting and filtering for the same query string?

Does the system in the book search for exact query string or a combination of keywords in any order?
For example, if we want to search “data structure”,
do we return only results with exactly the string “data structure”?
Do we return a document which only has the word “data” without the word “structure”?
How does the bucket pattern work if we now implement a fuzzy search feature?

Honestly, it would be easier to understand if you could provide a specific example with responses in each step while data is traveling through each component.

lfrah_Dar · April 15, 2024, 8:27am

Hi @Xuan,

You’ve asked some interesting questions, but in our course, we focus more on explaining how communication happens rather than getting into the nitty-gritty of the search service implementation. We skip the detailed implementation stuff to keep our focus on teaching how clients talk to the backend services. Although, I’ll try to answer all your queries:

Q: Which “backend” component in the given design will implement the bucket pattern?
Search service implements the bucket pattern.

Q: Do we need any change to the database schema to use bucket pattern?
In most cases, you won’t need major schema changes for bucket pagination. You might add fields to documents that indicate the bucket they belong to (e.g., a “bucket_id” field). However, the core data structure usually remains the same.

Q: What happens to the buckets when we have different sorting and filtering for the same query string?
To handle buckets with different sorting/filtering:

We can create separate sets of buckets for each combination of sorting/filtering options. This increases storage requirements but improves performance for specific queries.
Re-compute buckets on the fly based on the requested sorting/filtering. This can be less efficient but avoids redundancy.

Q: What are the tradeoffs of bucket pattern, such as data redundancy or else?
Bucket patterns can introduce some data redundancy, especially if sorting criteria change frequently. We can address this issue by opting any option mentioned in the previous answer

Q :Does the system in the book search for exact query string or a combination of keywords in any order?
Our service focuses more on delivering quick and relevant results than exact matches.

Q: Do we return a document which only has the word “data” without the word “structure”?
It returns a document containing both the terms, regardless of the order. The main condition is that both words are mentioned in search results.

Q: How does the bucket pattern work if we now implement a fuzzy search feature?
Generally, buckets are created based on specific criteria, typically for exact matching. For fuzzy search, we need to implement additional algorithms to extend search capabilities beyond exact matches. This involves techniques like measuring closeness through edit distance calculations and incorporating phonetic encoding to identify similar-sounding words.

Lastly, you’re right about explaining details with examples. However, if we provide more details regarding implementation, beginner-level learners might get confused by these intricacies. Therefore, for simplicity, we intentionally hide implementation details.

Xuan · April 15, 2024, 10:04am

@lfrah_Dar, thank you a lot for your answer!

We can create separate sets of buckets for each combination of sorting/filtering options.

Do you mean to create different database views for known and common sorting/filtering options?

lfrah_Dar · April 16, 2024, 8:03am

@Xuan yes.