Why not using HBase instead of Cassandra?

Since we are already in the Hadoop ecosystem (using HDFS), why not use HBase directly. Introducing Cassandra seems to bring extra complexity.

Lesson: Detailed Design of the Typeahead Suggestion System - Grokking Modern System Design Interview for Engineers & Managers


Hello Jiao Peng,
Thanks for asking the question.

If we only consider the functional requirements, you are right that we could have used Hive as well. Though to meet our non-functional requirements in a better way, we have preferred Casandra over HBase. Following are the details.

A system like typeahead suggestion requires a highly scalable and efficient database that can manage enormous amounts of data and provide fast read and write operations. HBase and Cassandra are the popular NoSQL databases that are frequently used for such use cases. The choice between these two databases, nevertheless, might be influenced by a number of factors. For example, in the typeahead suggestion system, we aim to achieve better scalability and fault tolerance, due to which Cassandra is a suitable option here, irrespective of one-time implementation complexities.

In the typeahead suggestion system, our aim is to take data from HDFS, distribute it over different Map-Reduce workers and store the data in a No-SQL database which should be highly scalable and fault tolerant. In the design, we have the aggregator that decouples the HDFS and the Cassandra database. The choice of Cassandra is due to the following reasons:

  • HBase is built on top of the HDFS and uses a primary-secondary architecture, while Cassandra is built on a peer-to-peer architecture. Cassandra can be integrated with Hadoop using Hadoop’s Map-Reduce framework. This means that HBase can support more complex data processing, while Cassandra provides high scalability and fault tolerance. So, Cassandra is a suitable option to meet our non-functional requirements.

  • Cassandra has a flexible schema model, which allows for changes to be made to the schema without any downtime. This is important for typeahead suggestions, as it often requires frequent updates to the underlying data. Cassandra is designed to be distributed across multiple nodes and data centers, making it highly scalable and fault-tolerant. This makes it easier to handle large volumes of data and ensures high availability and data consistency.

I hope this helps!

Thank you.

1 Like