educative.io

Choosing Cassandra over others. Why? And some general doubts - Designing Instagram

It is given that "We need to store relationships between users and photos, to know who owns which photo. We also need to store the list of people a user follows. For both of these tables, we can use a wide-column datastore like Cassandra."

  1. The reason why Cassandra is chosen over the others is not clear. Why not other NoSQL DBs?

  2. For the TinyURL problem, there is a table of user IDs and the URLs each user created, so there is a relation. But the text says "Since we anticipate storing billions of rows, and we don’t need to use relationships between objects – a NoSQL store like DynamoDB, Cassandra or Riak is a better choice." Why do 1) and 2) give two different reasons for using Cassandra?

  3. What is a clear way of choosing NoSQL over SQL? I read the blog post about the differences, but there is no concrete point that says when to use one over the other.

  4. When should one use Cassandra, MongoDB, or another NoSQL store?

2) I believe we don’t need a relational database because the use cases don’t actually require any queries involving joins, i.e., you’re not querying the DB for all URLs for a particular user or something of that sort.
Concerning your other questions, I honestly have the same questions.

I have a similar question: why is Cassandra chosen for Instagram? I have no experience with wide-column DBs, so I would appreciate any pointers. E.g., I would have gone for a regular NoSQL store, just with different collections. Is the choice purely for I/O efficiency?

I guess the reason to choose Cassandra over other DBs is that Cassandra provides higher availability than others like MongoDB, because it uses a peer-to-peer architecture while MongoDB uses master-slave. In a peer-to-peer architecture, if one node goes down, the system can still take reads and writes. In a master-slave architecture, if the master goes down, the system can still serve reads, but a slave has to be promoted to master before writes can resume. That promotion takes some time x, and for that time the system is unavailable for writes.
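
To make the availability argument above a bit more concrete, here is a minimal Python sketch. It is a toy model, not how either database is implemented: the function names are made up for illustration, and it assumes the peer-to-peer cluster accepts a write once a majority (quorum) of replicas acknowledge it, which is one of the consistency levels Cassandra lets you choose.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def leaderless_write_ok(replication_factor: int, replicas_down: int, required_acks: int) -> bool:
    """Peer-to-peer (hypothetical model): any live replica can take the write,
    so it succeeds as long as enough replicas are up to acknowledge it."""
    return replication_factor - replicas_down >= required_acks


def single_primary_write_ok(primary_up: bool) -> bool:
    """Master-slave (hypothetical model): writes go through the master only,
    so they are rejected until a new master has been promoted."""
    return primary_up


rf = 3
w = quorum(rf)

print(w)                                         # 2
print(leaderless_write_ok(rf, 1, w))             # True: one node down, writes continue
print(leaderless_write_ok(rf, 2, w))             # False: quorum lost
print(single_primary_write_ok(primary_up=False)) # False: waiting for promotion
```

The trade-off is that the peer-to-peer model only stays writable while a quorum survives, and a lower required acknowledgement count buys more availability at the cost of weaker consistency.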

I have the same doubts… This series keeps rotating the technology for different topics, which makes it more confusing.

  • All your questions are valid, but unfortunately there are no proper answers.

@Design_Gurus could you please elaborate on the choice of Cassandra? It is also not clear whether the “key-value” store mentioned refers to Cassandra or to a third database. In addition to the above, Cassandra stores data column-wise, not row-wise, so if we need to fetch all photos of a user, and each photo is in a different column, the reads will be slow.

We didn’t really propose using Cassandra; instead, we presented a few solutions. Let’s first go through them and then see if any specific solution stands out:

  1. First, we presented an RDBMS approach (e.g., MySQL). All metadata goes in the tables (the schema is given), and the actual photos go either in distributed storage like HDFS or in cloud storage like S3. Since we do need joins, an RDBMS seems like an easy choice, but these databases come with their own challenges around scaling and reliability (compare this with Cassandra, which does quorum-based reads and writes, so its reliability and performance are measured differently). Having said this, these RDBMS issues are manageable (though difficult, especially when building a global service like Instagram). For example, Facebook stores most of its social graph in MySQL and has scaled it well. Facebook probably has the world’s largest MySQL deployment.

  2. Any NoSQL store can work too, but we will need to store the relations (or “joins”) ourselves. For example, to find “Followers”, we will need to store “follower” and “followee” in a key-value pair. This could be any NoSQL store like Redis, Amazon’s DynamoDB, etc. Here are the top key-value data stores: https://db-engines.com/en/ranking/key-value+store

  3. Cassandra could be a good fit here. For example, we can store all the “Followers” in separate columns for a “Followee”. A column store, in our case, will give good performance, but it has less flexibility (ref: https://en.wikipedia.org/wiki/NoSQL#Performance). Notably, Facebook (the original developer of Cassandra) has nearly stopped using Cassandra because of its complexity and lack of flexibility. Facebook has developed its own key-value store called ZippyDB (find their presentation here: https://engineering.fb.com/core-data/inside-data-scale-2015/)
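
As a rough illustration of options 2 and 3, here is a small Python sketch of the two layouts for the follower relation. It uses in-memory dictionaries only, with no real database client. The class names, the `followers:<id>` key format, and the `followed_at` value are all hypothetical. It also assumes the simplest kind of key-value store, where the value is one opaque blob; real stores such as Redis offer richer value types.

```python
from collections import defaultdict


class KeyValueFollows:
    """Option 2 (toy model): plain key-value store, follower list is one value."""

    def __init__(self):
        self._store = {}  # key -> opaque value

    def follow(self, follower_id, followee_id):
        key = f"followers:{followee_id}"
        # Read-modify-write of the whole value for every new follower.
        followers = set(self._store.get(key, ()))
        followers.add(follower_id)
        self._store[key] = frozenset(followers)

    def followers(self, followee_id):
        return sorted(self._store.get(f"followers:{followee_id}", ()))


class WideRowFollows:
    """Option 3 (toy model): row key = followee, one column per follower."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {column name -> column value}

    def follow(self, follower_id, followee_id, followed_at):
        # Adding a follower writes one new column; the rest of the row is untouched.
        self._rows[followee_id][follower_id] = followed_at

    def followers(self, followee_id):
        # All followers sit in one row, so this is a single-row read with no join.
        return sorted(self._rows.get(followee_id, {}))


kv = KeyValueFollows()
wide = WideRowFollows()
for follower in ("bob", "carol"):
    kv.follow(follower, "alice")
    wide.follow(follower, "alice", followed_at="2020-01-01")

print(kv.followers("alice"))    # ['bob', 'carol']
print(wide.followers("alice"))  # ['bob', 'carol']
```

Both layouts answer "who follows alice?" with a single lookup. The difference in this sketch is on the write path: the key-value version rewrites the whole list, while the wide-row version adds one column.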

First things first: although we have not given a clear answer, the above discussion is very relevant for a system design interview. You should focus on this, it is very important! In an interview, presenting different options and knowing their trade-offs matters more than naming a single database.

Finally, it looks like Cassandra or a simpler key-value store could solve our problem efficiently. But, hey, Facebook is storing their social graph in an RDBMS!