
Design Tiny URL: MongoDB or Cassandra/Riak?

Hope everything is going well. I have a concern regarding the statement in the tutorial that says “a NoSQL store like DynamoDB, Cassandra or Riak is a better choice.”

While I have no doubt that NoSQL is more suitable than SQL for the tinyURL service, I disagree with using leaderless-replication DBs like DynamoDB/Cassandra/Riak. I would instead propose using a single-leader DB like MongoDB. Here I'll list my reasons. Please feel free to let me know what you think.

  1. On Read/Write RPS

The tinyURL service has very high read RPS but relatively low write RPS. Single-leader DBs like MongoDB serve higher read throughput, since one can read from either a follower replica or the leader replica, even though writes have to go through the leader. Leaderless DBs like Cassandra typically provide faster writes and slower reads, as read repair may be needed during reads.
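
As a minimal sketch (assuming a standard MongoDB replica set and the pymongo driver; the connection string, database, and collection names are placeholders), reads can be routed to follower replicas while writes still go through the leader:

```python
from pymongo import MongoClient, ReadPreference

# Placeholder replica-set connection string.
client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")

# Prefer reading from secondary (follower) members, falling back to the primary.
db = client.get_database(
    "tinyurl", read_preference=ReadPreference.SECONDARY_PREFERRED
)
urls = db.get_collection("urls")

# Reads can be served by a follower replica...
doc = urls.find_one({"_id": "abc123"})

# ...while writes are always routed to the leader (primary).
urls.insert_one({"_id": "xyz789", "long_url": "https://example.com/some/long/path"})
```

Note that secondary reads can return slightly stale data due to replication lag, which is usually acceptable for a URL shortener.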

  2. Concurrent Writes

Handling concurrent writes from two users trying to grab the same tinyURL is crucial for the tinyURL service, and failing to handle this can result in write skew or phantom reads.

Leaderless DBs like Cassandra/Riak have weaker atomicity guarantees under concurrent writes. In such a scenario, Cassandra resolves the conflict with last-write-wins. Cassandra does support an "IF NOT EXISTS" clause in its INSERT statement, but that is a lightweight transaction that has to visit multiple replicas during the write, which slows down write performance.
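
For illustration, here is a minimal sketch of that conditional insert using the DataStax Python driver (the contact point, keyspace, and a table like `urls (short_key text PRIMARY KEY, long_url text)` are all assumptions); the `IF NOT EXISTS` path is a lightweight transaction, which is why it touches multiple replicas:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])       # placeholder contact point
session = cluster.connect("tinyurl")   # placeholder keyspace

# Conditional insert: only succeeds if no row with this short_key exists yet.
row = session.execute(
    "INSERT INTO urls (short_key, long_url) VALUES (%s, %s) IF NOT EXISTS",
    ("abc123", "https://example.com/some/long/path"),
).one()

# The first column of the result is the [applied] flag:
# True if our insert won, False if another writer already claimed the key.
if not row[0]:
    print("abc123 is already taken")
```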

Single-leader systems like MongoDB may have better transaction/atomicity support. In MongoDB, every operation on a single document is atomic, and inserting a record whose _id already exists returns a duplicate key error. If two users try to insert a document under the same key, the per-document atomicity guarantee means the slightly earlier user wins, while the slightly later user's insert fails with a duplicate key error. This seems to guarantee that we will never store colliding tinyURLs in the DB.
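
A minimal sketch of that behavior with pymongo (names and URLs are placeholders); `_id` has a unique index by default, so the second insert of the same key fails instead of overwriting:

```python
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

urls = MongoClient("mongodb://localhost:27017")["tinyurl"]["urls"]

# The earlier writer claims the key.
urls.insert_one({"_id": "abc123", "long_url": "https://example.com/first"})

try:
    # A slightly later writer trying to use the same short key...
    urls.insert_one({"_id": "abc123", "long_url": "https://example.com/second"})
except DuplicateKeyError:
    # ...gets a duplicate key error instead of silently overwriting the first.
    print("abc123 is already taken; the earlier write wins")
```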

  3. Voiding the need for a KGS?

With MongoDB, the need for a KGS seems mostly voided, because concurrent writes can be handled as described above. Upon receiving a duplicate key error, we can simply re-hash the previous tinyURL and try again.
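
As a sketch of that retry loop (the hash scheme and the `save_url` helper are hypothetical, reusing the pymongo collection from the example above):

```python
import hashlib

from pymongo.errors import DuplicateKeyError

def short_key(s: str) -> str:
    # Hypothetical key derivation: first 7 hex characters of a SHA-256 digest.
    return hashlib.sha256(s.encode()).hexdigest()[:7]

def save_url(urls, long_url: str, max_attempts: int = 5) -> str:
    candidate = short_key(long_url)
    for _ in range(max_attempts):
        try:
            urls.insert_one({"_id": candidate, "long_url": long_url})
            return candidate
        except DuplicateKeyError:
            # Collision: re-hash the previous candidate and try again.
            candidate = short_key(candidate)
    raise RuntimeError("could not find a free short key")
```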

I'm quite strongly against the idea of a KGS unless I missed something; it sounds like just a nicer version of "materializing conflicts" to avoid write skew or phantom reads. Materializing conflicts works, but is usually strongly disfavored unless there is no other workable solution. The reason is that it mixes application-tier business logic with data-tier constraints, and is a step backward from the basic design principles of decoupling and separation of concerns.

I'm wondering if I missed something. Please feel free to point it out and correct me if I'm wrong. Thanks!

  • Wilson

Agree on all points except the last.

The KGS could be where we write the business logic to generate and offer keys. Maybe we want readable random characters and want to offer them to paying customers. Or we want to filter out bad words. Or the data science team found that having 3 numbers in the URL increases clicks, so we want to generate more of those.
Of course, this has nothing to do with maintaining data consistency (the reason the author discussed having the KGS), but design-wise I do think it is a good idea to have; a rough sketch of that kind of key-generation logic is below.
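
Here is a hypothetical sketch of that kind of stateless key-generation logic (the block list, alphabet, and the 3-digit rule are all made-up examples):

```python
import random
import string

BAD_WORDS = {"damn"}                              # placeholder block list
ALPHABET = string.ascii_lowercase + string.digits

def generate_key(length: int = 7) -> str:
    while True:
        key = "".join(random.choices(ALPHABET, k=length))
        if any(word in key for word in BAD_WORDS):
            continue                              # filter out offensive substrings
        if sum(c.isdigit() for c in key) != 3:
            continue                              # prefer keys with exactly 3 digits
        return key
```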

Thanks for the opinion. But I'd like to clarify that what I'm against is not necessarily the KGS itself, but a KGS that is backed by its own key store. If the KGS is stateless, I'm fine with that.

Essentially, with a KGS store we have two databases: the KGS store and the tinyURL store. These two data stores are not independent, and state across them needs to stay consistent. Trying to maintain this consistency is basically reinventing distributed transactions, which are very hard for application engineers to get right. Distributed transactions are also not fault-tolerant: they require all involved nodes to be up and running, so their fault tolerance is worse than that of a single node, which only requires one node to be up. This is what I meant by saying the KGS impacts data consistency.

If the KGS is just a stateless service (think of it as a cloud function) that the other services depend on, there is no data consistency impact. But once we add another store to keep pre-generated keys, we are opening a can of worms.