Hope everything is going well. I have a concern regarding the statement in the tutorial that says “a NoSQL store like DynamoDB, Cassandra or Riak is a better choice.”
While I have no doubt that NoSQL is more suitable than SQL for the tinyURL service, I feel I disagree regarding to use leader-less replication DBs like DynamoDB/Cassandra/Riak. I would instead to propose to use a single-leader DB like MongoDB. Here I’ll list my reasons. Please feel free to let me know what do you think.
- On Read/Write RPS
tinyURL service has very high read RPS, but relatively lower write RPS. Single-leader DBs like MongoDB serves higher read throughput as one can read from either the following replica or the leader replica, despite writes has to go through the leader replica. Leaderless DBs like Cassandra typically provides faster writes and slower reads, as read-repair is needed during reading.
- Concurrent Writes
Handling concurrent writes from two users trying to grab the same tinyURL is crucial for tinyURL service, and failing to handle this can result in write skews or phantom reads.
Leader-less DBs like Cassandra/Riak has weaker atomicity guarantee upon concurrent writes. For Cassandra, in such a scenario, Cassandra will perform Last-Write-Wins to resolve conflict. Cassandra does allow “IF NOT EXIST” clause in its INSERT statement, but that will require visiting multiple replicas during writes which slows down writing performance.
Single-leader systems like MongoDB may have better transaction/atomicity support. For MongoDB, every operation on one document is atomic, and if one tries to insert a record whose _id already exists, a duplicated key error will be returned. In case of two users trying to insert a document under the same key, because of the per-document-atomicity guarantee, the slightly earlier user will win while the slightly later user must wait. Then when the slightly later user wants to write, the duplicated key error will be returned. This seems to guarantee that we will never store colliding tinyURLs in the DB.
- Voiding the need of KGS?
Using MongoDB, the need of a KGS seems mostly voided because concurrent writes can be better handled as described above? Upon receiving a duplicated key error, we can simply re-hash the previous tinyURL and try again.
I’m quite strongly against the idea of KGS unless I missed something, but KGS sounds like just a nicer version of “materializing conflicts” in order to fix write skews or phantom reads. “materializing conflicts” works, but is usually strongly unfavored unless there are no other alternative working solutions. The reason is that it mixes application tier business logic with data tier constraints, and is one step backward from the basic design principles of decoupling and separation of concern.
I’m wondering if I missed something? Please feel free to point out and correct me if I’m wrong. Thanks!
- Wilson