Does Key Generation Service even solve any real problems without introducing new ones?

Kedar_Patwardhan · July 27, 2020, 2:28pm

The random key generator is used to prepolutate the DB with blank keys that don’t have any source URLs. However, I don’t understand whether they solve any real life problem.

When inserting a new key/source URL pair in the database, we need to first find the key location to insert, which is a “lookup” operation in the DB (most likely some BTree). After looking-up the key, it’s value would be modified to point to the source URL. Now, whether the key preexisted as a blank key or not does not save us any IOPS on the DB server.
Also, an implication of storing blank keys in the DB implies we cannot check whether a user is adding duplicate key/source URL entries and prevent it. This is so because since the key is not derived as a hash of the source URL, we cannot check if the key/source URL mapping already exists. Now, I agree that the cost of storage is low enough that we can accept duplicates from the same user but if we don’t have any extra cost to implement this feature, why not ?
Given that we don’t need a cryptographically secure hash of the source URL to compute a key from it, the cost of a murmur3 hash of the source URL is very low, typically 150/200 nano seconds. I think this low cost makes the random key generator quite useless or in-fact disadvantageous since we lose the feature mentioned in (2).
With murmur3 hash, there would be collisions but the probability is very low. Consequently, when inserting the key/source URL in the BTree, a simple check for preexisting key in the DB would allow us to modify the key should there be a collision. The way the key would be modified in such a case, is to create new key by adding a byte to the byte stream of the original key till we get a new unique key.

So then again, what real life problems are being solved by the KGS and is it even worth having one ?

Ave · November 5, 2019, 7:43am

I also thought about this for quite a while and from my perspective, having a KGS makes more sense.

For the argument in #2, there are use cases where you don’t want to return the same key for duplicate URLs. Here are some use cases I could of for both ways.

#1 return the same key for the same URL
PRO: Save storage space
CONS:

You pretty much have to go with a hashing based solution which has its own problems regarding conflicts
It’s more complicated to implement when you start to thinking about what are identical urls really mean? Like for example, if you have a raw url and an encoded url, should it also return the same key? What about with a URL with and without a protocol?

#2 Generate a new key for every URL
PRO:

Easier to implement
Can easily avoid conflicts with the right implementation
Can allow future enhancements or premium services such are providing analytics. For example if you have a premium service where users pay to get analytics then they’ll want to know how many hits they’re getting from the short URL they paid for. If you were to return the same key for the same URL, two users will get the same URL and there’s no way to know if traffic is coming from their campaign or from somewhere else.

CONS:

Takes up more storage but storage is cheap today

Also,
For the points in #1 & #4, we can afford to have latency in the KGS because it’s happening offline. To the users, when asking for a new key, we can guarantee there will be no conflicts, and getting a key back is very fast. For the hash based method, there’s no way to guarantee 0% conflicts and when there’s a conflict, the latency produced for finding a new key is noticeable to the end users.