How KGS will handle collisions?

Vijay_Kumar · July 27, 2020, 2:29pm

What should be our approach if KGS has collosins ?

Artur_Baruchi · April 10, 2020, 5:03pm

Hi @Vijay_Kumar, hope you are doing well!
The KGS is an offline service, so it will be generating the hash for us beforehand. To handle collisions is easy, every time a hash is created, check if this exists in the DB (we can have this indexed, but since it is an offline job we don’t need to worry so much about performance when searching for a hash). Basically:

Generate a hash
Does this hash already exists in our table? If yes, ignore it and go back to step 1. Else, store it.

Hope this helps!

Thanks,
Artur Baruchi

Preston_Doris · September 17, 2020, 1:54pm

Hi @Artur_Baruchi, can you explain how the Hash would be generated? I have been trying to understand how the KGS is actually creating these ~6.8 Billion unique keys.

Artur_Baruchi · September 17, 2020, 5:54pm

Hi Preston,
Hope you are doing well.

You can create this 6.8B of keys in several different ways, for example, you can have some function that generates it based in some random seed or you can use some library that generate UIDs and get last 5 bytes or even create some random string with 5 bytes. Benefit of creating it beforehand is speed and you will avoid collisions, since you already take it into account when creating these keys.

You can, of course, create it using the URL you want to reduce on demand (take a look in this book: “System Design Interview – An insider’s guide”, there is a Kindle version which is cheaper). Basically, you can perform a base conversion, if your shortened URL can contain only [0-9aA-zZ], you are using a base 62. But what you gonna use to convert to Base 62? Well, you can use the ID created by the Database (supposing it is a number) and convert it to Base 62. My point is, every time you have to use something to create a hash you will have to handle this random input and once you create it, you have to check if you didn’t create it before and there are several ways to do it (creating it, checking for collisions, etc). In an interview, this is gold, if you know tradeoffs of your solutions… (after all, distributed systems is all about tradeoffs).

Main tradeoffs here are:

Creating beforehand, you have the benefit to speedup the process, since you already have all the hashes. When using a hash you just mark it as used;
Creating when you receive the URL, on demand, you don’t have to use additional storage for storing hashes that you don’t even know if will be used one day, which can be expensive.

Hope this helps!

Artur

rahul9 · December 21, 2020, 6:01pm

why cant we use db auto increment ??

Artur_Baruchi · December 21, 2020, 7:18pm

Hi @rahul9,

If I understand, you want to know why we cant use the auto increment feature to create unique keys, right? Well… the reason is the number of combination. For instance, let’s take a short URL with lenght of 5 characters. The DB Auto increment works with integer numbers, so we can have any number between 0-9 in each position. Therefore, if we have this 5 available positions and assuming we can have repeated number in each position, we would have 10^5 possible combinations… or 100k combinations.

When using base62 (which stands for a-zA-Z0-9) we would have 62^5, or ~916Millions of unique combinations. So, for a system working in this scale, working only with integers is not feasible.

Hope that helps!

Artur Baruchi (Twitter @abaruchi)

rahul9 · December 21, 2020, 7:48pm

Hey,thanks for quick reply.I means take a unique number from db and do base 62 encoding on that number. Assume that we start from the number whose base 62 has exactly 5 length.Or even we can start db count from 0 and in case the hash is of length less than 5 ,we will pad some char in front like 0, it is always unique and we can got till we cross int value whose hash is more than 5.I dont see any problem with this.Please clarify?