For the encoding scheme, it says “If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string having more than 21 characters (since each base64 character encodes 6 bits of the hash value). Now we only have space for 8 characters per short key, how will we choose our key then? We can take the first 6 (or 8) letters for the key. This could result in key duplication, to resolve that, we can choose some other characters out of the encoding string or swap some characters.”
For the first italicized part, where did the 8 characters come from? From the URL table earlier in the article, the PK, hash, is defined as a varchar(16) value, so why is the space constraint not 16 characters but 8? Also, in the TinyURL example in Section 1, the shortened URL “http://tinyurl.com/jlg8zpc” has a short key of 7 characters, not 8 not 16. So what length are we exactly talking about here? I’m confused.
For the second italicized part, how does "choose some other characters out of the encoding string or swap some characters” resolve the problem of key duplication? Can someone illustrate with an example, please? In my imagination/understanding, choosing some other characters means:
Let’s take the MD5 hash for
Base64 encoding for the hash above is:
NmQyMGFjZWRmZjU1MWU2NThkYzRiZmUxZDNmMGFjYmI= (Also, why is this string way more than 21 characters long (44 characters, but it says in the article that after base64 encoding we’ll get a string having more than 21 characters, ok, 44 > 21 but I was expecting something like 22)?)
Choose 6 characters out of the encoded string other than the first 6:
QyMGRm (indices 2-5, 10-11 chosen arbitrarily here)
But let’s say there’s another encoded string and the characters from the same indices also result in the same key QyMGRm . Is this possible? Even if you choose 6 characters from other indices (different from 2-5 and 10-11) for different encoded strings, it can result in the same key, too, right? If so, how does this resolve key duplication? The same goes with swapping characters.