educative.io

URL Deduping clarification

Considering 15 billion distinct URLs and a 4-byte checksum, we would need:

15B * 4 bytes => 60 GB

Is a 4-byte checksum really enough for 15B values? With 4 bytes we can encode only 2^32 => ~4B distinct values, and we would also like to avoid collisions. Am I missing something?
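To make the concern concrete, here is a quick back-of-the-envelope check (the numbers are the ones from the question, not from any particular implementation):

```python
import math

n_urls = 15_000_000_000      # 15B distinct URLs, as stated above
checksum_space = 2 ** 32     # distinct values a 4-byte checksum can take

# By the pigeonhole principle, 15B values cannot fit into ~4.29B slots
# without collisions; on average each checksum value would be shared
# by several URLs.
avg_urls_per_checksum = n_urls / checksum_space
print(f"checksum space: {checksum_space:,}")                      # ~4.29B
print(f"avg URLs per 4-byte checksum: {avg_urls_per_checksum:.2f}")

# Even just to give every URL a distinct value, we would need at least:
min_bits = math.ceil(math.log2(n_urls))
print(f"minimum bits for 15B distinct values: {min_bits}")        # 34 bits
```

And 34 bits is only the bare minimum for distinctness; to keep *random* collisions unlikely (birthday bound), the fingerprint would have to be substantially wider, e.g. 8 bytes rather than 4.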

Also, the Bloom filter memory-consumption analysis is missing. It can take dozens of GB to encode 15B values even with a relatively high FP rate. Does the approach need to be adjusted before it can be applied? https://hur.st/bloomfilter/?n=15000000000&p=1.0E-5
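For reference, the standard Bloom filter sizing formulas (the same ones the calculator linked above uses) give the following for these parameters:

```python
import math

# Optimal Bloom filter sizing:
#   m = -n * ln(p) / (ln 2)^2   -> number of bits
#   k = (m / n) * ln 2          -> number of hash functions
n = 15_000_000_000   # expected distinct URLs
p = 1e-5             # target false-positive rate

m_bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
k = round((m_bits / n) * math.log(2))

print(f"bits:           {m_bits:,}")
print(f"memory:         {m_bits / 8 / 1e9:.1f} GB")  # ~45 GB
print(f"hash functions: {k}")                        # ~17
```

So even at p = 1e-5 the filter needs roughly 45 GB, which supports the "dozens of GB" estimate in the question.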
