Bloom filter for dedupe

Dewey_Munoz · October 3, 2021, 11:40pm

Can we use bloom filters for deduping? Bloom filters are a probabilistic data structure for set membership testing that may yield false positives. A large bit vector represents the set. An element is added to the set by computing ‘n’ hash functions of the element and setting the corresponding bits. An element is deemed to be in the set if the bits at all ‘n’ of the element’s hash locations are set. Hence, a document may incorrectly be deemed to be in the set, but false negatives are not possible.

why can’t bloom filter be used for de-duping. is it because it may have false positive. then, we may say that a url was already seen, when it really hadn’t been. this would cause us to skip that url.

Shaheryaar_Kamal · October 8, 2021, 11:20am

Hi @Dewey_Munoz

Yes, what you thought is absolutely right.
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.

For example, checking the availability of username is a set membership problem, where the set is the list of all registered username. The price we pay for efficiency is that it is probabilistic in nature that means, there might be some False Positive results. False positive means , it might tell that given username is already taken but actually, it’s not.

You can also check this lesson if you still feel any confusion. You can also reach me here whenever you want.

Happy learning,