// computers › Hash-Based

Bloom Filter

A memory-light structure that says 'definitely not present' for certain, or 'maybe present'.

set k bits at hash1(x), hash2(x), ... hashk(x); query: all those bits set => maybe, any unset => definitely no

Items added to the filter (comma-separated) Query Bits Hashes (k)

Frequently asked questions

How is a Bloom filter like a teacher ruling out a name instantly?

A teacher glancing at the roll can say 'Kavya is definitely not in this class' in an instant, so they never bother digging through full records. A Bloom filter gives exactly that kind of fast, certain 'no'. It cannot promise a definite 'yes' - for that you check properly - but its rock-solid 'definitely not here' lets you skip the expensive full search for the vast majority of misses, which is its whole purpose as a cheap pre-check.

How can a Bloom filter be certain about 'no' but only guess about 'yes'?

When an item is added, the filter sets the specific bits its hashes point to. So if you query an item and find any one of its bits still 0, that item could not possibly have been added - setting it would have flipped that bit to 1. That makes 'definitely not present' a certainty. But a 'yes' is only a maybe: those bits might all be 1 because other items happened to set them, not because your item was ever added. Hence false positives are possible, false negatives never.

What is a false positive here, and why is it tolerable?

A false positive is the filter saying 'possibly present' for an item that was never added, because unrelated items collectively set all of its bits to 1. It is tolerable because of how Bloom filters are used: as a cheap first gate before an expensive real lookup. A false positive just means you do the expensive check unnecessarily and find nothing - a small waste. What you can always trust is the 'no', which lets you skip the expensive check entirely for the vast majority of misses.

Why use a Bloom filter instead of just a hash set?

Memory. A hash set must store every actual item (or its full hash), which is expensive at huge scale. A Bloom filter stores only a bit array - no items at all - so it can represent membership of millions of elements in a tiny fraction of the space. The trade-off is the occasional false positive and the inability to list or remove items. When you have billions of keys and only need a fast 'probably not here' gate, that space saving is enormous.

How do the number of bits and number of hashes affect accuracy?

More bits relative to the number of items means fewer collisions and so fewer false positives. The number of hash functions k has a sweet spot: too few and the bit patterns are not distinctive enough; too many and the array fills with 1s too quickly, again raising false positives. There is a known formula for the optimal k given the array size and expected item count. The demo lets you change both so you can watch the array fill up and false positives become more likely.

Where are Bloom filters used in real systems?

Databases like Cassandra and HBase use them to avoid disk reads: before fetching a row from disk, a Bloom filter says whether it could possibly exist, skipping the read entirely on a 'no'. Web browsers and security tools use them to check URLs against huge malicious-site lists cheaply. CDNs and caches use them to decide whether content is worth looking up. Spell-checkers and de-duplication systems use them too. The common thread is a cheap, memory-light 'is this definitely not here?' gate in front of expensive work.