The Core Problem

Imagine you are building a web crawler. You have visited 10 billion URLs. Before visiting a new link, you need to check if you've already seen it. Storing 10 billion strings in a standard Hash Set or Database index is astronomically expensive in terms of RAM and disk I/O.

A Bloom Filter solves this by using a tiny fraction of the memory, at the cost of a small probability of being wrong (False Positives).

Advertisement

How It Works

A Bloom Filter consists of:

  1. A bit array of size m, initialized to all 0s.
  2. k independent hash functions, each mapping an input to an index between 0 and m-1.

Operations

  • Add: To add an item, feed it to all k hash functions to get k array indices. Set the bits at all these indices to 1.
  • Query: To check if an item exists, feed it to the same k hash functions.
    • If any of the bits at the resulting indices are 0, the item is definitely not in the set.
    • If all bits are 1, the item might be in the set (or it's a False Positive).
Advertisement

Interactive Visualization

Try adding strings like "apple", "banana", and "cherry". Then try querying for them. Notice how the bits light up. Try querying "date" (which you haven't added) and see if the bits overlap with previous entries to create a False Positive.

The Mathematics of False Positives

Why do false positives happen? Because multiple items might flip the same bits by coincidence. If you query a new item and its hash functions happen to point to bits that were already turned on by other items, the filter will incorrectly claim existence.

The probability of a false positive depends on:

  • m: Size of bit array (larger is better).
  • n: Number of items inserted.
  • k: Number of hash functions.

You can tune these parameters. For a 1% false positive rate, you generally need about 9.6 bits per item.

Real-World Use Cases

  • Databases (Cassandra, HBase, Postgres): Used to avoid expensive disk lookups for non-existent rows. If the Bloom filter says "no", the DB doesn't even touch the disk SSTables.
  • CDNs (Akamai): To avoid caching "one-hit wonders" (content accessed only once). Only cache items that pass the Bloom filter check (meaning they've been seen at least once before).
  • Cryptocurrencies: Bitcoin uses Simplified Payment Verification (SPV) wallets that use Bloom filters to request only relevant transactions from full nodes without revealing exact addresses.
  • Medium: Uses Bloom filters to prevent recommending articles you've already read.

Limitations

  • No Deletion: You generally cannot remove an item from a standard Bloom Filter. Turning a bit from 1 to 0 might inadvertently remove another item that shared that bit. (Counting Bloom Filters solve this but use more space).
  • Probabilistic: You must handle the false positive case in your application logic (e.g., by falling back to a disk check).