Sagi Kedmi

Sagi Kedmi

Jul 29, 17

Bloom Filters for the Perplexed

A rusty engineer's take on a potent data structure


Discussion on Hacker News.

Bloom filters are one of those simple and handy engineering tools that any engineer should have in their toolbox.

It is a space-efficient probabilistic data structure that represents a set and allows you to test if an element is in it.

They are really simple to construct and look like this:

bloom filter

The cool thing is that to achieve such space-efficiency Bloom filters allow for errors with an arbitrarily small probability!

Remarkably, a Bloom filter that represents a set of 11 million items with an error rate of 0.010.01 requires only 95850599585059 bits (~1.14MB\text{\textasciitilde}1.14 \text{MB})! Irrespective of the length of each element in SS!

There’re many kinds of Bloom filters. We’ll cover the Standard Bloom Filter as it is the basis for the others and also serves as a nice intro for more advanced probabilistic set-membership data structures such as Cuckoo Filter.

What’s on our menu for today?

All code snippets can be found on Github.

Academic fellows might want to review Bloom’s original 19701970 paper [1]. Strikingly, Bloom has only 33 references in his article!

Standard Bloom Filters

A Bloom filter represents a set S={x1,x2,,xn}S=\{x_1, x_2, \ldots, x_n\} of nn elements by an array of mm bits (initially set to 00). Lets call it BB.

B=[b1,b2,b3,b4,,b_m1,bm]B = [b_1, b_2, b_3, b_4, \ldots, b\_{m-1}, b_m]

A Bloom filter uses kk independent hash functions h1,,hkh_1, \ldots, h_k that are assumed to be uniformly distributed over the range {1,,m}\{1, \ldots, m \}.

Standard Bloom filters have two operations:

1. Add(xx) - Add xx to the Bloom filter.

For 1ik set B[hi(x)]=1\text{For} \space 1 \leq i \leq k \space \text{set} \space B[h_i(x)]=1

To represent set SS by Bloom filter BB we need to Add all xSx \in S.

2. Contains(yy) - Check if yy is in the Bloom filter.

If for all 1ik , B[hi(y)]==1return true, otherwise false\text{If for all} \space 1 \leq i \leq k \space, \space B[h_i(y)]==1 \\ \text{return true, otherwise false}

To make things clearer, here is a hand-drawn illustration (heavily inspired by Julia Evans (@b0rk) blog illustrations):

hand drawn bloom filter

False-Negatives and False-Positives

If Contains(yy) returns False, clearly yy is not a member of SS. This property is why we say Bloom filters have zero false-negatives. A situation where Contains(ee) is False and eSe \in S simply can’t happen (because otherwise all relevant bits were 11).

If Contains(yy) returns True, yy may or may not be a member of SS. This is why we say that Bloom filters may have false-positives, because a situation where Contains(ee) is True and eSe \notin S can occur.

Arbitrarily Small Probability of Error (False-Positive)

The cool thing about Bloom filters is that based on nn (number of elements in the set SS) and a chosen probability of false positives PFPP_{FP} we can derive optimal kk (number of hash functions) and mm (length of bit vector BB).

These are the formulae (we derive those in the Appendix):

k=lnPFPln2,m=nlnPFP(ln2)2k=-{\frac {\ln P_{FP}}{\ln 2}}, \enspace m=-{\frac {n\ln P_{FP}}{(\ln 2)^{2}}}

A very cool result is that the optimal number of hash functions kk depends only on the false-positive probability PFPP_{FP}.

Lets play with it:

import math

def optimal_km(n, p):
  ln2 = math.log(2)
  lnp = math.log(p)
  k = -lnp/ln2
  m = -n*lnp/((ln2)**2)
  return int(math.ceil(k)), int(math.ceil(m))

print(optimal_km(10**6, 0.01)) # Returns (7, 9585059)

Remarkably, a Bloom filter that represents a set of 11 million items with a false-positive probability of 0.010.01 requires only 95850599585059 bits ( 1.14MB~1.14 \text{MB}) and 77 hash functions. Only 9.69.6 bits per element! (95850591000000\frac{9585059}{1000000}).

Toy Implementation

The first thing that comes to mind is how exactly do we get a family of hash functions h1,,hkh_1, \ldots, h_k that are uniformly distributed?

Since our objective is learning and we are not aiming for efficiency, we can simply build our family of hash functions on top of an existing hash function as a primitive. Such as SHA256.

One such construction is the following:

hi(s)=SHA256(i  s) mod mh_i(s) = \text{SHA256}(i\space||\space s) \text{ mod } m


  • ii - the number of the hash function.
  • ss - the string to be hashed.
  • mm - the length of the bit vector.
  • || - string concatenation.

And with Python:

def h_i(i, m, s):
  return int(sha256(bytes([i]) + s).hexdigest(), 16) % m

Recall that the standard bloom filter had two simple operations:

class bloom_filter:
  def __init__(self, m, k, h=h_i):
    self.m = m # length of bit array
    self.k = k # number of hash functions
    self.h = h # hash function
    self.bits = bitarray(self.m) # the actual bit store

  def add(self, s):
    for i in range(self.k):
      self.bits[self.h(i, self.m, s)] = 1

  def contains(self, s):
    for i in range(self.k):
      if self.bits[self.h(i, self.m, s)] == 0:
        return False
    return True # all bits were set

Example: Efficiently Verify Compromised SSH Keys

In 2008, Luciano Bello discovered that Debian’s OpenSSL random number generator was seriously flawed and depended only on a process id integer.

Without getting into the gory details, essentially it means that between 2006 and 2008, cryptographic keys that were generated using OpenSSL on Debian-based operating systems were limited to 32768 different possible keys given their type (e.g. RSA) and size (e.g. 1024 bit).

Fortunately, g0tmi1k (@g0tmi1k), generated all possible compromised keys and uploaded them to a repo.

Say we’d like to build a system that allows users to query whether their keys were compromised.

Moreover say we’d like our system to use a minimal amount of bandwidth and/or requests to the server.

Using a bloom filter for that purpose fits perfectly! Lets see why:

Note that the initial client request is just a fetch (GET) request, the actual key X isn't sent with it.

Since Bloom filters have zero false negatives, if key XX isn’t in Bloom filter BB we know that XX isn’t compromised with certainty! Therefore, no more requests needed.

The other case is that XX is in BB:

Note that the initial client request is just a fetch (GET) request, the actual key X isn't sent with it.

So a request to verify the state of XX must be sent to the server as it may be either a true positive (XX is in BB and XX is compromised) or a false positive (XX is in BB but XX isn’t compromised).

The nice thing about bloom filters is that the size of the Bloom filter is dependent on the false positive probability that is arbitrarily chosen by us!

So if we notice a lot of false positives we can just tune the Bloom filter to our needs and sacrifice some bandwidth for more accuracy.

In Github there’s which populates a toy Bloom filter with the compromised keys and tests its effectiveness:

$ ./
[+] Number public keys in ./dsa/1024: 32768. Total length: 19004952 bytes (18.12Mb).
[+] False-positive probability of 0.001.
[+] Optimal number of bits is 471125 (57Kb). 322 times more space efficient.
[+] Optimal number of hash functions is 10.
[+] Test: All public keys were found within the bloom filter.
[+] Test: Average number of false positives: 21, false positive rate: 0.0213. Averaged over 10 tests, 1000 random strings in each test.

Notice that the bloom filter is 322322 more space efficient than the actual length of the public keys (18.12Mb vs. 57Kb)!

If you’d like to run it yourself make sure to follow the simple installation instructions.

For completeness, here’s the code of

#!/usr/bin/env python3
'''A test script that queries vulnearble keys (debian openssl debacle)
in a space efficient manner using Bloom filters.'''

import glob
import random
from string import ascii_letters
import bloom

def random_string(size=10):
  rand_str = ''.join(([random.choice(ascii_letters) for i in range(size)]))
  return str.encode(rand_str)

def empirical_false_positive_rate(bf, nr_tests=1, nr_strings=1000):
  c = 0
  for i in range(nr_tests):
    rand_strings = [random_string(30) for i in range(nr_strings)]
    t = 0
    for r in rand_strings:
      if bf.contains(r):
        t += 1
    c += t
  avg_fpr = ((c/nr_tests)*(1/nr_strings))
  avg_errs = c/nr_tests
  return (int(avg_errs), avg_fpr)

if __name__ == '__main__':
  public_keys = set()
  total_keys_bytes= 0

  for pk_file in glob.glob('./dsa/1024/*.pub'):
    pk_base64 = open(pk_file, 'rb').read().split()[1]
    total_keys_bytes += len(pk_base64)

  n = len(public_keys)
  print('[+] Number public keys in ./dsa/1024: {}. Total length: {} bytes ({:0.2f}Mb).'.format(n, total_keys_bytes, total_keys_bytes/(2**20)))

  p = 1/1000
  print('[+] False-positive probability of {}.'.format(p))

  k, m = bloom.optimal_km(n, p)
  t = int((total_keys_bytes*8)/m)
  print('[+] Optimal number of bits is {} ({}Kb). {} times more space efficient.'.format(m, int(m/(8*(2**10))), t))
  print('[+] Optimal number of hash functions is {}.'.format(k))

  # Populating our bloom filter.
  bf = bloom.bloom_filter(m, k)
  for pk in public_keys:

  # Testing that all public keys are inside the bloom filter
  all_pks_in_bf = True
  for pk in public_keys:
    if not bf.contains(pk):
      all_pks_in_bf = False

  if all_pks_in_bf:
    print('[+] Test: All public keys were found within the bloom filter.')
    # Can't be...
    print('[-] Test: One or more public key were not found within the bloom filter.')

  # Testing the empirical false positive rate by generating random strings
  # (that are surely not in the bloom filter) and check if the bloom filter
  # contains them.
  nr_tests = 10
  nr_strings = 1000
  avg_errs, avg_fpr = empirical_false_positive_rate(bf, nr_tests, nr_strings)
  print('[+] Test: Average number of false positives: {}, false positive rate: {:0.4f}. Averaged over {} tests, {} random strings in each test.'.format(avg_errs, avg_fpr, nr_tests, nr_strings))


In their 1994 paper, Andrei Broder and Michael Mitzenmacher coined the Bloom Filter Principle:

Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

Andrei Broder and Michael Mitzenmacher

Below are some cool applications I’ve bumped into over the years.

Breaking Bitcoin (and Ethereum) Brainwallets

In DEFCON 23 (2015), Ryan Castellucci (@ryancdotorg) presented Brainflayer. A cryptocurrency brainwallet cracker that uses a Bloom filter under the hood.

I specifically abstracted away low level Bitcoin technicalities.

Bitcoin’s blockchain is made out of a sequence of blocks. Each block contains transactions and each transaction cryptographically instructs to transfer XX Bitcoins from a previous transaction to a Bitcoin address.

How does a Bitcoin address is generated?

A Bitcoin address AA is simply a hashed ECDSA public-key QQ. Lets denote A=H(Q)A = H(Q).

How does an ECDSA public key is generated?

An ECDSA public key QQ is the outcome of multiplying a private key dd with some known constant base point GG. That is, Q=d×GQ = d \times G.

How does an ECDSA private key is generated?

An ECDSA private key dd is simply an integer that is preferably generated using a cryptographically secure random number generator. Anyone that knows dd can redeem Bitcoins that were sent to AA.

What’s a Brainwallet address?

A Brainwallet is simply a Bitcoin address where its corresponding private key dd was generated using a mnemonic (!) rather then a secure random number generator. One possible Brainwallet construction looks like:

d=SHA256(MNEMONIC)d = \text{SHA256(MNEMONIC)}

For instance, the string (embedded in Bitcoin’s Genesis block):

The Times 03/Jan/2009 Chancellor on brink of second bailout for banks

When hashed with SHA256 yields the following ECDSA private key:


That yields the following Bitcoin address:


Using a blockchain explorer we can check the address’s history and see that it was used :)

Using a Bloom filter to crack Brainwallets

It is rather straight forward:

  1. Extract all Bitcoin addresses from the Blockchain.
  2. Add them to a Bloom filter BB.
  3. Generate a set WW of Bitcoin addresses using plausible mnemonics (“Brainwallets”).
  4. Gather possible candidates set CC - addresses from WW that BB returns positive on.
  5. Filter out false positives by checking which addresses in CC exist on the Blockchain.

Currently, Bitcoin’s blockchain contains about 440M440M unique addresses (according to Quandl - press on Cumulative).

Ryan’s implenetation contains a bloom filter of length 512MB512MB and 2020 hash functions (he actually uses simple functions over the original public key hashes - no heavy cryptographic hash - neat).

In the Appendix we derived an experssion for the false positive probability:

PFP=(1P0)k=(1(11m)kn)kP_{FP} = (1 - P_0)^{k} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}

Where kk is the number of hash functions, nn the length of the set and mm the length of the bloom filter.

def exact_pfp(n, m, k):
  return ( 1 - (1 - (1/m))**(k*n) )**k

m_bytes = 512*1024*1024
m_bits = m_bytes * 8
k = 20

print(exact_pfp(440*(10**6), m_bits, k)) # ~0.063
print(exact_pfp(220*(10**6), m_bits, k)) # ~0.000137
print(exact_pfp(110*(10**6), m_bits, k)) # ~1.14e-08
print(exact_pfp(80*(10**6), m_bits, k))  # ~7.16e-11

As one can tell from above, a Bloom filter of length 512MB512MB and 2020 hash functions has a high false positive rate on 440M440M items.

In 2015, when Ryan released Brainflayer, there were about 80M80M unique addresses on the blockchain and such a construction yielded a very low false postive rate of 7.16×1011\sim 7.16\times 10^{-11}.

Today, in order to maintain a similar false positive rate, one would have to either dissect the blockchain to chunks and use several Bloom filters of length 512MB512MB, or simply increase the length of the bloom filter and the number of hash functions.

I highly recommened you to go and watch Ryan’s talk.

Recommender System Optimization

Simply put, a recommender system RR is an algorithm that predicts the preferences of a user.

For instance, you’ve just bought The Hard Thing About Hard Things. Amazon’s recommender system learns that you like books about entrepreneurship and start show you other book recommendations.

But, how do you keep track of irrelevant recommendations such as books the user already read or dismissed ?

Say XX books were recommended by RR. XX is comprised of YY relevant and ZZ irrelevant recommendations (YZ=Y \cap Z = \emptyset).

Using a Bloom filter BZB_Z to store irrelevant recommendations ZZ is one possible answer.

First, say we’re dealing with a lot of irrelevant recommendations, then since Bloom filters are space efficient, it reduces the data stored per user substantially, and even bandwidth (if your logic resides in client side).

Second, since Bloom filters have zero false-negatives, we are guarunteed that a user will never be shown irrelevant recommendations (the event where eZe \in Z and BZ.contains(e)B_{Z}.contains(e) is false can’t happen).

Third, since Bloom filters may have false-positives, a relevant recommendation may not be shown to the user (the event where eYe \in Y and BZ.contains(e)B_{Z}.contains(e) is true). But that’s ok since nothing will happen if a user won’t get a relevant recommendation once every 1K visits.

I’d like to thank Sébastien Bratières for spotting an error in this clause.

Search Engine Optimizations

In his 2016 Strange Loop talk, Dan Luu (@danluu), shared some of the internals of one of Bing’s production search indices: BitFunnel (hint: they use Bloom filters, but not in a trivial way).

The talk is short and built in a very constructive way:

You might want to review BitFunnel’s SIGIR’17 paper.


While studying the math behind Bloom filters I found myself rusty and confused. Here is a guided tour for the rusty-confused engineer :)

False-Positive Probability and Formulae

Assume Bloom filter BB was populated with elements from SS.

A false-positive is an input yy that isn’t an element of SS but B[h1(y)],,B[hk(y)]B[h_1(y)], \ldots, B[h_k(y)] are all set to 11.

The false-positive probability PFPP_{FP} then is simply the probability of having any kk arbitrary cells set to 11 after Bloom filter BB is populated with elements from SS.

Lets define P1P_1 to be the probability that a certain bit is set to 11 after we populate the Bloom filter.

Given P1P_1, the probability of a false positive, PFPP_{FP} is simply:

PFP=P1kP_{FP} = P_{1}^{k}

But, how do we calculate P1P_{1} ?

Lets assume now that Bloom filter BB is empty and we begin add elements from SS.

Given x1x_1 we compute h1(x1)h_1(x_1) and set B[h1(x1)]B[h_1(x_1)] to 11. What’s P1P_1 now?

P1=1mP_1 = \frac{1}{m}

because we assume hih_i are uniform over {1,,m}\{1, \ldots, m\}.

Next, we compute h2(x1)h_2(x_1) and set B[h2(x1)]B[h_2(x_1)] to 11. What’s P1P_1 now?

P1=1m+1m1m2P_1 = \frac{1}{m} + \frac{1}{m} -\frac{1}{m^2}

Wait! Why did you subtract 1m2\frac{1}{m^2} ?

Recall that we defined P1P_1 to be the probability that a specific bit is set to 11. This specific bit might be set to 11 by either h1h_1 or h2h_2 or both. That is, we search for the probability of the union of independent events that are not mutually exclusive.

We subtract 1m2\frac{1}{m^2}, since otherwise we “count” the event that both h1h_1 and h2h_2 set the bit to 11 twice. This is due to the Inclusion-Exclusion Principle.

Doing the same thing for h3h_3. What’s P1P_1 now?

P1=3m3m2+1m3P_1 = \frac{3}{m} -\frac{3}{m^2} +\frac{1}{m^3}

As evident from above, if we continue this way we end up with a rather intricate expression for P1P_1. For this reason, most derivations of the false-positive probability use the complementary event to go around it.

Lets define P0P_0 to be the probability that a certain bit is not set to 11.

If we knew P0P_0 we could easily compute P1P_1 and PFPP_{FP}:

PFP=P1k=(1P0)kP_{FP} = P_{1}^{k} = (1 - P_0)^{k}

So, how do we calculate P0P_{0} ?

Lets start with an empty Bloom filter BB again and add elements from SS.

Given x1x_1 we compute h1(x1)h_1(x_1) and set B[h1(x1)]B[h_1(x_1)] to 11. What’s P0P_0 now?

P0=11mP_0 = 1 - \frac{1}{m}

Next, we compute h2(x1)h_2(x_1) and set B[h2(x1)]B[h_2(x_1)] to 11. What’s P0P_0 now?

P0=(11m)2P_0 = \left(1 - \frac{1}{m}\right)^2

Wait! Why does P1P_1 and P0P_0 differ at this stage of the analysis?

Because when calculating P1P1 we wanted the probability of the event that at least one of the hash functions sets a specific bit. But, for P0P_0 we want the probability that all hash functions does not set a specific bit!

That is, we search for the probability of the intersection of independent events that are not mutually exclusive.

Setting the bits for the other hash functions h3(x1),h4(x1),,hk(x1)h_3(x_1), h_4(x_1), \ldots, h_k(x_1):

P0=(11m)kP_0 = \left(1 - \frac{1}{m}\right)^k

And since we add nn elements:

P0=(11m)knP_0 = \left(1 - \frac{1}{m}\right)^{kn}


PFP=(1P0)k=(1(11m)kn)kP_{FP} = (1 - P_0)^{k} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}

Since we want to find the optimal kk and mm such that PFPP_{FP} (the false probably rate) is minimized, we will need to differentiate it.

The expression for PFPP_{FP} isn’t easy to differentiate, this is why most derivations use a neat trick that represents it with ee (Euler’s number).

One of ee’s many definitions is:

limn(1+1n)n=e\lim_{n\to\infty} \left( 1 + \frac{1}{n}\right) ^n = e

That is, for large enough nn, ee is approximately (1+1n)n\left( 1 + \frac{1}{n}\right)^n.


(11m)kn=(1+1m)mknm=((1+1m)m)knmeknm\begin{aligned} \left(1 - \frac{1}{m}\right)^{kn} &= \left(1 + \frac{1}{-m}\right)^{\frac{-mkn}{-m}}\\ &= \left(\left(1 + \frac{1}{-m}\right)^{-m}\right)^{\frac{kn}{-m}}\\ &\approx e^{\frac{kn}{-m}} \end{aligned}

Hence, PFPP_{FP} can be reduced to the following expression:

PFP(1eknm)kP_{FP} \approx \left(1 - e^{\frac{kn}{-m}}\right)^{k}

Although we got a nicer expression for PFPP_{FP}, we would like to differentiate it with respect to kk (to find the the optimal number of hash functions). But kk is the exponent.

To differentiate such equations usually one uses some ln\ln (natural logarithm) trickery.

For any function f(x,y)=g(y)xf(x,y) = g(y)^{x} (gg is simply a function of yy) it holds that:

f(x,y)=eln(f(x,y))=eln(g(y)x)=exln(g(y))\begin{aligned} f(x,y) &= e^{ln(f(x,y))}\\ &= e^{ln(g(y)^x)}\\ &= e^{x \cdot ln(g(y))} \end{aligned}

Therefore, in the case of our PFPP_{FP}:

PFP=eln((1eknm)k)=ekln((1eknm))P_{FP} = e^{ln(\left(1 - e^{\frac{kn}{-m}}\right)^{k})} = e^{k \cdot ln(\left(1 - e^{\frac{kn}{-m}}\right))}

Lets denote g=kln((1eknm))g = k \cdot ln(\left(1 - e^{\frac{kn}{-m}}\right)), which makes PFP=egP_{FP} = e^{g}.

Since ee is a monotonically increasing function minimizing ege^{g} is equivalent to minimizing gg with respect to kk.

Differentiating gg with respect to kk (WolframAlpha):

gk=ln(1eknm)+knmeknm1eknm\frac{\partial g}{\partial k} = ln \left( 1 - e^{\frac{-kn}{m}}\right) + \frac{kn}{m} \cdot \frac{e^{\frac{-kn}{m}}}{1 - e^{\frac{-kn}{m}}}

To find the optimal kk, the one that minimizes the above expression, one needs to equate it to 00

It is easy to verify that the derivative is 00 when k=ln2mnk=ln 2 \cdot \frac{m}{n}:

gk=ln(1eln2)+ln2eln21eln2=ln(112)+ln21212=ln(12)+ln2=0\begin{aligned} \frac{\partial g}{\partial k} &= ln \left( 1 - e^{-ln2}\right) + ln2 \cdot \frac{e^{-ln2}}{1 - e^{-ln2}}\\ &= ln \left( 1 - \frac{1}{2}\right) + ln2 \cdot \frac{\frac{1}{2}}{\frac{1}{2}}\\ &= ln \left(\frac{1}{2} \right) + ln2\\ &= 0 \end{aligned}

Great! Now we know that k=ln2mnk=ln 2 \cdot \frac{m}{n} is the optimal number of needed hash functions.

A very cool result of bloom filters is that kk only depends on a given PFPP_{FP}.

To achieve that result, lets plug the kk we got in equation we derived for the false positive probability:

PFP(1eknm)k=(112)ln2mn=(2)ln2mn\begin{aligned} P_{FP} &\approx \left(1 - e^{\frac{kn}{-m}} \right)^{k}\\ &= \left(1 - \frac{1}{2}\right)^{ln2 \cdot \frac{m}{n}}\\ &= \left(2\right)^{-ln2 \cdot \frac{m}{n}}\\ \end{aligned}

Taking lnln on both sides:

ln(PFP)=mn(ln2)2ln(P_{FP}) = -\frac{m}{n} \cdot (ln2)^{2}

Therefore, mm, the number of bits in our bloom filter (number of bits in BB) is:

m=nlnPFP(ln2)2m=-{\frac {n\ln P_{FP}}{(\ln 2)^{2}}}

Pluggin that mm into the kk we got yields:

k=lnPFPln2k=-{\frac {\ln P_{FP}}{\ln 2}}

And voila, kk depends only on a given false positive probability PFPP_{FP}!


Bloom filters are simple and useful beasts. They were designed in times of great scarcity in memory, but since nowadays we explode in data they fit almost perfectly for our times as well!

If you found this article interesting, make sure to read about Scalable Bloom filters which allows you to add items to an already existing Bloom filter and Counting Bloom filters which allows you to delete items from an already existing Bloom filter.

If you spot an error / have any question please let me know so others may gain :)

Comments and thoughts are also welcome on this tweet:

  1. B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422-426, 1970.

© 2023 Sagi Kedmi