
Sagi Kedmi

Bloom Filters for the Perplexed

29.07.2017, in { algo }

Bloom filters are one of those simple and handy engineering tools that any engineer should have in their toolbox.

A Bloom filter is a space-efficient probabilistic data structure that represents a set and allows you to test whether an element is a member of it.

They are really simple to construct and look like this:

Inserting to and Querying a Bloom Filter

The cool thing is that to achieve such space efficiency, Bloom filters allow for errors with an arbitrarily small probability!

Remarkably, a Bloom filter that represents a set of $1$ million items with an error rate of $0.01$ requires only $9585059$ bits ($\approx 1.14\,\text{MB}$), irrespective of the length of each element in $S$!

There are many kinds of Bloom filters. We’ll cover the Standard Bloom Filter, as it is the basis for the others and also serves as a nice intro to more advanced probabilistic set-membership data structures such as the Cuckoo filter.

What’s on our menu for today?

All code snippets can be found on GitHub.

Academic fellows might want to review Bloom’s original $1970$ paper [1]. Strikingly, Bloom’s article has only $3$ references!

Standard Bloom Filters

A Bloom filter represents a set $S=\{x_1, x_2, \ldots, x_n\}$ of $n$ elements by an array of $m$ bits (initially all set to $0$). Let’s call it $B$.

$$B = [b_1, b_2, b_3, b_4, \ldots, b_{m-1}, b_m] $$

A Bloom filter uses $k$ independent hash functions $h_1, \ldots, h_k$ that are assumed to be uniformly distributed over the range $\{1, \ldots, m \}$.

Standard Bloom filters have two operations:

Add($x$) - Add $x$ to the Bloom filter.

For $1 \leq i \leq k$, $B[h_i(x)]=1$.

To represent set $S$ by Bloom filter $B$ we need to Add all $x \in S$.

Contains($y$) - Check if $y$ is in the Bloom filter.

For $1 \leq i \leq k$, if all $B[h_i(y)]==1$ return True, otherwise return False.

To make things clearer, here is a hand-drawn illustration (heavily inspired by Julia Evans’s (@b0rk) blog illustrations):

Inserting to and Querying a Bloom Filter

Figure 1: $S$ has $3$ elements ($n=3$), and is represented by a Bloom filter of length $22$ ($m=22$) with $3$ hash functions ($k=3$). Contains($w$) returns False because $B[h_{orange}(w)]==0$.

False-Negatives and False-Positives

If Contains($y$) returns False, clearly $y$ is not a member of $S$. This property is why we say Bloom filters have zero false-negatives: a situation where Contains($e$) is False and $e \in S$ simply can’t happen, because if $e$ had been added, all the relevant bits would have been set to $1$.

If Contains($y$) returns True, $y$ may or may not be a member of $S$. This is why we say that Bloom filters may have false-positives: a situation where Contains($e$) is True and $e \notin S$ can occur.

Arbitrarily Small Probability of False-Positive

The cool thing about Bloom filters is that based on $n$ (number of elements in the set $S$) and a chosen probability of false positives $P_{FP}$ we can derive optimal $k$ (number of hash functions) and $m$ (length of bit vector $B$).

These are the formulae (we derive those in the Appendix): $$ k=-{\frac {\ln P_{FP}}{\ln 2}}, \enspace m=-{\frac {n\ln P_{FP}}{(\ln 2)^{2}}} $$

A very cool result is that the optimal number of hash functions $k$ depends only on the false-positive probability $P_{FP}$.

Let’s play with it:

import math

def optimal_km(n, p):
  ln2 = math.log(2)
  lnp = math.log(p)
  k = -lnp/ln2              # optimal number of hash functions
  m = -n*lnp/((ln2)**2)     # optimal number of bits
  return int(math.ceil(k)), int(math.ceil(m))

print(optimal_km(10**6, 0.01)) # Returns (7, 9585059)
Remarkably, a Bloom filter that represents a set of $1$ million items with a false-positive probability of $0.01$ requires only $9585059$ bits ($\approx 1.14\,\text{MB}$) and $7$ hash functions. Only $9.6$ bits per element! ($\frac{9585059}{1000000}$).

Toy Implementation

The first thing that comes to mind is: how exactly do we get a family of hash functions $h_1, \ldots, h_k$ that are uniformly distributed?

Since our objective is learning and we are not aiming for efficiency, we can simply build our family of hash functions on top of an existing hash function as a primitive, such as SHA256.

One such construction is the following: $$ h_i(s) = \text{SHA256}(i\space||\space s) \text{ mod } m $$

Where:

  • $i$ - the number of the hash function.
  • $s$ - the string to be hashed.
  • $m$ - the length of the bit vector.
  • $||$ - string concatenation.

And with Python:

from hashlib import sha256

def h_i(i, m, s):
  # Hash the single byte i concatenated with s, then reduce modulo m.
  return int(sha256(bytes([i]) + s).hexdigest(), 16) % m
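
For instance, each $i$ yields a (typically different) index, and every index lands in $\{0, \ldots, m-1\}$:

for i in range(3):
  print(h_i(i, 22, b"coffee")) # three indices, each in range(22)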

Recall that the standard Bloom filter has two simple operations:

from bitarray import bitarray

class bloom_filter:
  def __init__(self, m, k, h=h_i):
    self.m = m # size of bit vector
    self.k = k # number of hash functions
    self.h = h # hash function
    self.bits = bitarray(self.m) # the actual bit store
    self.bits.setall(0) # bitarray memory is uninitialized by default

  def add(self, s):
    for i in range(self.k):
      self.bits[self.h(i, self.m, s)] = 1

  def contains(self, s):
    for i in range(self.k):
      if self.bits[self.h(i, self.m, s)] == 0:
        return False
    return True # all bits were set

Example: “Did Trump Tweet That”
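
Here’s a minimal sketch of such an example, reusing optimal_km and bloom_filter from above (the tweets are just sample strings):

k, m = optimal_km(2, 0.01) # 2 tweets, 1% false-positive target

tweets = bloom_filter(m, k)
tweets.add(b"Despite the constant negative press covfefe")
tweets.add(b"MAKE AMERICA GREAT AGAIN!")

print(tweets.contains(b"MAKE AMERICA GREAT AGAIN!")) # True
print(tweets.contains(b"Bloom filters are tremendous!")) # False (except with probability ~0.01)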

Applications

In their 2004 survey, Andrei Broder and Michael Mitzenmacher coined the Bloom Filter Principle:

Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

Andrei Broder and Michael Mitzenmacher

Below are some cool applications I’ve bumped into over the years.

Breaking Bitcoin (and Ethereum) Brainwallets

At DEFCON 23 (2015), Ryan Castellucci (@ryancdotorg) presented Brainflayer, a cryptocurrency brainwallet cracker that uses a Bloom filter under the hood.

I’ve deliberately abstracted away low-level Bitcoin technicalities.

Bitcoin’s blockchain is made up of a sequence of blocks. Each block contains transactions, and each transaction cryptographically instructs the transfer of $X$ Bitcoins from a previous transaction to a Bitcoin address.

How is a Bitcoin address generated?

A Bitcoin address $A$ is simply a hashed ECDSA public key $Q$. Let’s denote $A = H(Q)$.

How is an ECDSA public key generated?

An ECDSA public key $Q$ is the outcome of multiplying a private key $d$ with some known constant base point $G$. That is, $Q = d \times G$.

How is an ECDSA private key generated?

An ECDSA private key $d$ is simply an integer that is preferably generated using a cryptographically secure random number generator. Anyone that knows $d$ can redeem Bitcoins that were sent to $A$.

What’s a Brainwallet address?

A Brainwallet is simply a Bitcoin address whose corresponding private key $d$ was generated using a mnemonic (!) rather than a secure random number generator. One possible Brainwallet construction looks like: $$ d = \text{SHA256(MNEMONIC)} $$

For instance, the string (embedded in Bitcoin’s Genesis block):

The Times 03/Jan/2009 Chancellor on brink of second bailout for banks

When hashed with SHA256, it yields the following ECDSA private key:

a6d72baa3db900b03e70df880e503e9164013b4d9a470853edc115776323a098

That yields the following Bitcoin address:

1Nbm3JoDpwS4HRw9WmHaKGAzaeSKXoQ6Ej

Using a blockchain explorer we can check the address’s history and see that it has been used :)
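
We can reproduce the private-key step in a couple of lines (deriving the address itself additionally requires ECDSA point multiplication and Base58Check encoding, which we leave out):

from hashlib import sha256

mnemonic = b"The Times 03/Jan/2009 Chancellor on brink of second bailout for banks"
d = sha256(mnemonic).hexdigest() # the brainwallet's ECDSA private key
print(d) # a6d72baa3db900b03e70df880e503e9164013b4d9a470853edc115776323a098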

Using a Bloom filter to crack Brainwallets

It is rather straightforward:

  1. Extract all Bitcoin addresses from the Blockchain.
  2. Add them to a Bloom filter $B$.
  3. Generate a set $W$ of Bitcoin addresses using plausible mnemonics (“Brainwallets”).
  4. Gather possible candidates set $C$ - addresses from $W$ that $B$ returns positive on.
  5. Check offline that addresses in $C$ exist on the Blockchain.
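
Here is a toy sketch of this pipeline, reusing the bloom_filter class from before. mnemonic_to_address is a hypothetical stand-in for the real SHA256 → ECDSA → address derivation:

def mnemonic_to_address(mnemonic):
  # Hypothetical stub: SHA256 -> ECDSA public key -> hashed address.
  raise NotImplementedError

def crack(blockchain_addresses, mnemonics, m, k):
  # Step 2: represent all on-chain addresses with a Bloom filter B.
  B = bloom_filter(m, k)
  for addr in blockchain_addresses:
    B.add(addr)
  # Steps 3+4: derive an address per candidate mnemonic, keep positives.
  candidates = []
  for mnemonic in mnemonics:
    addr = mnemonic_to_address(mnemonic)
    if B.contains(addr):
      candidates.append((mnemonic, addr))
  # Step 5: candidates may still contain false positives, so each one
  # must be verified offline against the actual blockchain.
  return candidates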

Currently, Bitcoin’s blockchain contains about $440M$ unique addresses (according to Quandl - click on Cumulative).

Ryan’s implementation contains a Bloom filter of length $512MB$ and $20$ hash functions (he actually uses simple functions over the original public-key hashes - no heavy cryptographic hash - neat).

In the Appendix we derive an expression for the false-positive probability:

$$ P_{FP} = (1 - P_0)^{k} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} $$

Where $k$ is the number of hash functions, $n$ the number of elements in the set, and $m$ the length of the Bloom filter in bits.

def exact_pfp(n, m, k):
  # Exact false-positive probability of an m-bit filter with k hash
  # functions after n insertions.
  return ( 1 - (1 - (1/m))**(k*n) )**k

m = 512*(2**20)
k = 20

print(exact_pfp(440*(10**6), m, k)) # ~0.99
print(exact_pfp(100*(10**6), m, k)) # ~0.61
print(exact_pfp(40*(10**6), m, k))  # ~0.006
print(exact_pfp(20*(10**6), m, k))  # ~0.0000025

As one can tell from above, a Bloom filter of length $512MB$ and $20$ hash functions has a very high false positive rate on $440M$ items.

I think the more plausible usage is to split the addresses into chunks of $20M$ and represent each chunk with its own Bloom filter - $22$ Bloom filters of length $512MB$ in total.

I highly recommend you go and watch Ryan’s talk.

Avoid showing the same results in Recommender Systems

Search Engine Optimizations

Papers:

  • Less hashing, same performance: Building a better Bloom filter.
  • Scalable Bloom Filters
  • Counting Bloom Filters

Appendix

While studying the math behind Bloom filters I found myself rusty and confused. Here is a guided tour for the rusty and confused engineer :)

False-Positive Probability and Formulae

Assume Bloom filter $B$ was populated with elements from $S$.

A false-positive is an input $y$ that isn’t an element of $S$ but $B[h_1(y)], \ldots, B[h_k(y)]$ are all set to $1$.

The false-positive probability $P_{FP}$ is then simply the probability that $k$ arbitrary cells are all set to $1$ after Bloom filter $B$ is populated with the elements of $S$.

Let’s define $P_1$ to be the probability that a certain bit is set to $1$ after we populate the Bloom filter.

Given $P_1$, the probability of a false positive, $P_{FP}$ is simply:

$$ P_{FP} = P_{1}^{k} $$

But, how do we calculate $P_{1}$ ?

Let’s assume now that Bloom filter $B$ is empty and we begin adding elements from $S$.

Given $x_1$ we compute $h_1(x_1)$ and set $B[h_1(x_1)]$ to $1$. What’s $P_1$ now?

$$ P_1 = \frac{1}{m} $$

because we assume $h_i$ are uniform over $\{1, \ldots, m\}$.

Next, we compute $h_2(x_1)$ and set $B[h_2(x_1)]$ to $1$. What’s $P_1$ now? $$ P_1 = \frac{1}{m} + \frac{1}{m} -\frac{1}{m^2} $$

Wait! Why did you subtract $\frac{1}{m^2}$?

Recall that we defined $P_1$ to be the probability that a specific bit is set to $1$. This specific bit might be set to $1$ by either $h_1$ or $h_2$ or both. That is, we search for the probability of the union of independent events that are not mutually exclusive.

We subtract $\frac{1}{m^2}$, since otherwise we would “count” the event that both $h_1$ and $h_2$ set the bit to $1$ twice. This is due to the Inclusion-Exclusion Principle.

Doing the same for $h_3$, what’s $P_1$ now? $$ P_1 = \frac{3}{m} -\frac{3}{m^2} +\frac{1}{m^3} $$
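
(As a sanity check, this equals $1 - \left(1 - \frac{1}{m}\right)^3$, the complementary form we are about to derive:)

m = 22
direct = 3/m - 3/m**2 + 1/m**3
complement = 1 - (1 - 1/m)**3
print(direct, complement) # both ~0.13026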

As evident from above, if we continue this way we end up with a rather intricate expression for $P_1$. For this reason, most derivations of the false-positive probability use the complementary event to get around it.

Let’s define $P_0$ to be the probability that a certain bit is not set to $1$.

If we knew $P_0$ we could easily compute $P_1$ and $P_{FP}$:

$$ P_{FP} = P_{1}^{k} = (1 - P_0)^{k} $$

So, how do we calculate $P_{0}$ ?

Let’s start with an empty Bloom filter $B$ again and add elements from $S$.

Given $x_1$ we compute $h_1(x_1)$ and set $B[h_1(x_1)]$ to $1$. What’s $P_0$ now? $$ P_0 = 1 - \frac{1}{m} $$

Next, we compute $h_2(x_1)$ and set $B[h_2(x_1)]$ to $1$. What’s $P_0$ now? $$ P_0 = \left(1 - \frac{1}{m}\right)^2 $$

Wait! Why do $P_1$ and $P_0$ differ at this stage of the analysis?

Because when calculating $P_1$ we wanted the probability that at least one of the hash functions sets a specific bit. But for $P_0$ we want the probability that none of the hash functions sets a specific bit!

That is, we search for the probability of the intersection of independent events that are not mutually exclusive.

Setting the bits for the other hash functions $h_3(x_1), h_4(x_1), \ldots, h_k(x_1)$: $$ P_0 = \left(1 - \frac{1}{m}\right)^k $$

And since we add $n$ elements: $$ P_0 = \left(1 - \frac{1}{m}\right)^{kn} $$

Therefore: $$ P_{FP} = (1 - P_0)^{k} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} $$

Since we want to find the optimal $k$ and $m$ such that $P_{FP}$ (the false-positive probability) is minimized, we will need to differentiate it.

The expression for $P_{FP}$ isn’t easy to differentiate, which is why most derivations use a neat trick to rewrite it with $e$ (Euler’s number).

One of $e$’s many definitions is: $$ \lim_{n\to\infty} \left( 1 + \frac{1}{n}\right)^n = e $$

That is, for large enough $n$, $\left( 1 + \frac{1}{n}\right)^n$ is approximately $e$.

Therefore,

$$ \begin{aligned} \left(1 - \frac{1}{m}\right)^{kn} &= \left(1 + \frac{1}{-m}\right)^{\frac{-mkn}{-m}}\\ &= \left(\left(1 + \frac{1}{-m}\right)^{-m}\right)^{\frac{kn}{-m}}\\ &\approx e^{\frac{kn}{-m}} \end{aligned} $$

Hence, $P_{FP}$ can be reduced to the following expression: $$ P_{FP} \approx \left(1 - e^{\frac{kn}{-m}}\right)^{k} $$
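
We can sanity-check this approximation numerically; for the parameters we computed earlier, the exact and approximate expressions agree to several decimal places:

import math

n, m, k = 10**6, 9585059, 7
exact = (1 - (1 - 1/m)**(k*n))**k
approx = (1 - math.exp(-k*n/m))**k
print(exact, approx) # both ~0.0100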

Although we got a nicer expression for $P_{FP}$, we would like to differentiate it with respect to $k$ (to find the optimal number of hash functions). But $k$ is in the exponent.

To differentiate such equations usually one uses some $\ln$ (natural logarithm) trickery.

For any function $f(x,y) = g(y)^{x}$ ($g$ is simply a function of $y$) it holds that:

$$ \begin{aligned} f(x,y) &= e^{\ln(f(x,y))}\\ &= e^{\ln(g(y)^x)}\\ &= e^{x \cdot \ln(g(y))} \end{aligned} $$

Therefore, in the case of our $P_{FP}$: $$ P_{FP} = e^{\ln\left(\left(1 - e^{\frac{kn}{-m}}\right)^{k}\right)} = e^{k \cdot \ln\left(1 - e^{\frac{kn}{-m}}\right)} $$ Let’s denote $g = k \cdot \ln\left(1 - e^{\frac{kn}{-m}}\right)$, which makes $P_{FP} = e^{g}$.

Since $e^x$ is a monotonically increasing function, minimizing $e^{g}$ is equivalent to minimizing $g$ with respect to $k$.

Differentiating $g$ with respect to $k$ (WolframAlpha): $$ \frac{\partial g}{\partial k} = \ln\left( 1 - e^{\frac{-kn}{m}}\right) + \frac{kn}{m} \cdot \frac{e^{\frac{-kn}{m}}}{1 - e^{\frac{-kn}{m}}} $$

To find the optimal $k$ - the one that minimizes $g$ - we equate the derivative to $0$.

It is easy to verify that the derivative is $0$ when $k=\ln 2 \cdot \frac{m}{n}$:

$$ \begin{aligned} \frac{\partial g}{\partial k} &= \ln\left( 1 - e^{-\ln 2}\right) + \ln 2 \cdot \frac{e^{-\ln 2}}{1 - e^{-\ln 2}}\\ &= \ln\left( 1 - \frac{1}{2}\right) + \ln 2 \cdot \frac{\frac{1}{2}}{\frac{1}{2}}\\ &= \ln\left(\frac{1}{2} \right) + \ln 2\\ &= 0 \end{aligned} $$

Great! Now we know that $k=\ln 2 \cdot \frac{m}{n}$ is the optimal number of hash functions.
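
A quick numeric check that $g$ (and hence $P_{FP} = e^g$) is indeed minimized at $k = \ln 2 \cdot \frac{m}{n}$:

import math

n, m = 10**6, 9585059
k_opt = math.log(2) * m / n # ~6.64

def g(k):
  return k * math.log(1 - math.exp(-k*n/m))

for k in (k_opt - 1, k_opt, k_opt + 1):
  print(k, g(k)) # g is smallest (most negative) at k_opt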

A very cool result of Bloom filters is that the optimal $k$ depends only on a given $P_{FP}$.

To get that result, let’s plug the $k$ we got into the equation we derived for the false-positive probability:

$$ \begin{aligned} P_{FP} &\approx \left(1 - e^{\frac{kn}{-m}} \right)^{k}\\ &= \left(1 - \frac{1}{2}\right)^{\ln 2 \cdot \frac{m}{n}}\\ &= 2^{-\ln 2 \cdot \frac{m}{n}}\\ \end{aligned} $$

Taking $\ln$ on both sides: $$ \ln P_{FP} = -\frac{m}{n} \cdot (\ln 2)^{2} $$

Therefore $m$, the number of bits in our Bloom filter $B$, is:

$$ m=-{\frac {n\ln P_{FP}}{(\ln 2)^{2}}} $$

Plugging that $m$ into the $k$ we got yields: $$ k=-{\frac {\ln P_{FP}}{\ln 2}} $$

And voilà, $k$ depends only on a given false-positive probability $P_{FP}$!
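
Putting it all together, we can pick $m$ and $k$ for a target $P_{FP}$ and confirm that the resulting filter achieves (roughly) that rate:

import math

n, p = 10**6, 0.01
k = int(math.ceil(-math.log(p) / math.log(2)))        # 7 hash functions
m = int(math.ceil(-n * math.log(p) / math.log(2)**2)) # 9585059 bits
achieved = (1 - (1 - 1/m)**(k*n))**k
print(k, m, achieved) # 7 9585059 ~0.0100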

Summary

Had I known beforehand the amount of work needed to make this content accessible, I would never have started! Nevertheless, I had fun. I hope you did too.


  1. B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422-426, 1970.