# Bloom Filters for the Perplexed

29.07.2017, in { algo }

Bloom filters are one of those simple and handy tools that every engineer should have in their toolbox.

A Bloom filter is a space-efficient probabilistic data structure that represents a set and allows you to test whether an element is a member of it.

They are really simple to construct and look like this:

The cool thing is that Bloom filters achieve this space-efficiency by allowing errors, with an arbitrarily small probability!

Remarkably, a Bloom filter that represents a set of $1$ million items with an error rate of $0.01$ requires only $9585059$ bits ($\sim 1.14\ \text{MB}$), irrespective of the length of each element in $S$!

There are many kinds of Bloom filters. We’ll cover the standard Bloom filter, as it is the basis for the others and also serves as a nice introduction to more advanced probabilistic set-membership data structures such as the Cuckoo filter.

What’s on our menu for today?

All code snippets can be found on Github.

Academically inclined readers might want to review Bloom’s original $1970$ paper [1]. Strikingly, Bloom cites only $3$ references in his article!

## Standard Bloom Filters

A Bloom filter represents a set $S=\{x_1, x_2, \ldots, x_n\}$ of $n$ elements by an array of $m$ bits (initially set to $0$). Let’s call it $B$.

$$B = [b_1, b_2, b_3, b_4, \ldots, b_{m-1}, b_m]$$

A Bloom filter uses $k$ independent hash functions $h_1, \ldots, h_k$ that are assumed to be uniformly distributed over the range $\{1, \ldots, m \}$.

Standard Bloom filters have two operations:

Add($x$) - Add $x$ to the Bloom filter.

For $1 \leq i \leq k$, $B[h_i(x)]=1$.

To represent set $S$ by Bloom filter $B$ we need to Add all $x \in S$.

Contains($y$) - Check if $y$ is in the Bloom filter.

For $1 \leq i \leq k$: if $B[h_i(y)]==1$ for all $i$, return True; otherwise return False.

To make things clearer, here is a hand-drawn illustration (heavily inspired by Julia Evans (@b0rk) blog illustrations):

Figure 1: $S$ has $3$ elements ($n=3$), and is represented by a Bloom filter of length $22$ ($m=22$) with $3$ hash functions ($k=3$). Contains($w$) returns False because $B[h_{orange}(w)]==0$.

### False-Negatives and False-Positives

If Contains($y$) returns False, then $y$ is clearly not a member of $S$. This property is why we say Bloom filters have zero false-negatives: a situation where Contains($e$) is False and $e \in S$ simply can’t happen, because if $e$ had been added, all of its relevant bits would have been set to $1$.

If Contains($y$) returns True, $y$ may or may not be a member of $S$. This is why we say that Bloom filters may have false-positives, because a situation where Contains($e$) is True and $e \notin S$ can occur.

### Arbitrarily Small Probability of Error (False-Positive)

The cool thing about Bloom filters is that based on $n$ (number of elements in the set $S$) and a chosen probability of false positives $P_{FP}$ we can derive optimal $k$ (number of hash functions) and $m$ (length of bit vector $B$).

These are the formulae (we derive those in the Appendix): $$k=-{\frac {\ln P_{FP}}{\ln 2}}, \enspace m=-{\frac {n\ln P_{FP}}{(\ln 2)^{2}}}$$

A very cool result is that the optimal number of hash functions $k$ depends only on the false-positive probability $P_{FP}$.

Let’s play with it:

```python
import math

def optimal_km(n, p):
    ln2 = math.log(2)
    lnp = math.log(p)
    k = -lnp / ln2
    m = -n * lnp / (ln2 ** 2)
    return int(math.ceil(k)), int(math.ceil(m))

print(optimal_km(10**6, 0.01))  # Returns (7, 9585059)
```

Remarkably, a Bloom filter that represents a set of $1$ million items with a false-positive probability of $0.01$ requires only $9585059$ bits ($\sim 1.14\ \text{MB}$) and $7$ hash functions. Only $9.6$ bits per element ($\frac{9585059}{1000000}$)!
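We can also sanity-check numerically that $k=7$ really is optimal for these parameters. The sketch below assumes the standard approximation for the false-positive probability, $P_{FP} \approx (1 - e^{-kn/m})^k$ (the kind of expression the Appendix derivation works with), and evaluates it over a range of $k$:

```python
import math

def fp_rate(n, m, k):
    # Standard approximation of a Bloom filter's false-positive probability
    # for n items, m bits, and k hash functions: (1 - e^{-kn/m})^k.
    return (1 - math.exp(-k * n / m)) ** k

n, m = 10**6, 9585059
rates = {k: fp_rate(n, m, k) for k in range(1, 15)}
best_k = min(rates, key=rates.get)
print(best_k, round(rates[best_k], 4))  # k = 7 minimizes the rate, at ~0.01
```

Both fewer and more hash functions give a worse rate: fewer hashes set too few bits per element to discriminate, while more hashes fill the array up too quickly.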

## Toy Implementation

The first question that comes to mind is: how exactly do we get a family of hash functions $h_1, \ldots, h_k$ that are uniformly distributed?

Since our objective is learning and we are not aiming for efficiency, we can simply build our family of hash functions on top of an existing hash function, such as SHA256, as a primitive.

One such construction is the following: $$h_i(s) = \text{SHA256}(i\space||\space s) \text{ mod } m$$

Where:

• $i$ - the number of the hash function.
• $s$ - the string to be hashed.
• $m$ - the length of the bit vector.
• $||$ - string concatenation.

And with Python:

```python
from hashlib import sha256

def h_i(i, m, s):
    # i-th hash function: SHA256 of (i || s), reduced mod m
    return int(sha256(bytes([i]) + s).hexdigest(), 16) % m
```


Recall that the standard Bloom filter has two simple operations:

```python
from bitarray import bitarray

class bloom_filter:
    def __init__(self, m, k, h=h_i):
        self.m = m                    # length of bit array
        self.k = k                    # number of hash functions
        self.h = h                    # hash function
        self.bits = bitarray(self.m)  # the actual bit store
        self.bits.setall(0)           # bitarray starts uninitialized; zero it

    def add(self, s):
        for i in range(self.k):
            self.bits[self.h(i, self.m, s)] = 1

    def contains(self, s):
        for i in range(self.k):
            if self.bits[self.h(i, self.m, s)] == 0:
                return False
        return True  # all bits were set
```
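To see the pieces working together, here is a runnable end-to-end sketch. It inlines the definitions above, but swaps `bitarray` for a plain Python list of ints so the snippet has no third-party dependencies; the sample keys are, of course, made up:

```python
from hashlib import sha256
import math

def h_i(i, m, s):
    # i-th hash function: SHA256 of (i || s), reduced mod m
    return int(sha256(bytes([i]) + s).hexdigest(), 16) % m

def optimal_km(n, p):
    ln2, lnp = math.log(2), math.log(p)
    return int(math.ceil(-lnp / ln2)), int(math.ceil(-n * lnp / ln2 ** 2))

class bloom_filter:
    def __init__(self, m, k, h=h_i):
        self.m, self.k, self.h = m, k, h
        self.bits = [0] * m  # plain list instead of bitarray, for portability

    def add(self, s):
        for i in range(self.k):
            self.bits[self.h(i, self.m, s)] = 1

    def contains(self, s):
        return all(self.bits[self.h(i, self.m, s)] == 1 for i in range(self.k))

# Size the filter for 1000 expected elements at a 1% false-positive rate.
k, m = optimal_km(1000, 0.01)
bf = bloom_filter(m, k)
for word in (b"apple", b"banana", b"cherry"):
    bf.add(word)

print(bf.contains(b"banana"))  # True: it was added
print(bf.contains(b"durian"))  # False (almost surely): it was never added
```

Note that a negative answer for `b"durian"` is not guaranteed, only overwhelmingly likely: with so few elements in a filter sized for $1000$, a false positive requires all $7$ hashes to land on already-set bits.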


## Example: Efficiently Verify Compromised SSH Keys

In 2008, Luciano Bello discovered that Debian’s OpenSSL random number generator was seriously flawed and depended only on a process id integer.

Without getting into the gory details, essentially it means that between 2006 and 2008, cryptographic keys that were generated using OpenSSL on Debian-based operating systems were limited to 32768 different possible keys given their type (e.g. RSA) and size (e.g. 1024 bit).

Fortunately, g0tmi1k (@g0tmi1k), generated all possible compromised keys and uploaded them to a repo.

Say we’d like to build a system that allows users to query whether their keys were compromised.

Moreover say we’d like our system to use a minimal amount of bandwidth and/or requests to the server.

Using a Bloom filter for that purpose fits perfectly! Let’s see why:

* Note that the initial client request is just a fetch (GET) request, the actual key X isn’t sent with it.

Since Bloom filters have zero false negatives, if key $X$ isn’t in Bloom filter $B$ we know with certainty that $X$ isn’t compromised! Therefore, no further requests are needed.

The other case is that $X$ is in $B$:


So a request to verify the state of $X$ must be sent to the server as it may be either a true positive ($X$ is in $B$ and $X$ is compromised) or a false positive ($X$ is in $B$ but $X$ isn’t compromised).

The nice thing about Bloom filters is that the size of the filter depends on the false-positive probability, which is arbitrarily chosen by us!

So if we notice a lot of false positives we can just tune the Bloom filter to our needs and sacrifice some bandwidth for more accuracy.
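The client-side logic described above can be written down in a few lines. Everything in this sketch is hypothetical scaffolding (the "server" and the "filter" are toy stand-ins, not a real API), but it demonstrates the key property: a server round-trip happens only on a local positive:

```python
# Hypothetical client-side flow; names and stand-ins are illustrative only.
def is_compromised(key, bloom_contains, server_lookup):
    # Zero false negatives: a local miss means certainly not compromised,
    # so no request to the server is needed.
    if not bloom_contains(key):
        return False
    # Local hit: may be a true or a false positive, so confirm with the server.
    return server_lookup(key)

# Toy stand-ins for the downloaded filter and the verification endpoint.
compromised_keys = {b"key-42"}
server_calls = []

def fake_server(key):
    server_calls.append(key)  # track round-trips to the server
    return key in compromised_keys

# A crude "filter": one true positive (key-42) and one false positive (key-7).
fake_bloom = lambda key: key in {b"key-42", b"key-7"}

print(is_compromised(b"key-1", fake_bloom, fake_server))   # False, no server call
print(is_compromised(b"key-42", fake_bloom, fake_server))  # True, one server call
print(is_compromised(b"key-7", fake_bloom, fake_server))   # False, but cost one call
```

The last call is exactly the bandwidth cost of a false positive: one wasted round-trip, whose frequency we control by tuning $P_{FP}$.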

On GitHub there’s test_bloom.py, which populates a toy Bloom filter with the compromised keys and tests its effectiveness.

## Summary

Bloom filters are simple and useful beasts. They were designed in times of great memory scarcity, but now that data volumes are exploding, they fit our times almost perfectly as well!

If you found this article interesting, make sure to read about Scalable Bloom filters, which allow you to grow an existing Bloom filter, and Counting Bloom filters, which allow you to delete items from an existing Bloom filter.

If you spot an error / have any question please let me know so others may gain :)

Discussion on Hacker News.