Bloomers

Since I thought a Bloom filter might be to blame in Mac OS X’s spellchecker, I made a little Processing app to demonstrate how exactly a Bloom filter works.

First, a brief description of what Bloom filters are. They’re a data structure that relies on hash functions for representing a set. The actual structure is a bunch of bits and a few hash functions. To add an element to a Bloom filter, compute the hashes for each hash function, and set each bit high. To test if an element has been added to the Bloom filter, compute the hashes and return true if all the bits are already high. A quick overview of what this means:

Pros:

You don’t store any elements in the Bloom filter—just the elements’ hashes. Very compact.
Lookup and adding to the set are extremely fast.
Union and intersection of Bloom filter sets are just bitwise operators.

Cons:

False positives are possible. (Since we don’t store the actual elements, we might see all the hashed bits as high when it was actually one or more previously added elements that made them high.)
There’s no way to remove an element from the set. (We don’t know if another element has hashed to the bits we want to set low.)

Without further ado, here’s the app. (It’s a little rough around the edges, and, yes, it’s a Java applet—Processing.js has a few bugs and is slow as balls for this type of application, apparently.) This is a Bloom filter with 625 bits and 10 hash functions. (More implementation details below.) Have fun—ask questions or leave feedback in the comments!

Directions:

Give the applet focus by clicking on the dots.
Start typing into it and you’ll see the bits your string is hashing to in real time.
Hit enter to add the string and you’ll see the high bits turn red. Repeat!

For those curious, to get my ten hash functions, I’m double hashing two string hash functions I found: djb2 and sdbm.

For the even more curious, here’s the Processing source.

Update: Added more explicit directions.

Categories:

No TrackBacks

2 Comments

Scavenge

Syndication

Recent Trash

Recent ♻