Modern, high performance bloom filter in Python? [closed] Modern, high performance bloom filter in Python? [closed] python python

Modern, high performance bloom filter in Python? [closed]


I recently went down this path as well; though it sounds like my application was slightly different. I was interested in approximating set operations on a large number of strings.

You do make the key observation that a fast bit vector is required. Depending on what you want to put in your bloom filter, you may also need to give some thought to the speed of the hashing algorithm(s) used. You might find this library useful. You may also want to tinker with the random number technique used below that only hashes your key a single time.

In terms of non-Java bit array implementations:

I built my bloom filter using BitVector. I spent some time profiling and optimizing the library and contributing back my patches to Avi. Go to that BitVector link and scroll down to acknowledgments in v1.5 to see details. In the end, I realized that performance was not a goal of this project and decided against using it.

Here's some code I had lying around. I may put this up on google code at python-bloom. Suggestions welcome.

from BitVector import BitVectorfrom random import Random# get hashes from http://www.partow.net/programming/hashfunctions/index.htmlfrom hashes import RSHash, JSHash, PJWHash, ELFHash, DJBHash## ryan.a.cox@gmail.com / www.asciiarmor.com## copyright (c) 2008, ryan cox# all rights reserved # BSD license: http://www.opensource.org/licenses/bsd-license.php#class BloomFilter(object):    def __init__(self, n=None, m=None, k=None, p=None, bits=None ):        self.m = m        if k > 4 or k < 1:            raise Exception('Must specify value of k between 1 and 4')        self.k = k        if bits:            self.bits = bits        else:            self.bits = BitVector( size=m )        self.rand = Random()        self.hashes = []        self.hashes.append(RSHash)        self.hashes.append(JSHash)        self.hashes.append(PJWHash)        self.hashes.append(DJBHash)        # switch between hashing techniques        self._indexes = self._rand_indexes        #self._indexes = self._hash_indexes    def __contains__(self, key):        for i in self._indexes(key):             if not self.bits[i]:                return False            return True     def add(self, key):        dupe = True         bits = []        for i in self._indexes(key):             if dupe and not self.bits[i]:                dupe = False            self.bits[i] = 1            bits.append(i)        return dupe    def __and__(self, filter):        if (self.k != filter.k) or (self.m != filter.m):             raise Exception('Must use bloom filters created with equal k / m paramters for bitwise AND')        return BloomFilter(m=self.m,k=self.k,bits=(self.bits & filter.bits))    def __or__(self, filter):        if (self.k != filter.k) or (self.m != filter.m):             raise Exception('Must use bloom filters created with equal k / m paramters for bitwise OR')        return BloomFilter(m=self.m,k=self.k,bits=(self.bits | filter.bits))    def _hash_indexes(self,key):        ret = []        for i in range(self.k):            ret.append(self.hashes[i](key) % self.m)        return ret    def _rand_indexes(self,key):        self.rand.seed(hash(key))        ret = []        for i in range(self.k):            ret.append(self.rand.randint(0,self.m-1))        return retif __name__ == '__main__':    e = BloomFilter(m=100, k=4)    e.add('one')    e.add('two')    e.add('three')    e.add('four')    e.add('five')            f = BloomFilter(m=100, k=4)    f.add('three')    f.add('four')    f.add('five')    f.add('six')    f.add('seven')    f.add('eight')    f.add('nine')    f.add("ten")            # test check for dupe on add    assert not f.add('eleven')     assert f.add('eleven')     # test membership operations    assert 'ten' in f     assert 'one' in e     assert 'ten' not in e     assert 'one' not in f             # test set based operations    union = f | e    intersection = f & e    assert 'ten' in union    assert 'one' in union     assert 'three' in intersection    assert 'ten' not in intersection    assert 'one' not in intersection

Also, in my case I found it useful to have a faster count_bits function for BitVector. Drop this code into BitVector 1.5 and it should give you a more performant bit counting method:

def fast_count_bits( self, v ):    bits = (            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,            4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 )    return bits[v & 0xff] + bits[(v >> 8) & 0xff] + bits[(v >> 16) & 0xff] + bits[v >> 24]


In reaction to Parand, saying "common practice seems to be using something like SHA1 and split up the bits to form multiple hashes", while that may be true in the sense that it's common practice (PyBloom also uses it), it still doesn't mean it's the right thing to do ;-)

For a Bloom filter, the only requirement a hash function has is that its output space must be uniformly distributed given the expected input. While a cryptographic hash certainly fulfils this requirement, it's also a little bit like shooting a fly with a bazooka.

Instead try the FNV Hash which uses just one XOR and one multiplication per input byte, which I estimate is a few hundred times faster than SHA1 :)

The FNV hash is not cryptographically secure, but you don't need it to be. It has slightly imperfect avalanche behaviour, but you're not using it for integrity checking either.

About uniformity, note that the second link only did a Chi-square test for the 32-bit FNV hash. It's better to use more bits and the FNV-1 variant, which swaps the XOR and the MUL steps for better bit-dispersion. For a Bloom Filter, there's a few more catches, such as mapping the output uniformly to the index range of the bit-array. If possible, I'd say round up the size of the bit-array to the nearest power of 2 and adjust k accordingly. That way you get better accuracy and you can use simple XOR-folding to map the range.

Additionally, here's a reference explaining why you don't want SHA1 (or any cryptographic hash) when you need a general purpose hash.


Eventually I found pybloomfiltermap. I haven't used it, but it looks like it'd fit the bill.