Efficient serialization of numpy boolean arrays

I have three suggestions. My first is baldly stolen from aix. The problem is that bitarray objects are mutable, and their hashes are content-independent (i.e. for bitarray b, hash(b) == id(b)). This can be worked around, as aix's answer shows, but in fact you don't need bitarrays at all -- you can just use tostring!

In [1]: import numpy

In [2]: a = numpy.arange(25).reshape((5, 5))

In [3]: (a > 10).tostring()
Out[3]: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01'

Now we have an immutable string of bytes, perfectly suitable for use as a dictionary key. To be clear, each of those \xNN escapes is a single byte, not four characters, so this is as compact as you can get without bit-level packing of the kind bitstring does.

In [4]: len((a > 10).tostring())
Out[4]: 25

Converting back is easy and fast:

In [5]: numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
Out[5]: array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

In [6]: %timeit numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
100000 loops, best of 3: 5.75 us per loop
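For what it's worth, on current NumPy versions tostring and fromstring are deprecated; the same round trip works with their modern spellings, tobytes and frombuffer, which produce identical bytes (frombuffer gives a read-only view of the key, which is fine for this use). A minimal sketch:

```python
import numpy as np

a = np.arange(25).reshape((5, 5))

# tobytes() produces the same bytes tostring() did; frombuffer() avoids a copy
key = (a > 10).tobytes()
restored = np.frombuffer(key, dtype=bool).reshape(5, 5)
```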

Like aix, I was unable to figure out how to store dimension information in a simple way. If you must have that, then you may have to put up with longer keys. cPickle seems like a good choice though. Still, its output is 10x as big...
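If you do need the shape, one simple possibility (a sketch of my own, not from the answers above; bool_key and key_to_array are made-up names) is to prefix the raw bytes with the shape via struct, which keeps the key a single hashable bytes object at a cost of only a few extra bytes:

```python
import struct
import numpy as np

def bool_key(arr):
    # header: ndim as one byte, then each dimension as a 4-byte unsigned int
    header = struct.pack('<B%dI' % arr.ndim, arr.ndim, *arr.shape)
    return header + arr.tobytes()

def key_to_array(key):
    # read the header back, then reinterpret the remaining bytes as bools
    (ndim,) = struct.unpack_from('<B', key)
    shape = struct.unpack_from('<%dI' % ndim, key, 1)
    return np.frombuffer(key, dtype=bool, offset=1 + 4 * ndim).reshape(shape)
```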

In [7]: import cPickle

In [8]: len(cPickle.dumps(a > 10))
Out[8]: 255

It's also slower:

In [9]: cPickle.loads(cPickle.dumps(a > 10))
Out[9]: array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

In [10]: %timeit cPickle.loads(cPickle.dumps(a > 10))
10000 loops, best of 3: 45.8 us per loop

My third suggestion uses bitstrings -- specifically, bitstring.ConstBitArray. It's similar in spirit to aix's solution, but ConstBitArrays are immutable, so they do what you want, hash-wise.

In [11]: import bitstring

You have to flatten the numpy array explicitly:

In [12]: b = bitstring.ConstBitArray((a > 10).flat)

In [13]: b.bin
Out[13]: '0b0000000000011111111111111'

It's immutable so it hashes well:

In [14]: hash(b)
Out[14]: 12144

It's super-easy to convert back into an array, but again, shape information is lost.

In [15]: numpy.array(b).reshape(5, 5)
Out[15]: array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

It's also the slowest option by far:

In [16]: %timeit numpy.array(b).reshape(5, 5)
1000 loops, best of 3: 240 us per loop

Here's some more information. I kept fiddling around and testing things and came up with the following. First, bitarray is way faster than bitstring when you use it right:

In [1]: %timeit numpy.array(bitstring.ConstBitArray(a.flat)).reshape(5, 5)
1000 loops, best of 3: 283 us per loop

In [2]: %timeit numpy.array(bitarray.bitarray(a.flat)).reshape(5, 5)
10000 loops, best of 3: 19.9 us per loop

Second, as you can see from the above, all the tostring shenanigans are unnecessary; you could also just explicitly flatten the numpy array. But actually, aix's method is faster, so that's what the now-revised numbers below are based on.

So here's a full rundown of the results. First, definitions:

small_nda = numpy.arange(25).reshape(5, 5) > 10
big_nda = numpy.arange(10000).reshape(100, 100) > 5000
small_barray = bitarray.bitarray(small_nda.flat)
big_barray = bitarray.bitarray(big_nda.flat)
small_bstr = bitstring.ConstBitArray(small_nda.flat)
big_bstr = bitstring.ConstBitArray(big_nda.flat)

keysize is the result of sys.getsizeof on each key: sys.getsizeof({small|big}_nda.tostring()) for the numpy arrays, sys.getsizeof({small|big}_barray) + sys.getsizeof({small|big}_barray.tostring()) for the bitarrays, and sys.getsizeof({small|big}_bstr) + sys.getsizeof({small|big}_bstr.tobytes()) for the bitstrings. The latter two methods return bitstrings packed into bytes, so they should be good estimates of the space each representation takes.

speed is the time it takes to convert from {small|big}_nda to a key and back, plus the time it takes to convert a bitarray object into a string for hashing -- a one-time cost if you cache the string, or a cost per dict operation if you don't.

         small_nda   big_nda   small_barray   big_barray   small_bstr   big_bstr
keysize  64          10040     148            1394         100          1346
speed    2.05 us     3.15 us   3.81 us        96.3 us      277 us       92.2 ms
                               + 161 ns       + 257 ns

As you can see, bitarray is impressively fast, and aix's suggestion of a subclass of bitarray should work well. Certainly it's a lot faster than bitstring. Glad to see that you accepted that answer.

On the other hand, I still feel attached to the numpy.array.tostring() method. The keys it generates are, asymptotically, 8x as large, but the speedup you get for big arrays remains substantial -- about 30x on my machine for large arrays. It's a good tradeoff. Still, it's probably not enough to bother with until it becomes the bottleneck.


Initially, I suggested using bitarray. However, as rightly pointed out by @senderle, since bitarray is mutable, it can't be used to directly key into a dict.

Here is a revised solution (still based on bitarray internally):

import bitarray
import numpy as np

class BoolArray(object):
  # create from an ndarray
  def __init__(self, array):
    ba = bitarray.bitarray()
    ba.pack(array.tostring())
    self.arr = ba.tostring()
    self.shape = array.shape
    self.size = array.size

  # convert back to an ndarray
  def to_array(self):
    ba = bitarray.bitarray()
    ba.fromstring(self.arr)
    ret = np.fromstring(ba.unpack(), dtype=np.bool)[:self.size]
    return ret.reshape(self.shape)

  def __cmp__(self, other):
    return cmp(self.arr, other.arr)

  def __hash__(self):
    return hash(self.arr)

x = (np.random.random((2, 3, 2)) > 0.5)
b1 = BoolArray(x)
b2 = BoolArray(x)
d = {b1: 12}
d[b2] += 1
print d
print b1.to_array()

This works with Python 2.5+, requires one bit per array element and supports arrays of any shape/dimensions.

EDIT: in recent versions of bitarray, you have to replace ba.tostring and ba.fromstring with ba.tobytes and ba.frombytes (the string-based names are deprecated since version 0.4.0).


I would convert the array to a bitfield using np.packbits. This is fairly memory efficient, since it uses every bit of each byte, and the code stays relatively simple.

import numpy as np

array = np.array([True, False] * 20)
Hash = np.packbits(array).tostring()
d = {}
d[Hash] = 10
print(np.unpackbits(np.fromstring(Hash, np.uint8)).astype(np.bool)[:len(array)])

Be careful with variable-length bool arrays: the code does not distinguish between, say, an all-False array of 6 elements and one of 7, because both pack into the same byte. For multidimensional arrays you will need some reshaping.
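One way around both caveats (a sketch of mine; packed_key and unpack_key are hypothetical names, not part of numpy) is to carry the shape alongside the packed bytes in a tuple, which is still hashable and lets you restore the array exactly:

```python
import numpy as np

def packed_key(arr):
    # a (shape, bytes) tuple is hashable and keeps 6- and 7-element arrays distinct
    return (arr.shape, np.packbits(arr.ravel().astype(np.uint8)).tobytes())

def unpack_key(key):
    # unpack all the bits, then trim the padding and restore the shape
    shape, data = key
    n = int(np.prod(shape))
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))[:n]
    return bits.astype(bool).reshape(shape)
```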

If this is still not efficient enough and your arrays are large, you might be able to reduce the key size further by compressing:

import bz2

Hash_compressed = bz2.compress(Hash, 1)

This won't help for random, incompressible data, though.
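As a quick sanity check of the compress/decompress round trip (stdlib bz2 only; the all-zero input below just stands in for an all-False array packed to bytes):

```python
import bz2

packed = b'\x00' * 1000               # e.g. an all-False array, packed to bytes
compressed = bz2.compress(packed, 1)  # compresslevel 1: fastest
recovered = bz2.decompress(compressed)
```

Highly repetitive bitfields like this one shrink dramatically; random bits will not.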