passing numpy arrays through multiprocessing.Queue

The issue isn't with numpy, but with pickle's default settings: the default protocol represents data as printable ASCII, so the output is human readable but bulky. You can tell pickle to use a binary protocol instead by passing the protocol argument (protocol=-1 selects the highest protocol available).

import numpy
import cPickle as pickle

N = 1000
a0 = pickle.dumps(numpy.zeros(N))
a1 = pickle.dumps(numpy.zeros(N), protocol=-1)
print "a0", len(a0)   # 32155
print "a1", len(a1)   #  8133
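
As a quick sanity check, here's a minimal sketch (not from the original answer) showing that protocol=-1 is just shorthand for pickle.HIGHEST_PROTOCOL and that the array round-trips intact:

import numpy
import cPickle as pickle

a = numpy.arange(1000, dtype=numpy.float64)
# protocol=-1 selects the highest protocol the module supports
data = pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL)
b = pickle.loads(data)
assert (a == b).all()   # the array survives the round trip unchanged
print len(data)         # binary protocol: much smaller than protocol 0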

Also note that if you want to reduce processor work and time, you should probably use cPickle instead of pickle (but the space savings from the binary protocol apply regardless of which pickle module you use).
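
A rough way to see the speed difference for yourself is a small timing sketch like the one below (an illustration only, assuming Python 2, where pickle and cPickle are separate modules):

import timeit

setup = "import numpy; import %s as p; a = numpy.zeros(100000)"
for mod in ("pickle", "cPickle"):
    # time 100 binary-protocol dumps of the same array with each module
    t = timeit.timeit("p.dumps(a, protocol=-1)", setup=setup % mod, number=100)
    print mod, t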

On shared memory:
On the question of shared memory, there are a few things to consider. Shared data typically adds a significant amount of complexity to code: for every line of code that uses that data, you have to worry about whether some other line in another process is simultaneously using it. How hard that is depends on what you're doing. The advantage is that you save the time spent sending the data back and forth. The question that Eelco cites is for a 60GB array, and for that there's really no choice: it has to be shared. On the other hand, for most reasonably complex code, deciding to share data simply to save a few microseconds or bytes would probably be one of the worst premature optimizations one could make.
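
If you do decide sharing is worth it, here is a minimal sketch (not from the original answer) of the usual pattern: allocate a multiprocessing.Array and wrap the same buffer as a numpy array in each process. It assumes a fixed-size float64 array; for truly read-only data you could use a RawArray and skip the lock.

import multiprocessing
import numpy

def worker(shared, shape):
    # Re-wrap the shared buffer as a numpy array; no data is copied.
    a = numpy.frombuffer(shared.get_obj()).reshape(shape)
    print "worker sees", a[0], a[-1]

if __name__ == '__main__':
    shape = (1000,)
    shared = multiprocessing.Array('d', shape[0])   # 'd' = C double = float64
    a = numpy.frombuffer(shared.get_obj()).reshape(shape)
    a[:] = numpy.arange(shape[0])                   # fill it in the parent
    p = multiprocessing.Process(target=worker, args=(shared, shape))
    p.start()
    p.join()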


Related question: Share Large, Read-Only Numpy Array Between Multiprocessing Processes

That should cover it all. Pickling of incompressible binary data is a pain regardless of the protocol used, so this solution is much to be preferred.