
numpy vs. multiprocessing and mmap


My usual approach (if you can live with extra memory copies) is to do all I/O in one process and then send chunks out to a pool of worker processes. To load a slice of a memmapped array into memory, just do x = np.array(data[yourslice]). (Note that data[yourslice].copy() doesn't actually do this; it still returns a np.memmap instance, which can lead to some confusion.)
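As a quick sanity check, here's a minimal sketch of the difference (it assumes a file like the data.dat generated below already exists, and the types shown are what I see on the numpy versions I've used):

import numpy as np

# Assumes 'data.dat' (generated below) already exists on disk.
data = np.memmap('data.dat', dtype=np.float64, mode='r')

in_memory = np.array(data[:100])  # a plain ndarray, fully loaded into RAM
copied = data[:100].copy()        # still reports as np.memmap, confusingly

print(type(in_memory))  # <class 'numpy.ndarray'>
print(type(copied))     # <class 'numpy.memmap'>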

First off, let's generate some test data:

import numpy as np

np.random.random(10000).tofile('data.dat')

You can reproduce your errors with something like this:

import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float64, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float64)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = list(range(0, data.size, chunksize)) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield data[start:stop]

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()

And if you just switch to yielding np.array(data[start:stop]) instead, you'll fix the problem:

import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float64, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float64)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = list(range(0, data.size, chunksize)) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield np.array(data[start:stop])

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()

Of course, this does make an extra in-memory copy of each chunk.

In the long run, you'll probably find that it's easier to switch away from memmapped files and move to something like HDF. This is especially true if your data is multidimensional. (I'd recommend h5py, but PyTables is nice if your data is "table-like".)
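For what it's worth, here's a rough sketch of the same workflow with h5py (the file name data.h5 and dataset name 'data' are just placeholders I've picked for illustration):

import numpy as np
import h5py

# One-time step: convert the flat binary file into an HDF5 dataset.
raw = np.fromfile('data.dat', dtype=np.float64)
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('data', data=raw)

# Workers can then read just the slices they need; slicing an h5py
# dataset loads only the requested region, as a regular ndarray.
with h5py.File('data.h5', 'r') as f:
    chunk = f['data'][0:100]  # a plain np.ndarray, safe to pass around
    print(chunk.mean() - chunk.std())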

Good luck, at any rate!