
numpy vs. multiprocessing and mmap


My usual approach (if you can live with extra memory copies) is to do all I/O in one process and then send chunks out to a pool of worker processes. To load a slice of a memmapped array into memory, just do x = np.array(data[yourslice]). (Note that data[yourslice].copy() doesn't actually do this; it still returns a np.memmap instance, which can lead to some confusion.)
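As a quick sanity check, here's a minimal sketch of the difference (it assumes a file like the data.dat generated below already exists, and the types shown are what I see on the numpy versions I've used):

import numpy as np

# Assumes 'data.dat' (generated below) already exists on disk.
data = np.memmap('data.dat', dtype=np.float64, mode='r')

in_memory = np.array(data[:100])  # a plain ndarray, fully loaded into RAM
copied = data[:100].copy()        # still reports as np.memmap, confusingly

print(type(in_memory))  # <class 'numpy.ndarray'>
print(type(copied))     # <class 'numpy.memmap'>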

First off, let's generate some test data:

import numpy as np

np.random.random(10000).tofile('data.dat')

You can reproduce your errors with something like this:

import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float64, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float64)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = list(range(0, data.size, chunksize)) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield data[start:stop]

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()

And if you just switch to yielding np.array(data[start:stop]) instead, you'll fix the problem:

import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float64, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float64)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = list(range(0, data.size, chunksize)) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield np.array(data[start:stop])

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()

Of course, this does make an extra in-memory copy of each chunk.

In the long run, you'll probably find that it's easier to switch away from memmapped files and move to something like HDF. This is especially true if your data is multidimensional. (I'd recommend h5py, but PyTables is nice if your data is "table-like".)
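For what it's worth, here's a rough sketch of the same workflow with h5py (the file name data.h5 and dataset name 'data' are just placeholders I've picked for illustration):

import numpy as np
import h5py

# One-time step: convert the flat binary file into an HDF5 dataset.
raw = np.fromfile('data.dat', dtype=np.float64)
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('data', data=raw)

# Workers can then read just the slices they need; slicing an h5py
# dataset loads only the requested region, as a regular ndarray.
with h5py.File('data.h5', 'r') as f:
    chunk = f['data'][0:100]  # a plain np.ndarray, safe to pass around
    print(chunk.mean() - chunk.std())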

Good luck, at any rate!