
multiprocess or multithread? - parallelizing a simple computation for millions of iterations and storing the result in a single data structure


First option - a Server Process

Create a server process. It is part of the multiprocessing package and allows parallel access to shared data structures. Each process then accesses the data structure through the server (via a proxy), locking out the other processes while it does so.

From the documentation:

Server process

A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.

A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value and Array.
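A minimal sketch of this approach, assuming the pairwise computation is a function f and the input data lives in a dict D (both are placeholders for your real code):

```python
from multiprocessing import Manager, Process

def f(a, b):
    # placeholder for the real pairwise computation
    return a + b

def worker(D, results, pairs):
    # D and results are manager proxies; every access goes through the server process
    for s1, s2 in pairs:
        results[(s1, s2)] = f(D[s1], D[s2])

if __name__ == "__main__":
    manager = Manager()
    D = manager.dict({"a": 1, "b": 2, "c": 3})   # shared input data
    results = manager.dict()                     # shared result structure
    pairs = [("a", "b"), ("a", "c"), ("b", "c")]

    # split the pairs between two worker processes
    procs = [Process(target=worker, args=(D, results, pairs[i::2])) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(dict(results))
```

Note that every proxy access is a round trip to the server process, so this is convenient but not the fastest option for millions of tiny updates.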

Second option - Pool of workers

Create a Pool of workers, an input Queue and a result Queue.

  • The main process, acting as a producer, will feed the input Queue with pairs (s1, s2).
  • Each worker process will read a pair from the input Queue, compute f on it, and write the result into the result Queue.
  • The main thread will read the results from the result Queue and write them into the result dictionary, as sketched below.
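Here is one way that layout could look, again treating f and the keys of D as placeholders:

```python
from multiprocessing import Process, Queue

def f(a, b):
    return a * b                               # placeholder computation

def worker(in_q, out_q, D):
    # read pairs until the None sentinel arrives, then signal completion
    for s1, s2 in iter(in_q.get, None):
        out_q.put(((s1, s2), f(D[s1], D[s2])))
    out_q.put(None)

if __name__ == "__main__":
    D = {"a": 1, "b": 2, "c": 3}
    n_workers = 2
    in_q, out_q = Queue(), Queue()

    workers = [Process(target=worker, args=(in_q, out_q, D)) for _ in range(n_workers)]
    for w in workers:
        w.start()

    # producer: feed the input Queue with (s1, s2) pairs, then one sentinel per worker
    for pair in [("a", "b"), ("a", "c"), ("b", "c")]:
        in_q.put(pair)
    for _ in range(n_workers):
        in_q.put(None)

    # consumer: collect results until every worker has signalled completion
    results, done = {}, 0
    while done < n_workers:
        item = out_q.get()
        if item is None:
            done += 1
        else:
            key, value = item
            results[key] = value

    for w in workers:
        w.join()
    print(results)
```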

Third option - divide to independent problems

Your data is independent: computing f(D[si], D[sj]) is a self-contained problem, independent of any f(D[sk], D[sl]). Furthermore, the computation time of each pair should be fairly equal, or at least of the same order of magnitude.

Divide the task into n input sets, where n is the number of computation units (cores, or even machines) you have. Give each input set to a different process, and join the outputs.
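A sketch of this splitting with multiprocessing.Pool, where f, D and the chunking scheme are placeholders you would replace with your own:

```python
from itertools import combinations
from multiprocessing import Pool, cpu_count

D = {"a": 1, "b": 2, "c": 3, "d": 4}    # placeholder data

def f(a, b):
    return a - b                         # placeholder computation

def solve_chunk(pairs):
    # each chunk is a fully independent sub-problem
    return {(s1, s2): f(D[s1], D[s2]) for s1, s2 in pairs}

if __name__ == "__main__":
    pairs = list(combinations(D, 2))
    n = cpu_count()
    chunks = [pairs[i::n] for i in range(n)]    # n roughly equal input sets

    with Pool(n) as pool:
        partial_results = pool.map(solve_chunk, chunks)

    # join the outputs into a single result dictionary
    results = {}
    for part in partial_results:
        results.update(part)
    print(results)
```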


You definitely won't get any performance increase with threading - it is an inappropriate tool for CPU-bound tasks in CPython, because the GIL lets only one thread execute Python bytecode at a time.

So the only possible choice is multiprocessing. But since you have a big data structure, I'd suggest something like mmap (fairly low level, but built into the standard library) or Redis (a tasty, high-level API, but it has to be installed and configured).
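As one concrete variant of the mmap idea, a numpy.memmap lets every worker read the same file-backed array without copying it between processes; the file name and shape below are made up for illustration:

```python
import numpy as np
from multiprocessing import Pool

SHAPE = (1000, 1000)          # assumed shape of the shared matrix
PATH = "shared_data.dat"      # assumed backing file

def row_sum(i):
    # each worker re-opens the memmap read-only; the OS shares the underlying pages
    data = np.memmap(PATH, dtype=np.float64, mode="r", shape=SHAPE)
    return float(data[i].sum())

if __name__ == "__main__":
    # create and fill the backing file once in the parent process
    data = np.memmap(PATH, dtype=np.float64, mode="w+", shape=SHAPE)
    data[:] = np.random.rand(*SHAPE)
    data.flush()

    with Pool() as pool:
        sums = pool.map(row_sum, range(SHAPE[0]))
    print(sums[:5])
```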


Have you profiled your code? Is it just calculating f that is too expensive, or storing the results in the data structure (or maybe both)?
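A quick way to answer that with the standard library profiler; compute_all here stands in for whatever loop calls f and stores the results:

```python
import cProfile
import pstats

def compute_all():
    # placeholder for the loop that calls f and stores the results
    pass

cProfile.run("compute_all()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)   # top 10 entries by cumulative time
```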

If f is dominant, then you should make sure you can't make algorithmic improvements before you start worrying about parallelization. You might be able to get a big speed-up by turning some or all of the function into a C extension, perhaps using Cython. If you do go with multiprocessing, I don't see why you would need to pass the entire data structure between processes.

If storing results in the matrix is too expensive, you might speed up your code by using a more efficient data structure (like array.array or numpy.ndarray). Unless you have been very careful designing and implementing your custom matrix class, it will almost certainly be slower than those.
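For instance, a preallocated numpy.ndarray stores the results contiguously, which is typically much cheaper than a dict-of-dicts or a hand-rolled matrix class; the shape, dtype and f below are assumptions:

```python
import numpy as np

n = 1000                                        # number of items, assumed
results = np.empty((n, n), dtype=np.float64)    # preallocate the whole result matrix

def f(a, b):
    return a * b                                # placeholder computation

values = np.random.rand(n)                      # stand-in for D
for i in range(n):
    for j in range(i + 1, n):
        results[i, j] = f(values[i], values[j])
```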