
Multiprocessing.Pool makes Numpy matrix multiplication slower


Regarding the fact that all of your processes are running on the same CPU, see my answer here.

During import, numpy changes the CPU affinity of the parent process, such that when you later use Pool, all of the worker processes that it spawns will end up vying for the same core, rather than using all of the cores available on your machine.

You can call taskset after you import numpy to reset the CPU affinity so that all cores are used:

import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool


def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':

    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    print timeit(lambda: map(mmul, matrices), number=20)

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all cores
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)
    print timeit(lambda: pool.map(mmul, matrices), number=20)

Output:

    $ python tmp.py
    12.4765810966
    pid 29150's current affinity mask: 1
    pid 29150's new affinity mask: ff
    13.4136221409

If you watch CPU usage using top while you run this script, you should see it using all of your cores when it executes the 'parallel' part. As others have pointed out, in your original example the overhead involved in pickling data, process creation etc. probably outweighs any possible benefit from parallelisation.
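As an aside, on Python 3.3+ under Linux the same affinity reset can be done from within Python via os.sched_setaffinity, without shelling out to taskset. A minimal sketch (assuming a Linux system; the function is not available on all platforms):

```python
import os

# Reset this process's CPU affinity to every core the machine has,
# undoing any restrictive mask set during numpy's import.
# os.sched_getaffinity(0) returns the current mask for inspection.
os.sched_setaffinity(0, range(os.cpu_count()))
print(sorted(os.sched_getaffinity(0)))
```

Because it runs in-process, this also avoids depending on the external taskset binary being installed.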

Edit: I suspect that part of the reason why the single process seems to be consistently faster is that numpy may have some tricks for speeding up that element-wise matrix multiplication that it cannot use when the jobs are spread across multiple cores.

For example, if I just use ordinary Python lists to compute the Fibonacci sequence, I can get a huge speedup from parallelisation. Likewise, if I do element-wise multiplication in a way that takes no advantage of vectorization, I get a similar speedup for the parallel version:

import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool


def fib(dummy):
    n = [1,1]
    for ii in xrange(100000):
        n.append(n[-1]+n[-2])


def silly_mult(matrix):
    for row in matrix:
        for val in row:
            val * val

if __name__ == '__main__':

    dt = timeit(lambda: map(fib, xrange(10)), number=10)
    print "Fibonacci, non-parallel: %.3f" % dt

    matrices = [np.random.randn(1000,1000) for ii in xrange(10)]
    dt = timeit(lambda: map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, non-parallel: %.3f" % dt

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all CPUs
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)

    dt = timeit(lambda: pool.map(fib, xrange(10)), number=10)
    print "Fibonacci, parallel: %.3f" % dt

    dt = timeit(lambda: pool.map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, parallel: %.3f" % dt

Output:

$ python tmp.py
Fibonacci, non-parallel: 32.449
Silly matrix multiplication, non-parallel: 40.084
pid 29528's current affinity mask: 1
pid 29528's new affinity mask: ff
Fibonacci, parallel: 9.462
Silly matrix multiplication, parallel: 12.163


The unpredictable competition between communication overhead and computation speedup is definitely the issue here. What you are observing is perfectly fine. Whether you get a net speed-up depends on many factors and is something that has to be quantified properly (as you did).

So why is multiprocessing so "unexpectedly slow" in your case? multiprocessing's map and map_async functions actually pickle Python objects back and forth through pipes that connect the parent with the child processes. This may take a considerable amount of time. During that time, the child processes have almost nothing to do, which is what you see in htop. Between different systems, there might be a considerable difference in pipe transport performance, which is also why for some people the pool code is faster than the single-CPU code, although for you it is not (other factors might come into play here; this is just an example to explain the effect).
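You can measure this serialization cost directly. A rough sketch (Python 3 syntax, unlike the original snippets) that pickles and unpickles one of the matrices from the question, which is roughly what Pool.map does per task in each direction:

```python
import pickle
import time

import numpy as np

# A 1000x1000 float64 matrix is about 8 MB of raw data; Pool.map must
# serialize it into the pipe and the worker must deserialize it again.
matrix = np.random.rand(1000, 1000)

t0 = time.time()
payload = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)
elapsed = time.time() - t0

print("payload size: %.1f MB" % (len(payload) / 1e6))
print("round trip:   %.4f s" % elapsed)
```

If the round-trip time per matrix is not small compared to the compute time per matrix, the pool cannot win.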

What can you do to make it faster?

  1. Don't pickle the input on POSIX-compliant systems.

    If you are on Unix, you can get around the parent->child communication overhead by taking advantage of POSIX's process fork behavior (copy-on-write memory):

    Create your job input (e.g. a list of large matrices) to work on in the parent process in a globally accessible variable. Then create worker processes by calling multiprocessing.Process() yourself. In the children, grab the job input from the global variable. Simply expressed, this makes the child access the memory of the parent without any communication overhead (*, explanation below). Send the result back to the parent, through e.g. a multiprocessing.Queue. This will save a lot of communication overhead, especially if the output is small compared to the input. This method won't work on e.g. Windows, because multiprocessing.Process() there creates an entirely new Python process that does not inherit the state of the parent.

  2. Make use of numpy multithreading. Depending on your actual calculation task, it might happen that involving multiprocessing won't help at all. If you compile numpy yourself and enable OpenMP directives, then operations on large matrices might become very efficiently multithreaded (and distributed over many CPU cores; the GIL is no limiting factor here) by themselves. Basically, this is the most efficient usage of multiple CPU cores you can get in the context of numpy/scipy.

*The child cannot directly access the parent's memory in general. However, after fork(), parent and child are in an equivalent state. It would be wasteful to copy the entire memory of the parent to another place in RAM. That's where the copy-on-write principle kicks in. As long as the child does not change its memory state, it actually accesses the parent's memory. Only upon modification are the corresponding bits and pieces copied into the memory space of the child.
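A minimal sketch of the pattern from point 1 (Python 3 syntax; the names INPUT and worker are just illustrative, and the fork start method is forced explicitly, which raises an error on Windows, exactly the portability caveat mentioned above):

```python
import multiprocessing as mp

import numpy as np

# Force POSIX fork so children inherit the parent's memory via
# copy-on-write. Not available on Windows.
mp.set_start_method("fork", force=True)

# The job input lives in a module-level variable; forked children read
# it directly instead of receiving a pickled copy through a pipe.
INPUT = [np.random.rand(200, 200) for _ in range(4)]


def worker(result_queue, index):
    # Access the inherited INPUT; only the small result travels back
    # to the parent through the queue.
    result_queue.put((index, float(INPUT[index].sum())))


queue = mp.Queue()
children = [mp.Process(target=worker, args=(queue, i)) for i in range(4)]
for c in children:
    c.start()
results = dict(queue.get() for _ in range(len(children)))
for c in children:
    c.join()
print(sorted(results))  # -> [0, 1, 2, 3]
```

The large matrices are never serialized; only one float per task crosses the pipe. A fuller, benchmarked version of the same idea follows below.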

Major edit:

Let me add a piece of code that crunches a large amount of input data with multiple worker processes and follows the advice from "1. Don't pickle the input on POSIX-compliant systems." Furthermore, the amount of information transferred back to the worker manager (the parent process) is quite low. The heavy computation part of this example is a singular value decomposition, which can make heavy use of OpenMP. I have executed the example multiple times:

  • Once with 1, 2, or 4 worker processes and OMP_NUM_THREADS=1, so each worker process creates a maximum load of 100 %. There, the number-of-workers vs. compute-time scaling behavior is almost linear and the net speedup factor corresponds to the number of workers involved.
  • Once with 1, 2, or 4 worker processes and OMP_NUM_THREADS=4, so that each process creates a maximum load of 400 % (via spawning 4 OpenMP threads). My machine has 16 real cores, so 4 processes with max 400 % load each will almost get the maximum performance out of the machine. The scaling is not perfectly linear anymore and the speedup factor is not the number of workers involved, but the absolute calculation time becomes significantly reduced compared to OMP_NUM_THREADS=1 and time still decreases significantly with the number of worker processes.
  • Once with larger input data, 4 worker processes, and OMP_NUM_THREADS=4. It results in an average system load of 1253 %.
  • Once with the same setup as the last, but OMP_NUM_THREADS=5. It results in an average system load of 1598 %, which suggests that we got everything out of that 16-core machine. However, the actual computation wall time does not improve compared to the previous case.

The code:

import os
import time
import math
import numpy as np
from numpy.linalg import svd as svd
import multiprocessing


# If numpy is compiled for OpenMP, then make sure to control
# the number of OpenMP threads via the OMP_NUM_THREADS environment
# variable before running this benchmark.

MATRIX_SIZE = 1000
MATRIX_COUNT = 16


def rnd_matrix():
    offset = np.random.randint(1,10)
    stretch = 2*np.random.rand()+0.1
    return offset + stretch * np.random.rand(MATRIX_SIZE, MATRIX_SIZE)


print "Creating input matrices in parent process."
# Create input in memory. Children access this input.
INPUT = [rnd_matrix() for _ in xrange(MATRIX_COUNT)]


def worker_function(result_queue, worker_index, chunk_boundary):
    """Work on a certain chunk of the globally defined `INPUT` list.
    """
    result_chunk = []
    for m in INPUT[chunk_boundary[0]:chunk_boundary[1]]:
        # Perform singular value decomposition (CPU intense).
        u, s, v = svd(m)
        # Build single numeric value as output.
        output = int(np.sum(s))
        result_chunk.append(output)
    result_queue.put((worker_index, result_chunk))


def work(n_workers=1):

    def calc_chunksize(l, n):
        """Rudimentary function to calculate the size of chunks for equal
        distribution of a list `l` among `n` workers.
        """
        return int(math.ceil(len(l)/float(n)))

    # Build boundaries (indices for slicing) for chunks of `INPUT` list.
    chunk_size = calc_chunksize(INPUT, n_workers)
    chunk_boundaries = [
        (i, i+chunk_size) for i in xrange(0, len(INPUT), chunk_size)]

    # When n_workers and input list size are of same order of magnitude,
    # the above method might have created fewer chunks than workers available.
    if n_workers != len(chunk_boundaries):
        return None

    result_queue = multiprocessing.Queue()
    # Prepare child processes.
    children = []
    for worker_index in xrange(n_workers):
        children.append(
            multiprocessing.Process(
                target=worker_function,
                args=(
                    result_queue,
                    worker_index,
                    chunk_boundaries[worker_index],
                    )
                )
            )

    # Run child processes.
    for c in children:
        c.start()

    # Create result list of length of `INPUT`. Assign results upon arrival.
    results = [None] * len(INPUT)

    # Wait for all results to arrive.
    for _ in xrange(n_workers):
        worker_index, result_chunk = result_queue.get(block=True)
        chunk_boundary = chunk_boundaries[worker_index]
        # Store the chunk of results just received to the overall result list.
        results[chunk_boundary[0]:chunk_boundary[1]] = result_chunk

    # Join child processes (clean up zombies).
    for c in children:
        c.join()
    return results


def main():
    durations = []
    n_children = [1, 2, 4]
    for n in n_children:
        print "Crunching input with %s child(ren)." % n
        t0 = time.time()
        result = work(n)
        if result is None:
            continue
        duration = time.time() - t0
        print "Result computed by %s child process(es): %s" % (n, result)
        print "Duration: %.2f s" % duration
        durations.append(duration)
    normalized_durations = [durations[0]/d for d in durations]
    for n, normdur in zip(n_children, normalized_durations):
        print "%s-children speedup: %.2f" % (n, normdur)


if __name__ == '__main__':
    main()

The output:

$ export OMP_NUM_THREADS=1
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 1 child(ren).
Result computed by 1 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 16.66 s
Crunching input with 2 child(ren).
Result computed by 2 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 8.27 s
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 4.37 s
1-children speedup: 1.00
2-children speedup: 2.02
4-children speedup: 3.81
48.75user 1.75system 0:30.00elapsed 168%CPU (0avgtext+0avgdata 1007936maxresident)k
0inputs+8outputs (1major+809308minor)pagefaults 0swaps

$ export OMP_NUM_THREADS=4
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 1 child(ren).
Result computed by 1 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 8.62 s
Crunching input with 2 child(ren).
Result computed by 2 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 4.92 s
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 2.95 s
1-children speedup: 1.00
2-children speedup: 1.75
4-children speedup: 2.92
106.72user 3.07system 0:17.19elapsed 638%CPU (0avgtext+0avgdata 1022240maxresident)k
0inputs+8outputs (1major+841915minor)pagefaults 0swaps

$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [21762, 26806, 10148, 22947, 20900, 8161, 20168, 17439, 23497, 26360, 6789, 11216, 12769, 23022, 26221, 20480, 19140, 13757, 23692, 19541, 24644, 21251, 21000, 21687, 32187, 5639, 23314, 14678, 18289, 12493, 29766, 14987, 12580, 17988, 20853, 4572, 16538, 13284, 18612, 28617, 19017, 23145, 11183, 21018, 10922, 11709, 27895, 8981]
Duration: 12.69 s
4-children speedup: 1.00
174.03user 4.40system 0:14.23elapsed 1253%CPU (0avgtext+0avgdata 2887456maxresident)k
0inputs+8outputs (1major+1211632minor)pagefaults 0swaps

$ export OMP_NUM_THREADS=5
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [19528, 17575, 21792, 24303, 6352, 22422, 25338, 18183, 15895, 19644, 20161, 22556, 24657, 30571, 13940, 18891, 10866, 21363, 20585, 15289, 6732, 10851, 11492, 29146, 12611, 15022, 18967, 25171, 10759, 27283, 30413, 14519, 25456, 18934, 28445, 12768, 28152, 24055, 9285, 26834, 27731, 33398, 10172, 22364, 12117, 14967, 18498, 8111]
Duration: 13.08 s
4-children speedup: 1.00
230.16user 5.98system 0:14.77elapsed 1598%CPU (0avgtext+0avgdata 2898640maxresident)k
0inputs+8outputs (1major+1219611minor)pagefaults 0swaps


Your code is correct. I just ran it on my system (with 2 cores, hyperthreading) and obtained the following results:

$ python test_multi.py
30.8623809814
19.3914041519

I looked at the processes and, as expected, the parallel part shows several processes working at near 100%. This must be something in your system or Python installation.