Performance discrepancy between OSX and Linux for communication using Python multiprocessing


TL;DR: OSX is faster with Array because calls to the C library slow Array down on Linux

Using Array from multiprocessing goes through Python's ctypes library, which makes a C call to set the memory for the Array. This takes relatively more time on Linux than on OSX. You can also observe this on OSX by using pypy: setting memory takes much longer using pypy (built with GCC and LLVM) than using python3 on OSX (built with Clang).

TL;DR: the difference between Windows and OSX lies in the way multiprocessing starts new processes

The major difference is in the implementation of multiprocessing, which works differently under OSX than under Windows. The most important difference is the way multiprocessing starts a new process. There are three ways this can be done: using spawn, fork or forkserver. The default (and only supported) way under Windows is spawn. The default way under *nix has traditionally been fork (this included OSX until Python 3.8, which switched the macOS default to spawn). This is documented in the Contexts and start methods section of the multiprocessing documentation.
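To check which start method a platform uses, and to pin one explicitly so that runs on different platforms are comparable, something along these lines should work. It is a minimal sketch that only uses get_start_method, get_all_start_methods and get_context from multiprocessing; the variable names are my own.

```python
import multiprocessing as mp

# Default start method on this platform (for example 'fork' or 'spawn').
default_method = mp.get_start_method()

# All start methods this platform supports; 'spawn' is available everywhere.
available = mp.get_all_start_methods()

# A context pins a start method without changing the global default,
# so the same benchmark can be started the same way on every OS.
spawn_ctx = mp.get_context("spawn")

print(f"default: {default_method}, available: {available}")
print(f"pinned context uses: {spawn_ctx.get_start_method()}")
```

Objects created from the context, such as spawn_ctx.Process(...) or spawn_ctx.Queue(), then use the pinned start method.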

Another reason for the deviation in results is the low number of iterations you use.

If you increase the number of iterations and calculate the number of handled function calls per time unit, you get relatively consistent results between the three methods.
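As a rough illustration of what I mean by normalising to calls per time unit, here is a small sketch. The helper calls_per_second and the stand-in function dummy_transfer are hypothetical names, not part of your code; in practice you would pass in test_with_array, test_with_pipe or test_with_queue instead.

```python
import time


def calls_per_second(func, iters, *args):
    """Run func(*args) iters times and return the number of calls per second."""
    start = time.perf_counter()
    for _ in range(iters):
        func(*args)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a very coarse clock.
    return iters / elapsed if elapsed > 0 else float("inf")


def dummy_transfer(size):
    """Hypothetical stand-in for one of the test functions."""
    return bytes(size)


# With few iterations the rate is noisy; it settles as iters grows.
for iters in (10, 1000, 10000):
    rate = calls_per_second(dummy_transfer, iters, 1000)
    print(f"{iters:>6} iterations: {rate:,.0f} calls/s")
```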

Further analysis: look at the function calls with cProfile

I removed your timeit timer functions and wrapped your code in the cProfile profiler.

I added this wrapper function:

    def run_test(iters, size, func):
        for _ in range(iters):
            func(size)

And I replaced the loop in main() with:

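    # Needs `import cProfile`, `import pstats` and `import sys` at the top of the file.
    for func in [test_with_array, test_with_pipe, test_with_queue]:
        print(f"*** Running {func.__name__} ***")
        pr = cProfile.Profile()
        pr.enable()
        run_test(args.iters, args.size, func)
        pr.disable()
        # Print the profile sorted by cumulative time, without directory prefixes.
        ps = pstats.Stats(pr, stream=sys.stdout)
        ps.strip_dirs().sort_stats('cumtime').print_stats()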

Analysis of the OSX - Linux difference with Array

What I see is that Queue is faster than Pipe, which is faster than Array. Regardless of the platform (OSX/Linux/Windows), Queue is between 2 and 3 times faster than Pipe. On OSX and Windows, Pipe is between 1.2 and 1.5 times faster than Array. But on Linux, Pipe is around 3.6 times faster than Array. In other words, on Linux, Array is relatively much slower than on Windows and OSX. This is strange.

Using the cProfile data, I compared the performance ratio between OSX and Linux. There are two function calls that take a lot of time: Array and RawArray in sharedctypes.py. These functions are only called in the Array scenario (not in Pipe or Queue). On Linux, these calls take almost 70% of the time, while on OSX they take only 42% of the time. So this is a major factor.
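If you want to extract such a share from a profile yourself, a sketch like the following may help. Note the assumptions: workload is a hypothetical stand-in for the Array test (it passes lock=False to keep the focus on memory setup), share_of_function is my own helper, and it reads the stats dictionary of pstats.Stats, which is an implementation detail rather than a documented interface.

```python
import cProfile
import multiprocessing
import os
import pstats


def workload(iters=200, size=100_000):
    """Hypothetical stand-in for the Array test: allocate a shared array and write to it."""
    for _ in range(iters):
        arr = multiprocessing.Array('c', size, lock=False)
        arr[0] = b'A'


def share_of_function(profiler, file_name, func_name):
    """Return the cumulative time of file_name:func_name as a fraction of total profiled time."""
    stats = pstats.Stats(profiler)
    cumulative = 0.0
    # Keys are (path, line, name); values are (cc, nc, tottime, cumtime, callers).
    for (path, _line, name), (_cc, _nc, _tt, ct, _callers) in stats.stats.items():
        if os.path.basename(path) == file_name and name == func_name:
            cumulative = max(cumulative, ct)
    return cumulative / stats.total_tt if stats.total_tt else 0.0


pr = cProfile.Profile()
pr.enable()
workload()
pr.disable()

array_share = share_of_function(pr, "sharedctypes.py", "Array")
print(f"Array in sharedctypes.py: {array_share:.0%} of profiled time")
```

The percentage this prints depends on the workload and the machine, so it will not reproduce the 70% and 42% figures above; it only shows how to read the share from the profile.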

If we zoom in on the code, we see that Array (line 84) calls RawArray, and RawArray (line 54) does nothing special, except a call to ctypes.memset (see the ctypes documentation). So there we have a suspect. Let's test it.

The following code uses timeit to test the performance of setting 1 MB of memory buffer to 'A'.

    import timeit
    cmds = """\
    import ctypes
    s=ctypes.create_string_buffer(1024*1024)
    ctypes.memset(ctypes.addressof(s), 65, ctypes.sizeof(s))"""
    timeit.timeit(cmds, number=100000)
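Note that the statement above also times the import lookup and the buffer allocation on every iteration. As a variation (not what I measured), the sketch below moves those into the timeit setup so that only the memset call itself is timed. The number of runs and repeats is an arbitrary choice to keep it quick.

```python
import timeit

# Allocation happens once in setup; only memset is inside the timed statement.
setup = """\
import ctypes
buf = ctypes.create_string_buffer(1024 * 1024)
addr = ctypes.addressof(buf)
size = ctypes.sizeof(buf)
"""
stmt = "ctypes.memset(addr, 65, size)"

number = 200
runs = timeit.repeat(stmt, setup=setup, number=number, repeat=5)

# The minimum over the repeats is the least disturbed measurement.
best_per_call = min(runs) / number
print(f"memset of 1 MiB: {best_per_call * 1e6:.1f} us per call")
```

Comparing this number with the one from the combined statement gives an idea of how much of the time goes into allocation rather than into memset.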

Running this on my MacBook Pro and on my Linux server confirms that this runs much slower on Linux than on OSX. On OSX, pypy is compiled using GCC and Apple's LLVM, which makes it more akin to the Linux world than CPython, which on OSX is compiled directly with Clang. Normally, Python programs run faster on pypy than on CPython, but the code above runs 6.4 times slower on pypy (on the same hardware!).

My knowledge of C toolchains and C libraries is limited, so I can't dig deeper. So my conclusion is: OSX and Windows are faster with Array because memory calls to the C library slow Array down on Linux.

Analysis of the OSX - Windows performance difference

Next I ran this on my dual-boot MacBook Pro under OSX and under Windows. The advantage is that the underlying hardware is the same; only the OS is different. I increased the number of iterations to 1000 and the size to 10,000.

The results are as follows:

  • OSX:
    • Array: 225668 calls in 10.895 seconds
    • Pipe: 209552 calls in 6.894 seconds
    • Queue: 728173 calls in 7.892 seconds
  • Windows:
    • Array: 354076 calls in 296.050 seconds
    • Pipe: 374229 calls in 234.996 seconds
    • Queue: 903705 calls in 250.966 seconds

We can see that:

  1. The Windows implementation (using spawn) makes more function calls than OSX (using fork);
  2. The Windows implementation takes much more time per call than OSX.

What's not immediately evident, but relevant to note, is that if you look at the average time per call, the relative pattern between the three multiprocessing methods (Array, Queue and Pipe) is the same (see graphs below). In other words: the differences in performance between Array, Queue and Pipe in OSX and Windows can be completely explained by two factors: 1. the difference in Python performance between the two platforms; 2. the different ways both platforms handle multiprocessing.

More specifically: the difference in the number of calls is explained by the Contexts and start methods section of the multiprocessing documentation. The difference in execution time is explained by the performance difference of Python between OSX and Windows. If you factor out those two components, the relative performance of Array, Queue and Pipe is (more or less) comparable on OSX and Windows, as is shown in the graphs below.
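The average time per call can be worked out directly from the numbers listed above. The sketch below does only that arithmetic; the dictionary layout and the helper name per_call_us are my own.

```python
# (number of calls, total seconds) taken from the results listed above.
results = {
    "OSX": {
        "Array": (225668, 10.895),
        "Pipe": (209552, 6.894),
        "Queue": (728173, 7.892),
    },
    "Windows": {
        "Array": (354076, 296.050),
        "Pipe": (374229, 234.996),
        "Queue": (903705, 250.966),
    },
}


def per_call_us(platform_results):
    """Average time per call in microseconds for each method."""
    return {
        method: seconds / calls * 1e6
        for method, (calls, seconds) in platform_results.items()
    }


per_call = {platform: per_call_us(data) for platform, data in results.items()}

for platform, timings in per_call.items():
    for method, micros in timings.items():
        # Express each method relative to Queue, the fastest per call.
        relative = micros / timings["Queue"]
        print(f"{platform:8} {method:6} {micros:8.1f} us/call  ({relative:.2f}x Queue)")
```

On both platforms the ordering per call comes out as Array slowest, then Pipe, then Queue.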

[Figure: Performance differences of Array, Queue and Pipe between OSX and Windows]


Well, when we talk about multiprocessing with Python, these things hold:

  • The OS does all the multi-tasking work
  • The only option for multi-core concurrency
  • Duplicated use of system resources

There are huge differences between OSX and Linux. OSX is based on Unix and handles multitasking processes in a different way than Linux does.

A Unix installation requires strict, well-defined hardware and works only on specific CPUs, and maybe OSX is not designed to speed up Python processes. This may be the cause.

For more details you can read the multiprocessing documentation.

I hope it helps.