How to speed up communication with subprocesses


I think you are just being misled by the way cProfile works. For example, here's a simple script that uses two threads:

    #!/usr/bin/python

    import threading
    import time

    def f():
        time.sleep(10)


    def main():
        t = threading.Thread(target=f)
        t.start()
        t.join()

If I test this using cProfile, here's what I get:

    >>> import test
    >>> import cProfile
    >>> cProfile.run('test.main()')
             60 function calls in 10.011 seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000   10.011   10.011 <string>:1(<module>)
            1    0.000    0.000   10.011   10.011 test.py:10(main)
            1    0.000    0.000    0.000    0.000 threading.py:1008(daemon)
            2    0.000    0.000    0.000    0.000 threading.py:1152(currentThread)
            2    0.000    0.000    0.000    0.000 threading.py:241(Condition)
            2    0.000    0.000    0.000    0.000 threading.py:259(__init__)
            2    0.000    0.000    0.000    0.000 threading.py:293(_release_save)
            2    0.000    0.000    0.000    0.000 threading.py:296(_acquire_restore)
            2    0.000    0.000    0.000    0.000 threading.py:299(_is_owned)
            2    0.000    0.000   10.011    5.005 threading.py:308(wait)
            1    0.000    0.000    0.000    0.000 threading.py:541(Event)
            1    0.000    0.000    0.000    0.000 threading.py:560(__init__)
            2    0.000    0.000    0.000    0.000 threading.py:569(isSet)
            4    0.000    0.000    0.000    0.000 threading.py:58(__init__)
            1    0.000    0.000    0.000    0.000 threading.py:602(wait)
            1    0.000    0.000    0.000    0.000 threading.py:627(_newname)
            5    0.000    0.000    0.000    0.000 threading.py:63(_note)
            1    0.000    0.000    0.000    0.000 threading.py:656(__init__)
            1    0.000    0.000    0.000    0.000 threading.py:709(_set_daemon)
            1    0.000    0.000    0.000    0.000 threading.py:726(start)
            1    0.000    0.000   10.010   10.010 threading.py:911(join)
           10   10.010    1.001   10.010    1.001 {method 'acquire' of 'thread.lock' objects}
            2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            4    0.000    0.000    0.000    0.000 {method 'release' of 'thread.lock' objects}
            4    0.000    0.000    0.000    0.000 {thread.allocate_lock}
            2    0.000    0.000    0.000    0.000 {thread.get_ident}
            1    0.000    0.000    0.000    0.000 {thread.start_new_thread}

As you can see, it says that almost all of the time is spent acquiring locks. Of course, we know that's not really an accurate representation of what the script was doing. All the time was actually spent in a time.sleep call inside f(). The high tottime of the acquire call is just because join was waiting for f to finish, which means it had to sit and wait to acquire a lock. However, cProfile doesn't show any time being spent in f at all. We can clearly see what is actually happening because the example code is so simple, but in a more complicated program, this output is very misleading.

You can get more reliable results by using another profiling library, like yappi:

    >>> import test
    >>> import yappi
    >>> yappi.set_clock_type("wall")
    >>> yappi.start()
    >>> test.main()
    >>> yappi.get_func_stats().print_all()

    Clock type: wall
    Ordered by: totaltime, desc

    name                                    #n         tsub      ttot      tavg
    <stdin>:1 <module>                      2/1        0.000025  10.00801  5.004003
    test.py:10 main                         1          0.000060  10.00798  10.00798
    ..2.7/threading.py:308 _Condition.wait  2          0.000188  10.00746  5.003731
    ..thon2.7/threading.py:911 Thread.join  1          0.000039  10.00706  10.00706
    ..ython2.7/threading.py:752 Thread.run  1          0.000024  10.00682  10.00682
    test.py:6 f                             1          0.000013  10.00680  10.00680
    ..hon2.7/threading.py:726 Thread.start  1          0.000045  0.000608  0.000608
    ..thon2.7/threading.py:602 _Event.wait  1          0.000029  0.000484  0.000484
    ..2.7/threading.py:656 Thread.__init__  1          0.000064  0.000250  0.000250
    ..on2.7/threading.py:866 Thread.__stop  1          0.000025  0.000121  0.000121
    ..lib/python2.7/threading.py:541 Event  1          0.000011  0.000101  0.000101
    ..python2.7/threading.py:241 Condition  2          0.000025  0.000094  0.000047
    ..hreading.py:399 _Condition.notifyAll  1          0.000020  0.000090  0.000090
    ..2.7/threading.py:560 _Event.__init__  1          0.000018  0.000090  0.000090
    ..thon2.7/encodings/utf_8.py:15 decode  2          0.000031  0.000071  0.000035
    ..threading.py:259 _Condition.__init__  2          0.000064  0.000069  0.000034
    ..7/threading.py:372 _Condition.notify  1          0.000034  0.000068  0.000068
    ..hreading.py:299 _Condition._is_owned  3          0.000017  0.000040  0.000013
    ../threading.py:709 Thread._set_daemon  1          0.000018  0.000035  0.000035
    ..ding.py:293 _Condition._release_save  2          0.000019  0.000033  0.000016
    ..thon2.7/threading.py:63 Thread._note  7          0.000020  0.000020  0.000003
    ..n2.7/threading.py:1152 currentThread  2          0.000015  0.000019  0.000009
    ..g.py:296 _Condition._acquire_restore  2          0.000011  0.000017  0.000008
    ../python2.7/threading.py:627 _newname  1          0.000014  0.000014  0.000014
    ..n2.7/threading.py:58 Thread.__init__  4          0.000013  0.000013  0.000003
    ..threading.py:1008 _MainThread.daemon  1          0.000004  0.000004  0.000004
    ..hon2.7/threading.py:569 _Event.isSet  2          0.000003  0.000003  0.000002

With yappi, it's much easier to see that the time is being spent in f.
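If installing yappi is not an option, a rough standard-library alternative is to enable a separate cProfile.Profile inside the worker thread, so that the calls made there are recorded. The sketch below is not from the original example: it is written for Python 3, the sleep is shortened to 0.2 seconds, and the helper names profiled and report are made up for illustration.

```python
import cProfile
import io
import pstats
import threading
import time


def f():
    time.sleep(0.2)  # shortened from the 10 s used in the example above


def profiled(target, profiler):
    """Wrap target so the given profiler is enabled inside the thread that runs it."""
    def wrapper():
        profiler.enable()
        try:
            target()
        finally:
            profiler.disable()
    return wrapper


def main():
    worker_profiler = cProfile.Profile()
    t = threading.Thread(target=profiled(f, worker_profiler))
    t.start()
    t.join()
    return worker_profiler


def report(profiler):
    """Return the profiler's statistics as text, sorted by cumulative time."""
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
    return out.getvalue()
```

Calling print(report(main())) should list f and the time.sleep call with their cumulative times, which is the information missing from the plain cProfile.run output.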

I suspect that you'll find that in reality, most of your script's time is spent doing whatever work is being done in produceA, produceB, and produceC.


TL;DR If your program runs slower than expected, it is probably due to the details of what the intermediate functions do rather than due to IPC or threading. Test with mock functions and processes (as simple as possible) to isolate just the overhead of passing data to/from subprocesses. In a benchmark based closely on your code (below), the performance when passing data to/from subprocesses seems to be roughly equivalent to using shell pipes directly; python is not particularly slow at this task.

What is going on with the original code

The general form of the original code is:

    def produceB(from_stream, to_stream):
        while True:
            buf = from_stream.read()
            processed_buf = do_expensive_calculation(buf)
            to_stream.write(processed_buf)

Here the calculation between read and write takes about 2/3 of the total CPU time of all processes (main and sub) combined. Note that this is CPU time, not wall-clock time.

I think that this prevents the I/O from running at full speed. Reads and writes and the calculation each need to have their own thread, with queues to provide buffering between the read and calculation and between the calculation and write (since the amount of buffering that pipes provide is insufficient, I believe).

I show below that if there is no processing between the read and the write (or, equivalently, if the intermediate processing is done in a separate thread), then the throughput from threads + subprocess is very high. It is also possible to have separate threads for reads and writes; this adds a bit of overhead but keeps writes from blocking reads and vice versa. Three threads (read, write and processing) is even better: then no step blocks the others (within the limits of the queue sizes, of course).
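As a rough illustration of the three-thread layout described above, here is a minimal sketch. It is written for Python 3 (the listings in this answer are Python 2.7), it was not part of the benchmarks, and the names run_pipeline, reader, processor, writer and the queue_depth parameter are made up for illustration.

```python
import io
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def reader(src, out_q, read_size):
    """Read chunks from src and hand them to the processing thread."""
    while True:
        buf = src.read(read_size)
        if not buf:
            break
        out_q.put(buf)  # blocks when the queue is full
    out_q.put(_DONE)


def processor(in_q, out_q, func):
    """Apply func to each chunk and pass the result to the writer thread."""
    while True:
        buf = in_q.get()
        if buf is _DONE:
            break
        out_q.put(func(buf))
    out_q.put(_DONE)


def writer(in_q, dst):
    """Write processed chunks to dst in the order they arrive."""
    while True:
        buf = in_q.get()
        if buf is _DONE:
            break
        dst.write(buf)


def run_pipeline(src, dst, func, read_size=8 * 1024, queue_depth=16):
    """Run read, process and write in three threads joined by bounded queues."""
    raw_q = queue.Queue(maxsize=queue_depth)
    done_q = queue.Queue(maxsize=queue_depth)
    threads = [
        threading.Thread(target=reader, args=(src, raw_q, read_size)),
        threading.Thread(target=processor, args=(raw_q, done_q, func)),
        threading.Thread(target=writer, args=(done_q, dst)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With subprocesses, src would be something like A_process.stdout and dst B_process.stdin. Note that in CPython a pure-Python calculation in the processing thread still competes for the GIL, so this layout mainly helps when the calculation releases the GIL or when the time is dominated by I/O.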

Some benchmarks

All benchmarking below is on python 2.7.6 on Ubuntu 14.04LTS 64bit (Intel i7, Ivy Bridge, quad core). The test is to transfer approx 1GB of data in 4KB blocks between two dd processes, and pass the data through python as an intermediary. The dd processes use medium sized (4KB) blocks; typical text I/O would be smaller (unless it is cleverly buffered by the interpreter, etc), typical binary I/O would of course be much larger. I have one example based on exactly how you did this, and one example based on an alternate approach I had tried some time ago (which turns out to be slower). By the way, thanks for posting this question, it is useful.

Threads and blocking I/O

First, let's convert the original code in the question into a slightly simpler self-contained example. This is just two processes communicating with a thread that pumps data from one to the other, doing blocking reads and writes.

    import subprocess, threading

    A_process = subprocess.Popen(["dd", "if=/dev/zero", "bs=4k", "count=244140"], stdout=subprocess.PIPE)
    B_process = subprocess.Popen(["dd", "of=/dev/null", "bs=4k"], stdin=subprocess.PIPE)

    def convert_A_to_B(src, dst):
        read_size = 8*1024
        while True:
            try:
                buf = src.read(read_size)
                if len(buf) == 0:  # This is a bit hacky, but seems to reliably happen when the src is closed
                    break
                dst.write(buf)
            except ValueError as e: # Reading or writing on a closed fd causes ValueError, not IOError
                print str(e)
                break

    convert_A_to_B_thread = threading.Thread(target=convert_A_to_B, args=(A_process.stdout, B_process.stdin))
    convert_A_to_B_thread.start()

    # Here, watch out for the exact sequence to clean things up
    convert_A_to_B_thread.join()
    A_process.wait()
    B_process.stdin.close()
    B_process.wait()

Results:

    244140+0 records in
    244140+0 records out
    999997440 bytes (1.0 GB) copied, 0.638977 s, 1.6 GB/s
    244140+0 records in
    244140+0 records out
    999997440 bytes (1.0 GB) copied, 0.635499 s, 1.6 GB/s

    real    0m0.678s
    user    0m0.657s
    sys     0m1.273s

Not bad! It turns out that the ideal read size in this case is roughly 8KB-16KB; much smaller and much larger sizes are somewhat slower. This is probably related to the 4KB block size we asked dd to use.

Select and non-blocking I/O

When I was looking at this type of problem before, I headed in the direction of using select(), nonblocking I/O, and a single thread. An example of that is in my question here: How to read and write from subprocesses asynchronously?. That was for reading from two processes in parallel, which I have extended below to reading from one process and writing to another. The nonblocking writes are limited to PIPE_BUF or less in size, which is 4KB on my system; for simplicity, the reads are also 4KB although they could be any size. This has a few weird corner cases (and inexplicable hangs, depending on the details) but in the form below it works reliably.

    import subprocess, select, fcntl, os, sys

    p1 = subprocess.Popen(["dd", "if=/dev/zero", "bs=4k", "count=244140"], stdout=subprocess.PIPE)
    p2 = subprocess.Popen(["dd", "of=/dev/null", "bs=4k"], stdin=subprocess.PIPE)

    def make_nonblocking(fd):
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

    make_nonblocking(p1.stdout)
    make_nonblocking(p2.stdin)

    print "PIPE_BUF = %d" % (select.PIPE_BUF)

    read_size = select.PIPE_BUF
    max_buf_len = 1 # For reasons which I have not debugged completely, this hangs sometimes when set > 1
    bufs = []

    while True:
        inputready, outputready, exceptready = select.select([ p1.stdout.fileno() ],[ p2.stdin.fileno() ],[])

        for fd in inputready:
            if fd == p1.stdout.fileno():
                if len(bufs) < max_buf_len:
                    data = p1.stdout.read(read_size)
                    bufs.append(data)
        for fd in outputready:
            if fd == p2.stdin.fileno() and len(bufs) > 0:
                data = bufs.pop(0)
                p2.stdin.write(data)

        p1.poll()
        # If the first process is done and there is nothing more to write out
        if p1.returncode != None and len(bufs) == 0:
            # Again cleanup is tricky.  We expect the second process to finish soon after its input is closed
            p2.stdin.close()
            p2.wait()
            p1.wait()
            break

Results:

    PIPE_BUF = 4096
    244140+0 records in
    244140+0 records out
    999997440 bytes (1.0 GB) copied, 3.13722 s, 319 MB/s
    244133+0 records in
    244133+0 records out
    999968768 bytes (1.0 GB) copied, 3.13599 s, 319 MB/s

    real    0m3.167s
    user    0m2.719s
    sys     0m2.373s

This is however significantly slower than the version above (even if the read/write size is made 4KB in both for an apples-to-apples comparison). I'm not sure why.

P.S. Late addition: It appears that it is ok to ignore or exceed PIPE_BUF. This causes an IOError exception to be thrown much of the time from p2.stdin.write() (errno=11, temporarily unavailable), presumably when there is enough room in the pipe to write something, but less than the full size we are requesting. The same code above with read_size = 64*1024, and with that exception caught and ignored, runs at 1.4GB/s.
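The P.S. above catches and ignores the error. A variation, sketched below for Python 3 and not the code that was benchmarked, is to keep whatever could not be written and retry it on the next select() round, so that no data is dropped. The helper name write_some is made up for illustration; it uses os.write directly on the file descriptor rather than the file object's write().

```python
import errno
import os


def write_some(fd, data):
    """Try to write data to a non-blocking fd and return the unwritten remainder."""
    try:
        written = os.write(fd, data)
    except OSError as e:
        # errno 11 on Linux is EAGAIN: the pipe has no room right now.
        if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return data  # keep everything for the next attempt
        raise
    return data[written:]  # a partial write leaves a non-empty tail
```

In the select() loop, the caller would push a non-empty remainder back to the front of bufs instead of discarding it.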

Pipe directly

Just as a baseline, how fast is it to run this using the shell version of pipes (in subprocess)? Let's have a look:

    import subprocess

    subprocess.call("dd if=/dev/zero bs=4k count=244140 | dd of=/dev/null bs=4k", shell=True)

Results:

    244140+0 records in
    244140+0 records out
    244140+0 records in
    244140+0 records out
    999997440 bytes (1.0 GB) copied, 0.425261 s, 2.4 GB/s
    999997440 bytes (1.0 GB) copied, 0.423687 s, 2.4 GB/s

    real    0m0.466s
    user    0m0.300s
    sys     0m0.590s

This is notably faster than the threaded python example. However, this is just one copy, while the threaded python version is doing two (into and out of python). Modifying the command to "dd if=/dev/zero bs=4k count=244140 | dd bs=4k | dd of=/dev/null bs=4k" brings the throughput down to 1.6 GB/s, in line with the python example.

How to run a comparison in a complete system

Some additional thoughts on how to run a comparison in a complete system. Again for simplicity this is just two processes, and both scripts have the exact same convert_A_to_B() function.

Script 1: Pass data in python, as above

    A_process = subprocess.Popen(["A", ...
    B_process = subprocess.Popen(["B", ...
    convert_A_to_B_thread = threading.Thread(target=convert_A_to_B, ...

Script 2: Comparison script, pass data in shell

convert_A_to_B(sys.stdin, sys.stdout)

Run this in the shell with: A | python script_2.py | B

This allows an apples-to-apples comparison in a complete system, without using mock functions/processes.
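For reference, a fuller version of Script 2 might look like the sketch below. It is an assumption about how the elided parts fit together, written for Python 3, where the binary standard streams are reached through .buffer; the read_size default and the main function are added for illustration.

```python
import io
import sys


def convert_A_to_B(src, dst, read_size=8 * 1024):
    """Copy src to dst in read_size chunks until src reports end of file."""
    while True:
        buf = src.read(read_size)
        if not buf:
            break
        dst.write(buf)
    dst.flush()


def main():
    # Binary stdin/stdout on Python 3; on Python 2.7 sys.stdin and
    # sys.stdout would be passed directly, as in the one-liner above.
    convert_A_to_B(sys.stdin.buffer, sys.stdout.buffer)
```

Saved as script_2.py with a final call to main(), it would be run as A | python3 script_2.py | B.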

How does block read size affect the results

For this test, the code from the first (threaded) example above is used, and both dd and the python script are set to use the same block size reads/writes.

| Block size | Throughput |
|------------|------------|
| 1KB | 249MB/s |
| 2KB | 416MB/s |
| 4KB | 552MB/s |
| 8KB | 1.4GB/s |
| 16KB | 1.8GB/s |
| 32KB | 2.9GB/s |
| 64KB | 3.0GB/s |
| 128KB | 1.0GB/s |
| 256KB | 600MB/s |

In theory there should be better performance with larger buffers (perhaps up to cache effects) but in practice Linux pipes slow down with very large buffers, even when using pure shell pipes.
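To repeat this kind of measurement without dd, one option is a small timing harness like the sketch below. It is a stand-in, not the benchmark used for the table: it pushes zeros through an os.pipe() between two threads of a single Python 3 process, so the absolute numbers will differ from the dd results. The function name pipe_throughput is made up for illustration.

```python
import os
import threading
import time


def pipe_throughput(block_size, total_bytes):
    """Push total_bytes through an OS pipe in block_size chunks.

    Returns (bytes_received, bytes_per_second).
    """
    read_fd, write_fd = os.pipe()
    block = b"\0" * block_size

    def produce():
        remaining = total_bytes
        try:
            while remaining > 0:
                # The payload is all zeros, so restarting from the front of
                # the block after a short write does not change the data.
                written = os.write(write_fd, block[:min(block_size, remaining)])
                remaining -= written
        finally:
            os.close(write_fd)  # signals end of file to the reader

    producer = threading.Thread(target=produce)
    start = time.perf_counter()
    producer.start()
    received = 0
    while True:
        chunk = os.read(read_fd, block_size)
        if not chunk:
            break
        received += len(chunk)
    producer.join()
    elapsed = max(time.perf_counter() - start, 1e-9)
    os.close(read_fd)
    return received, received / elapsed
```

Looping over block sizes such as 1024, 4096 and 65536 with a fixed total_bytes gives a table of the same shape as the one above.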


Your calls to subprocess.Popen() implicitly specify the default value of bufsize, 0, which forces unbuffered I/O. Try adding a reasonable buffer size (4K, 16K, even 1M) and see if it makes any difference.
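A minimal sketch of passing bufsize explicitly, assuming Python 3 and a hypothetical child command (a Python one-liner that copies stdin to stdout, standing in for the real program). Note that the default of 0 described above is the Python 2 behaviour; from Python 3.3.1 on, bufsize defaults to -1, which selects the system default buffer size.

```python
import subprocess
import sys

# Hypothetical child: copies its stdin to its stdout, standing in for a real command.
CHILD = [
    sys.executable, "-c",
    "import sys, shutil; shutil.copyfileobj(sys.stdin.buffer, sys.stdout.buffer)",
]


def echo_through_child(payload, bufsize):
    """Send payload through the child using pipes opened with the given bufsize."""
    proc = subprocess.Popen(
        CHILD,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        bufsize=bufsize,  # e.g. 4 * 1024, 16 * 1024 or 1024 * 1024
    )
    out, _ = proc.communicate(payload)  # avoids deadlock on large payloads
    return out
```

Timing the same transfer with a few different bufsize values is the quickest way to see whether buffering matters for a given workload.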