
Finding the cause of a BrokenProcessPool in python's concurrent.futures


I think I got about as far as one can get with this:

I changed the _queue_management_worker function in my copy of the ProcessPoolExecutor module so that the exit code of the failed process gets printed:

def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.

    ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:
            # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
            vals = list(processes.values())
            for s in ready:
                j = sentinels.index(s)
                print("is_alive()", vals[j].is_alive())
                print("exitcode", vals[j].exitcode)
            # -------------------------------------------

            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...

Afterwards I looked up the meaning of the exit code:

from multiprocessing.process import _exitcode_to_name
print(_exitcode_to_name[my_exit_code])

where my_exit_code is the exit code printed by the block I inserted into _queue_management_worker. In my case the code was -11, which means the worker process died from a segmentation fault (SIGSEGV). Finding the reason for that segfault is a big task in itself and beyond the scope of this question.
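(For anyone who wants to see this failure mode end to end without patching the stdlib: the following is a minimal sketch, not part of my original debugging session. The crash function is hypothetical and exists only to force a segmentation fault in a worker, so the pool breaks in exactly the way described above and the dead worker's exit code is -11.)

import ctypes
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from multiprocessing.process import _exitcode_to_name

def crash():
    # Hypothetical worker task: dereference address 0 to force a segfault.
    ctypes.string_at(0)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(crash)
        try:
            future.result()
        except BrokenProcessPool as exc:
            # The pool is now unusable; the worker was killed by SIGSEGV.
            print("Pool is broken:", exc)

    # Map exit code -11 back to its signal name (the SIGSEGV entry).
    print(_exitcode_to_name[-11])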


If you are using macOS, there is a known issue: on some macOS versions, the way Python forks worker processes is not fork-safe in combination with certain system calls, which can crash the forked workers in some scenarios. The workaround that worked for me is to set the no_proxy environment variable.

Edit ~/.bash_profile and include the following (it might be better to specify a list of domains or subnets here instead of *):

no_proxy='*'

Refresh the current shell context:

source ~/.bash_profile
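Alternatively, if you prefer not to touch ~/.bash_profile, the same value can be set from Python itself. This is only a sketch under the assumption that the non-fork-safe proxy lookup in the forked children is what triggers the crash (as in the issues linked below); the variable must be set before any worker processes are created so they inherit it:

import os

# Set before the pool forks any workers, so the children skip the
# macOS proxy lookup that is not fork-safe.
os.environ["no_proxy"] = "*"

from concurrent.futures import ProcessPoolExecutor
# ... create the ProcessPoolExecutor and submit work as usual ...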

The local versions on which I saw the issue and worked around it are Python 3.6.0 on macOS 10.14.1 and 10.13.x.

Sources: Issue 30388, Issue 27126