Finding the cause of a BrokenProcessPool in python's concurrent.futures
I think I got about as far as one can with diagnosis alone: I changed the _queue_management_worker method in my local copy of the ProcessPoolExecutor module so that the exit code of the failed process is printed:
    def _queue_management_worker(executor_reference,
                                 processes,
                                 pending_work_items,
                                 work_ids_queue,
                                 call_queue,
                                 result_queue):
        """Manages the communication between this process and the worker processes.
        ...
        """
        executor = None

        def shutting_down():
            return _shutdown or executor is None or executor._shutdown_thread

        def shutdown_worker():
            ...

        reader = result_queue._reader

        while True:
            _add_call_item_to_queue(pending_work_items,
                                    work_ids_queue,
                                    call_queue)

            sentinels = [p.sentinel for p in processes.values()]
            assert sentinels
            ready = wait([reader] + sentinels)
            if reader in ready:
                result_item = reader.recv()
            else:
                # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
                vals = list(processes.values())
                for s in ready:
                    j = sentinels.index(s)
                    print("is_alive()", vals[j].is_alive())
                    print("exitcode", vals[j].exitcode)
                # -------------------------------------------

                # Mark the process pool broken so that submits fail right now.
                executor = executor_reference()
                if executor is not None:
                    executor._broken = True
                    executor._shutdown_thread = True
                    executor = None
                # All futures in flight must be marked failed
                for work_id, work_item in pending_work_items.items():
                    work_item.future.set_exception(
                        BrokenProcessPool(
                            "A process in the process pool was "
                            "terminated abruptly while the future was "
                            "running or pending."
                        ))
                    # Delete references to object. See issue16284
                    del work_item
                pending_work_items.clear()
                # Terminate remaining workers forcibly: the queues or their
                # locks may be in a dirty state and block forever.
                for p in processes.values():
                    p.terminate()
                shutdown_worker()
                return
        ...
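For context, here is a minimal, hypothetical way to provoke this failure mode without patching anything: a worker that dies abruptly (os._exit below stands in for a real crash such as a segfault; from the pool's perspective both look like a worker that vanished).

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash():
    # Exit immediately without any cleanup; a real segfault kills
    # the worker just as abruptly from the pool's point of view.
    os._exit(1)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(crash)
        try:
            future.result()
        except BrokenProcessPool as exc:
            print("Pool broken:", exc)
```

With the diagnostic block above in place, running something like this prints the dead worker's exitcode before the pool is marked broken.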
Afterwards I looked up the meaning of the exit code:
    from multiprocessing.process import _exitcode_to_name
    print(_exitcode_to_name[my_exit_code])
where my_exit_code is the exit code printed by the block I inserted into _queue_management_worker. In my case the code was -11, which means the worker ran into a segmentation fault. Finding the reason for that segfault will be a huge task, but it goes beyond the scope of this question.
If you are using macOS, there is a known issue: some versions of macOS use fork() in a way that Python does not consider fork-safe in certain scenarios, which can kill worker processes and break the pool. The workaround that worked for me is to set the no_proxy environment variable.
Edit ~/.bash_profile and add the following (it might be better to specify a list of domains or subnets here instead of *):
no_proxy='*'
Then reload it in your current shell:
source ~/.bash_profile
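To confirm the setting actually reached the environment that Python inherits (a quick sanity check; the variable name matches the profile entry above):

```shell
# Print the variable in the current shell; it should show the value
# from ~/.bash_profile, e.g. '*'. If it is empty, the profile was
# not sourced or a new terminal is needed.
echo "no_proxy=${no_proxy:-<unset>}"
```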
The local versions on which I saw the issue and the workaround worked: Python 3.6.0 on macOS 10.14.1 and 10.13.x.
Sources: Issue 30388, Issue 27126