How do I parallelize a simple Python loop?
Using multiple threads on CPython won't give you better performance for pure-Python code due to the global interpreter lock (GIL). I suggest using the multiprocessing module instead:
    import multiprocessing

    # calc_stuff and offset are the function and step size from the question
    pool = multiprocessing.Pool(4)
    out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))
Note that this won't work in the interactive interpreter.
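A self-contained sketch of the snippet above, runnable as a script. `calc_stuff` and `offset` come from the question and are not shown here, so the body of `calc_stuff` (returning three values per input), the helper `run_parallel`, and the default `offset` are placeholders assumed for illustration. The `if __name__ == "__main__":` guard is there because worker processes may re-import the main module.

```python
import multiprocessing


def calc_stuff(parameter):
    """Placeholder for the question's function: returns three values per input."""
    return parameter, parameter ** 2, parameter ** 3


def run_parallel(offset=5, workers=4):
    """Apply calc_stuff to ten inputs in a pool of worker processes."""
    inputs = range(0, 10 * offset, offset)
    # The context manager shuts the pool down once the block exits.
    with multiprocessing.Pool(workers) as pool:
        results = pool.map(calc_stuff, inputs)  # list of 3-tuples, in input order
    # zip(*results) turns the list of 3-tuples into three tuples of ten values each.
    out1, out2, out3 = zip(*results)
    return out1, out2, out3


if __name__ == "__main__":
    out1, out2, out3 = run_parallel()
    print(out1)
    print(out2)
    print(out3)
```

Saving this to a file and running it with `python` sidesteps the interactive-interpreter limitation, since the workers can then find `calc_stuff` in an importable module.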
To avoid the usual FUD around the GIL: There wouldn't be any advantage to using threads for this example anyway. You want to use processes here, not threads, because they avoid a whole bunch of problems.
    from joblib import Parallel, delayed

    def process(i):
        return i * i

    results = Parallel(n_jobs=2)(delayed(process)(i) for i in range(10))
    print(results)  # prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
The above works beautifully on my machine (Ubuntu, package joblib was pre-installed, but can be installed via pip install joblib).
Edit on Mar 31, 2021:
- joblib in the above code uses import multiprocessing under the hood (and thus multiple processes, which is typically the best way to run CPU work across cores, because of the GIL).
- You can let joblib use multiple threads instead of multiple processes, but this (or using import threading directly) is only beneficial if the threads spend considerable time on I/O (e.g. reading from or writing to disk, sending an HTTP request). While a thread waits on I/O, the GIL does not block the execution of another thread.
- Since Python 3.7, as an alternative to threading, you can parallelise work with asyncio, but the same advice applies as for import threading (though in contrast to the latter, only one thread is used).
- Using multiple processes incurs overhead. You need to check for yourself whether the above code snippet improves your wall time. Here is another one, for which I confirmed that joblib produces better results:
    import time
    from joblib import Parallel, delayed

    def countdown(n):
        while n > 0:
            n -= 1
        return n

    t = time.time()
    for _ in range(20):
        print(countdown(10**7), end=" ")
    print(time.time() - t)
    # takes ~10.5 seconds on medium sized Macbook Pro

    t = time.time()
    results = Parallel(n_jobs=2)(delayed(countdown)(10**7) for _ in range(20))
    print(results)
    print(time.time() - t)
    # takes ~6.3 seconds on medium sized Macbook Pro
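The point about I/O-bound work can be sketched with the standard library alone. In the example below, `time.sleep` and `asyncio.sleep` stand in for real I/O such as a disk read or an HTTP request; the function names, the 0.2-second delay, and the task count are assumptions made for illustration, not part of the answer above.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(i, delay=0.2):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL while waiting."""
    time.sleep(delay)
    return i * i


async def fake_io_async(i, delay=0.2):
    """Coroutine version: awaiting the sleep lets other tasks run in the same thread."""
    await asyncio.sleep(delay)
    return i * i


def run_threads(n=8):
    # One thread per task, so the waits overlap instead of adding up.
    with ThreadPoolExecutor(max_workers=n) as executor:
        return list(executor.map(fake_io, range(n)))


async def run_async(n=8):
    # gather runs the coroutines concurrently and keeps results in argument order.
    return await asyncio.gather(*(fake_io_async(i) for i in range(n)))


if __name__ == "__main__":
    t = time.perf_counter()
    print(run_threads(), time.perf_counter() - t)

    t = time.perf_counter()
    print(asyncio.run(run_async()), time.perf_counter() - t)
```

Run one after another, eight 0.2-second waits would take about 1.6 seconds; both versions should finish in roughly the time of a single wait, though exact timings depend on the machine. Replacing the sleeps with a CPU-bound loop such as `countdown` would remove that benefit, which is why processes are preferred for that case.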
To parallelize a simple for loop, joblib adds a lot of value over raw use of multiprocessing: not only the short syntax, but also things like transparent batching of iterations when they are very fast (to reduce the per-task overhead) and capturing the traceback of the child process, for better error reporting.
Disclaimer: I am the original author of joblib.
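To make the batching point concrete: with raw multiprocessing, `Pool.imap` sends one item per task by default, and grouping cheap iterations is done by hand through its `chunksize` argument. The sketch below uses only the standard library; `tiny_task`, `timed_imap`, and the chunk size of 1000 are arbitrary names and values chosen for illustration, and whether batching helps on a given machine needs measuring.

```python
import multiprocessing
import time


def tiny_task(x):
    """A very cheap function, so per-task dispatch overhead dominates."""
    return x + 1


def timed_imap(n_items, chunksize, workers=2):
    """Run tiny_task over range(n_items); return (results, elapsed seconds)."""
    start = time.perf_counter()
    with multiprocessing.Pool(workers) as pool:
        # chunksize controls how many items are sent to a worker per task.
        results = list(pool.imap(tiny_task, range(n_items), chunksize=chunksize))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for chunksize in (1, 1000):
        results, elapsed = timed_imap(100_000, chunksize)
        print(f"chunksize={chunksize}: {len(results)} results in {elapsed:.2f}s")
```

The elapsed time includes starting the pool, so the comparison is only a rough indication of dispatch overhead.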