
Parallelism in Python


Generally, you're describing a CPU-bound calculation. This is not Python's forte. Neither, historically, is multiprocessing.

Threading in the mainstream Python interpreter has been ruled by the dreaded Global Interpreter Lock (GIL). The newer multiprocessing API works around that and provides a worker-pool abstraction, along with pipes, queues, and the like.
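For example, here is a minimal sketch (not from the original answer; the square function and inputs are made up for illustration) of spawning worker processes that send results back over a Queue:

import multiprocessing as mp

def square(n, out_queue):
    # Runs in a separate process, so it is not constrained by the GIL.
    out_queue.put(n * n)

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=square, args=(i, queue)) for i in range(4)]
    for w in workers:
        w.start()
    # Collect one result per worker; ordering depends on which finishes first.
    results = sorted(queue.get() for _ in range(4))
    for w in workers:
        w.join()
    print(results)  # [0, 1, 4, 9]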

You can write your performance critical code in C or Cython, and use Python for the glue.


The multiprocessing module (new in Python 2.6) is the way to go. It uses subprocesses, which gets around the GIL problem. It also abstracts away some of the local/remote issues, so the choice of running your code locally or spread out over a cluster can be made later. The multiprocessing documentation is a fair bit to chew on, but it should provide a good basis to get started.
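As a rough sketch of the typical pattern (the heavy_computation function and inputs below are made up for illustration), a CPU-bound function can be fanned out over a pool of worker processes with Pool.map:

from multiprocessing import Pool

def heavy_computation(n):
    # Stand-in for a CPU-bound task.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    pool = Pool(processes=4)
    # Each input is handled in a separate worker process, so no single
    # interpreter's GIL serializes the work.
    results = pool.map(heavy_computation, [10**6] * 4)
    pool.close()
    pool.join()
    print(results)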


Ray is an elegant (and fast) library for doing this.

The most basic strategy for parallelizing Python functions is to declare a function with the @ray.remote decorator. It can then be invoked asynchronously by calling its .remote() method.

import ray
import time

# Start the Ray processes (e.g., a scheduler and shared-memory object store).
ray.init(num_cpus=8)

@ray.remote
def f():
    time.sleep(1)

# This should take one second assuming you have at least 4 cores.
ray.get([f.remote() for _ in range(4)])

You can also parallelize stateful computation using actors, again with the @ray.remote decorator, this time applied to a class.

# This assumes you already ran 'import ray' and 'ray.init()'.

import time

@ray.remote
class Counter(object):
    def __init__(self):
        self.x = 0

    def inc(self):
        self.x += 1

    def get_counter(self):
        return self.x

# Create two actors which will operate in parallel.
counter1 = Counter.remote()
counter2 = Counter.remote()

@ray.remote
def update_counters(counter1, counter2):
    for _ in range(1000):
        time.sleep(0.25)
        counter1.inc.remote()
        counter2.inc.remote()

# Start three tasks that update the counters in the background, also in parallel.
update_counters.remote(counter1, counter2)
update_counters.remote(counter1, counter2)
update_counters.remote(counter1, counter2)

# Check the counter values.
for _ in range(5):
    counter1_val = ray.get(counter1.get_counter.remote())
    counter2_val = ray.get(counter2.get_counter.remote())
    print("Counter1: {}, Counter2: {}".format(counter1_val, counter2_val))
    time.sleep(1)

Ray has a number of advantages over the multiprocessing module, notably that the same code can run on a single multi-core machine or scale out to a large cluster.

Ray is a framework I've been helping develop.