
Combine Pool.map with shared memory Array in Python multiprocessing


Trying again as I just saw the bounty ;)

Basically I think the error message means what it says - multiprocessing shared memory Arrays can't be passed as arguments (by pickling). It doesn't make sense to serialise the data - the point is that the data is shared memory. So you have to make the shared array global. I think it's neater to put it as an attribute of a module, as in my first answer, but just leaving it as a global variable in your example also works well. Taking on board your point about not wanting to set the data before the fork, here is a modified example. If you wanted to have more than one possible shared array (and that's why you wanted to pass toShare as an argument), you could similarly make a global list of shared arrays and just pass the index to count_it (the loop would become for c in toShare[i]:).

from sys import stdin
from multiprocessing import Pool, Array, Process

def count_it( key ):
  count = 0
  for c in toShare:
    if c == key:
      count += 1
  return count

if __name__ == '__main__':
  # allocate shared array - want lock=False in this case since we
  # aren't writing to it and want to allow multiple processes to access
  # at the same time - I think with lock=True there would be little or
  # no speedup
  maxLength = 50
  toShare = Array('c', maxLength, lock=False)

  # fork
  pool = Pool()

  # can set data after fork
  testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
  if len(testData) > maxLength:
      raise ValueError, "Shared array too small to hold data"
  toShare[:len(testData)] = testData

  print pool.map( count_it, ["a", "b", "s", "d"] )
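A minimal sketch of that multiple-arrays variant (the toShareList name and the (index, key) argument tuples are my own illustration, not part of the original; like the example above, it relies on fork so the children inherit the global list):

from multiprocessing import Pool, Array

# hypothetical: several read-only shared arrays kept in a global list;
# each work item is an (index, key) tuple so the worker knows which
# array to scan
toShareList = []

def count_it(args):
    i, key = args
    count = 0
    for c in toShareList[i]:
        if c == key:
            count += 1
    return count

if __name__ == '__main__':
    # allocate the shared arrays before the fork so the children inherit them
    toShareList.append(Array('c', "abcabc", lock=False))
    toShareList.append(Array('c', "sdsdsd", lock=False))

    pool = Pool()
    print pool.map(count_it, [(0, "a"), (0, "b"), (1, "s"), (1, "d")])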

[EDIT: The above doesn't work on Windows because it relies on fork. However, the below does work on Windows, still using Pool, so I think this is the closest to what you want:

from sys import stdin
from multiprocessing import Pool, Array, Process
import mymodule

def count_it( key ):
  count = 0
  for c in mymodule.toShare:
    if c == key:
      count += 1
  return count

def initProcess(share):
  mymodule.toShare = share

if __name__ == '__main__':
  # allocate shared array - want lock=False in this case since we
  # aren't writing to it and want to allow multiple processes to access
  # at the same time - I think with lock=True there would be little or
  # no speedup
  maxLength = 50
  toShare = Array('c', maxLength, lock=False)

  # fork
  pool = Pool(initializer=initProcess, initargs=(toShare,))

  # can set data after fork
  testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
  if len(testData) > maxLength:
      raise ValueError, "Shared array too small to hold data"
  toShare[:len(testData)] = testData

  print pool.map( count_it, ["a", "b", "s", "d"] )

Not sure why map won't pickle the array but Process and Pool will - I think perhaps it has to be transferred at the point of the subprocess initialization on Windows. Note that the data is still set after the fork though.]
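The same initializer trick also works without a separate mymodule, by having the initializer stash the array in a plain module-level global. A sketch under the same assumptions (the structure is my own illustration; like the example above, the data is still set after the pool is created):

from multiprocessing import Pool, Array

toShare = None   # set in each worker process by the initializer

def initProcess(share):
    global toShare
    toShare = share

def count_it(key):
    count = 0
    for c in toShare:
        if c == key:
            count += 1
    return count

if __name__ == '__main__':
    maxLength = 50
    arr = Array('c', maxLength, lock=False)

    # the array reaches the workers via initargs at pool start-up,
    # so it never has to be pickled with the map() arguments
    pool = Pool(initializer=initProcess, initargs=(arr,))

    # data can still be set after the workers have started
    testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
    arr[:len(testData)] = testData

    print pool.map(count_it, ["a", "b", "s", "d"])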


If the data is read-only, just make it a variable in a module before the fork from Pool. Then all the child processes should be able to access it, and it won't be copied provided you don't write to it.

from multiprocessing import Pool
import myglobals # anything (empty .py file)

myglobals.data = []

def count_it( key ):
    count = 0
    for c in myglobals.data:
        if c == key:
            count += 1
    return count

if __name__ == '__main__':
    myglobals.data = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"

    pool = Pool()
    print pool.map( count_it, ["a", "b", "s", "d"] )

If you do want to try to use Array, though, you could try it with the lock=False keyword argument (it is True by default).
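For example, combining the two suggestions - a shared Array created with lock=False and stored in the module before the fork - might look like this (a sketch that, like the module-global approach above, relies on fork; myglobals is the same empty helper module):

from multiprocessing import Pool, Array
import myglobals  # the same empty module as above

def count_it(key):
    count = 0
    for c in myglobals.data:
        if c == key:
            count += 1
    return count

if __name__ == '__main__':
    testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
    # lock=False returns the raw shared ctypes array with no lock wrapper,
    # which is fine here because the workers only read from it
    myglobals.data = Array('c', testData, lock=False)

    pool = Pool()  # fork happens here; the children inherit myglobals.data
    print pool.map(count_it, ["a", "b", "s", "d"])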


The problem I see is that Pool doesn't support pickling shared data through its argument list. That's what the error message means by "objects should only be shared between processes through inheritance". The shared data needs to be inherited, i.e., global if you want to share it using the Pool class.

If you need to pass them explicitly, you may have to use multiprocessing.Process. Here is your reworked example:

from multiprocessing import Process, Array, Queue

def count_it( q, arr, key ):
  count = 0
  for c in arr:
    if c == key:
      count += 1
  q.put((key, count))

if __name__ == '__main__':
  testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
  # want to share it using shared memory
  toShare = Array('c', testData)

  q = Queue()
  keys = ['a', 'b', 's', 'd']
  workers = [Process(target=count_it, args = (q, toShare, key))
    for key in keys]

  for p in workers:
    p.start()
  for p in workers:
    p.join()
  while not q.empty():
    print q.get(),

Output:

('s', 9) ('a', 2) ('b', 3) ('d', 12)

The ordering of elements of the queue may vary.

To make this more generic and similar to Pool, you could create a fixed number N of Processes, split the list of keys into N pieces, and then use a wrapper function as the Process target, which calls count_it for each key in the slice it is passed, like:

def wrapper( q, arr, keys ):
  for k in keys:
    count_it(q, arr, k)
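A fuller sketch of that idea (the choice of N and the way the keys are split are arbitrary illustrations, not from the original):

from multiprocessing import Process, Array, Queue

def count_it(q, arr, key):
    count = 0
    for c in arr:
        if c == key:
            count += 1
    q.put((key, count))

def wrapper(q, arr, keys):
    # each worker handles its own slice of the key list
    for k in keys:
        count_it(q, arr, k)

if __name__ == '__main__':
    testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
    toShare = Array('c', testData)
    q = Queue()

    keys = ['a', 'b', 's', 'd']
    N = 2  # number of worker processes - an arbitrary choice for this sketch
    chunks = [keys[i::N] for i in range(N)]  # split the keys into N pieces

    workers = [Process(target=wrapper, args=(q, toShare, chunk))
               for chunk in chunks]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    while not q.empty():
        print q.get(),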