I have a job where I get a lot of separate tasks through. For each task I need to download some data, process it and then upload it again.
I'm using a multiprocessing pool for the processing.
I have a couple of issues I'm unsure about, though.
Firstly, the data can be up to roughly 20MB. I ideally want to get it to the child worker process without physically moving it in memory, and to get the resulting data back to the parent process without moving it either. As I'm not sure how some tools work under the hood, I don't know whether I can just pass the data as an argument to the pool's apply_async (from my understanding it serialises the objects, which are then recreated once they reach the subprocess?), or whether I should use a multiprocessing Queue, or mmap maybe, or something else?
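For reference, a stripped-down version of what I'm doing at the moment looks roughly like this (the worker function and the payload are just stand-ins for my real download/process/upload code):

from multiprocessing import Pool

def process(data):
    # stand-in for the real processing; runs in a child worker process
    return data[::-1]

def handle_result(result):
    # runs back in the parent; this is where the upload happens
    pass

if __name__ == '__main__':
    pool = Pool()
    for task in range(10):                # stand-in for the incoming tasks
        data = "x" * (20 * 1024 * 1024)   # stand-in for the ~20MB downloaded data
        # as far as I can tell, data is pickled here and unpickled in the child
        pool.apply_async(process, (data,), callback=handle_result)
    pool.close()
    pool.join()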
I looked at ctypes objects, but from what I understand only objects that are already defined when the pool is created (when the process forks) can be shared? That's no good for me, as I'll continuously have new data coming in which I need to share.
One thing I shouldn't need to worry about is concurrent access to the data, so I shouldn't need any kind of locking. This is because the processing will only start after the data has been downloaded, and the upload will only start after the output data has been generated.
Another issue I'm having is that the tasks coming in sometimes spike, and as a result I'm downloading data for the tasks quicker than the child processes can process it. Because I'm downloading data faster than I can finish the tasks and dispose of the data, Python dies from running out of memory. What would be a good way to hold the tasks up at the downloading stage when memory is almost full / too much data is in the job pipeline? I was thinking of some type of "ref" count based on the number of data bytes, so I can limit the amount of data between download and upload and only download when the count is below some threshold. Although I'd be worried a child might sometimes fail, and I'd never get to take the data it had off of the count. Is there a good way to achieve this kind of thing?
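To make the "ref count" idea a bit more concrete, something along these lines is roughly what I had in mind (just a sketch: the download/process/upload parts are replaced with stand-ins, and the 200MB limit is an arbitrary number):

import threading
from multiprocessing import Pool

class ByteBudget(object):
    # Blocks the download loop while too much data sits between download and upload.
    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0
        self.cond = threading.Condition()

    def acquire(self, nbytes):
        with self.cond:
            while self.in_flight + nbytes > self.limit:
                self.cond.wait()
            self.in_flight += nbytes

    def release(self, nbytes):
        with self.cond:
            self.in_flight -= nbytes
            self.cond.notify_all()

def process(data):
    # stand-in for the real processing done in the worker
    return data.upper()

if __name__ == '__main__':
    budget = ByteBudget(200 * 1024 * 1024)    # only allow ~200MB between download and upload
    pool = Pool()

    def task_done(result, nbytes):
        # runs in the parent once a worker finishes; the upload would go here,
        # and then the task's bytes are handed back to the budget.
        # NB: if the worker raises, this callback never fires and the budget
        # leaks -- which is exactly the failure case I'm worried about.
        budget.release(nbytes)

    for task in range(10):                    # stand-in for the incoming task stream
        data = "x" * (20 * 1024 * 1024)       # stand-in for the downloaded payload
        budget.acquire(len(data))             # blocks if too much data is already in flight
        pool.apply_async(process, (data,),
                         callback=lambda result, n=len(data): task_done(result, n))

    pool.close()
    pool.join()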
- If your network can produce data faster than your pool of processes can process it, then you shouldn't worry about moving data between processes: RAM is typically faster than the network, so your bottleneck is not moving data between processes but how fast they can process it. – jfs, Nov 15, 2012 at 14:25
- @Sebastian The speed of the network is irrelevant here, because the process can't start until all the data is in memory. If it were streaming from the network to the process then you'd be right. So the download happens, then the data has to get to the process (either by passing a reference, or by moving it physically to a new location in memory), and only then can the processing begin. So that time will add to the overall time. – GP89, Nov 15, 2012 at 16:22
- Read: "I'm downloading data for the tasks quicker than the child processes can process it." and then reread my previous comment. – jfs, Nov 15, 2012 at 18:36
- @Sebastian Ok sorry, I think I get you now. For some reason I was thinking the first thing the process needed to do was make a copy of the data, but actually that will probably happen when the object is shared. Either way, the process will have to make the copy, either to get the initial data or to put back the resulting data, so it definitely will add to the overall time. – GP89, Nov 15, 2012 at 19:04
3 Answers
(This is an outcome of the discussion of my previous answer)
Have you tried POSH?
This example shows that one can append elements to a mutable list, which is probably what you want (copied from the documentation):
import posh

l = posh.share(range(3))

if posh.fork():
    # parent process
    l.append(3)
    posh.waitall()
else:
    # child process
    l.append(4)
    posh.exit(0)

print l

-- Output --
[0, 1, 2, 3, 4]
-- OR --
[0, 1, 2, 4, 3]
2 Comments
- Segmentation fault.

Here is the canonical example from the multiprocessing documentation:
from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print num.value
    print arr[:]
Note that num and arr are shared objects. Is this what you are looking for?
12 Comments
- Value and Array objects that are defined when the process forks are shared. As I'm using a process pool that gets created when the program starts, before I have any data, I'm pretty sure I can't use these.

I cobbled this together, since I need to figure this out for myself anyway. I'm by no means very accomplished when it comes to multiprocessing or threading, but at least it works. Maybe it can be done in a smarter way; I couldn't figure out how to use the lock that comes with the non-raw Array type. Maybe someone will suggest improvements in the comments.
from multiprocessing import Process, Event
from multiprocessing.sharedctypes import RawArray

def modify(s, task_event, result_event):
    # worker: wait for a task, upper-case the shared buffer, signal the result
    for i in range(4):
        print "Worker: waiting for task"
        task_event.wait()
        task_event.clear()
        print "Worker: got task"
        s.value = s.value.upper()
        result_event.set()

if __name__ == '__main__':
    data_list = ("Data", "More data", "oh look, data!", "Captain Pickard")
    task_event = Event()
    result_event = Event()
    # shared buffer, sized to hold the longest string in data_list
    s = RawArray('c', "X" * max(map(len, data_list)))
    p = Process(target=modify, args=(s, task_event, result_event))
    p.start()

    for data in data_list:
        s.value = data
        task_event.set()
        print "Sent new task. Waiting for results."
        result_event.wait()
        result_event.clear()
        print "Got result: {0}".format(s.value)

    p.join()
In this example, data_list is defined beforehand, but it need not be. The only information I needed from that list was the length of the longest string. As long as you have some practical upper bound for the length, it's no problem (see the sketch at the end of this answer).
Here's the output of the program:
Sent new task. Waiting for results.
Worker: waiting for task
Worker: got task
Worker: waiting for task
Got result: DATA
Sent new task. Waiting for results.
Worker: got task
Worker: waiting for task
Got result: MORE DATA
Sent new task. Waiting for results.
Worker: got task
Worker: waiting for task
Got result: OH LOOK, DATA!
Sent new task. Waiting for results.
Worker: got task
Got result: CAPTAIN PICKARD
As you can see, btel did in fact provide the solution, but the problem lay in keeping the two processes in lockstep with each other, so that the worker only starts working on a new task when the task is ready, and so that the main process doesn't read the result before it's complete.
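If the data only shows up after the process has been started (as in the question), the same trick should still work as long as you pick some generous upper bound up front: allocate the shared buffer at that size once, and keep track of how much of it each payload actually uses. A rough sketch of that idea (MAX_LEN is an assumed limit, not something taken from the code above; buf and used would be passed to the worker via args, just like s is):

from multiprocessing import Value
from multiprocessing.sharedctypes import RawArray

MAX_LEN = 32 * 1024 * 1024          # assumed upper bound for a single payload

buf = RawArray('c', MAX_LEN)        # shared, lock-free character buffer
used = Value('i', 0)                # how many bytes of buf are currently valid

def put(data):
    # copy a freshly downloaded payload into the shared buffer (parent side)
    buf[:len(data)] = data
    used.value = len(data)

def get():
    # read the current payload back out (either side)
    return buf[:used.value]

The obvious cost is that every buffer always reserves MAX_LEN bytes, even when the actual payload is much smaller.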