
I have a Python module, dataProcessor.py, which initialises a large amount of data into memory (approximately 3 GB). I want to use this module in several processes that run simultaneously.

The problem is that the machine does not have enough memory to run everything at the same time, because dataProcessor.py loads the data into memory separately for every process (3 GB each, so 9 GB in total for 3 processes).

I tried using a server–client model to initialise the data only once and serve all processes, but that model is too slow. Is there any way to load the data only once and let the other processes access the methods in dataProcessor.py?

The module I am talking about is spaCy, which is written in Cython. The data can be any Python object and won't change once written. A solution involving a C extension for Python would be acceptable.

Is there any alternative to the server–client or subprocess model that shares memory?

asked Jan 22, 2017 at 13:16
  • Please edit your question to supply more information. Currently it's not clear what you are looking for. (1) What kind of data is this? Does it consist of Python objects, or is it essentially just an array of numbers? (2) Is it read-only once initialized, or will the data change over time? (2a) If one process changes the data, should this affect other processes? (3) Considering your server–client prototype was too slow, what kind of performance do you require? (4) Does the solution have to use pure Python, or would you be comfortable with using C to represent the expensive data? Commented Jan 22, 2017 at 13:53
  • After thinking about this for a while, I don't believe there is any easy solution to share all that state between processes. On Linux, it might be possible to load the module in a parent process and then fork() it for each actual process. As long as the data is not modified, the data of the parent process will be shared as copy-on-write. Otherwise, your server–client model makes sense, and you could try to investigate why it's too slow and how that could be improved. That's probably the most promising approach here. Commented Jan 23, 2017 at 18:14
  • Is it possible to use multiple threads instead of multiple processes? Commented Feb 22, 2017 at 21:32
  • @EarlCrapstone It is easier to implement this using threads as they share memory spaces, but I want to keep the processes independent. Commented Feb 23, 2017 at 8:29
  • @amon: Assuming the contents of the memory is identical across the processes, you can share it between them with mmap. This is the same effect as your fork idea but may be less work. Commented Aug 14, 2024 at 19:01

2 Answers


First, if you can, put the data initialisation into a function, so that it doesn't happen on import. This also helps with testing.

You can use multiprocessing.sharedctypes to create variables shared across multiple processes, assuming you are forking into multiple processes (not creating multiple threads). Note that the shared array must be inherited by the forked workers (e.g. defined at module level before the pool is created) rather than passed as a task argument.

Example:

from ctypes import c_double
from multiprocessing import Pool
from multiprocessing.sharedctypes import Array

# 402653184 doubles * 8 bytes = 3 GB; lock=False is fine for
# data that is written once and then only read.
arr = Array(c_double, 402653184, lock=False)
arr[0] = 1.0
arr[1] = 2.0

def fn(j):
    # The shared array is inherited by the forked workers as a
    # global; passing a synchronized Array as an argument fails.
    print(arr[j])

if __name__ == "__main__":
    with Pool() as p:
        for j in range(3):
            p.apply_async(fn, (j,))
        p.close()
        p.join()
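If the shared data is numeric, it can also be viewed through NumPy without copying. A small sketch, assuming NumPy is installed; `RawArray` is the lock-free variant, which is adequate for data that is written once and then only read:

```python
import ctypes
import numpy as np
from multiprocessing.sharedctypes import RawArray

# A small shared buffer for illustration (lock-free, written once).
raw = RawArray(ctypes.c_double, 8)

# View the shared memory as a NumPy array without copying;
# forked workers inheriting `raw` can build the same view.
view = np.frombuffer(raw, dtype=np.float64)
view[:] = np.arange(8, dtype=np.float64)

print(view.sum())  # 28.0
```

Writes through `view` land directly in the shared buffer, so forked processes that wrap the same `RawArray` see them.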
answered Sep 1, 2017 at 12:00

Hide the module behind an API and run it as a server. Instantiate the server that implements this API once, and make all communication to and from this module go through the API and the server. You can use IPC or a REST API directly; it does not really matter.

That said, it's not really the quick and easy solution to the problem, but it should do the job.

answered Jan 23, 2017 at 19:34
  • I have already tried using the server and client model, but the server is the bottleneck, so I was hoping to find some other method to perform the same task. Commented Jan 23, 2017 at 19:41
  • What do you mean by "the server is the bottleneck"? If you have concurrency issues within the library, you should address those; any solution where it is instantiated only once will face these issues regardless of how the communication takes place. Commented Jan 23, 2017 at 19:49
