I have a Python module dataProcessor.py which initialises a large amount of data into memory (approximately 3GB). I want to use this module in several processes running simultaneously.
The problem is that there is not enough memory on the machine to run everything at the same time, because dataProcessor.py loads the data into memory in every process (3GB per process, so 9GB in total for 3 processes).
I tried a server-client model to initialise the data only once and serve all processes, but it is too slow. Is there any way to load the data only once and have the other processes access the methods in dataProcessor.py?
The module I am talking about is Spacy, which is written in Cython. The data can be any Python object and won't change once written. A solution that uses a C extension for Python is fine.
Is there any alternative to the server-client or subprocess model which shares memory?
2 Answers
First, if you can, put the data initialisation into a function (so initialisation won't happen on import). This helps with testing etc.
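For example, a minimal sketch of that layout (the load_data name and the spacy.load call inside it are only an illustration of the idea, not a prescribed API):

# dataProcessor.py -- lazy, one-time initialisation instead of work at import time
_nlp = None

def load_data():
    # Load the ~3GB of data the first time it is requested, then reuse it.
    global _nlp
    if _nlp is None:
        import spacy
        _nlp = spacy.load('en')
    return _nlp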
You can use multiprocessing.sharedctypes
to create variables shared across multiple processes, assuming you're forking into multiple processes (and not creating multiple threads). You can then make these shared variables available to the forked processes, for example via a pool initializer.
Example:
from ctypes import c_double
from multiprocessing import Pool
from multiprocessing.sharedctypes import Array

def init_worker(shared_arr):
    # Shared ctypes objects can't be sent through the pool's task queue,
    # so each worker receives a reference once, at creation time.
    global worker_arr
    worker_arr = shared_arr

def fn(j):
    print(worker_arr[j])

if __name__ == '__main__':
    arr = Array(c_double, 402653184)  # 402653184 doubles * 8 bytes = 3 GB
    arr[0] = 1.0
    arr[1] = 2.0
    # ...
    with Pool(initializer=init_worker, initargs=(arr,)) as p:
        results = [p.apply_async(fn, (j,)) for j in range(3)]
        for r in results:
            r.get()  # wait for the workers before the pool shuts down
Hide the module behind an API and run it as a server. Instantiate the server that implements this API once, and make all communication to and from the module go through that API. You can use IPC or a REST API directly; it does not really matter.
That said, it's not the quick and easy solution to the problem, but it should do the job.
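A rough sketch of the IPC variant using multiprocessing.managers (the dataProcessor.load_data helper and the token_count method are hypothetical, only meant to show the shape of the solution):

# server side: owns the 3GB of data, started once
from multiprocessing.managers import BaseManager
import dataProcessor  # the module from the question

class NlpService:
    def __init__(self):
        self._nlp = dataProcessor.load_data()  # loaded exactly once, in this process
    def token_count(self, text):
        return len(self._nlp(text))

class DataManager(BaseManager):
    pass

if __name__ == '__main__':
    service = NlpService()
    DataManager.register('get_service', callable=lambda: service)
    manager = DataManager(address=('127.0.0.1', 50000), authkey=b'secret')
    manager.get_server().serve_forever()

# client side: any number of these can run concurrently
from multiprocessing.managers import BaseManager

class DataManager(BaseManager):
    pass

DataManager.register('get_service')
manager = DataManager(address=('127.0.0.1', 50000), authkey=b'secret')
manager.connect()
service = manager.get_service()           # proxy to the single server-side object
print(service.token_count('some text'))   # runs in the server, result is returned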
- I have already tried the server-client model, but the server is the bottleneck, so I was hoping to find some other method to perform the same task. – Harwee, Jan 23, 2017 at 19:41
- What do you mean by "the server is the bottleneck"? If you have concurrency issues within the library, you should address those; any solution where it is instantiated only once will face those issues regardless of how the communication takes place. – Newtopian, Jan 23, 2017 at 19:49
Load the data once in a parent process and fork() it for each actual process. As long as the data is not modified, the data of the parent process will be shared copy-on-write. Otherwise, your server-client model makes sense, and you could try to investigate why it's too slow and how that could be improved. That's probably the most promising approach here.
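A minimal sketch of that approach, assuming Linux (where fork is the default start method) and the hypothetical load_data helper from above:

# Only works with the fork start method (Unix): children share the
# parent's memory copy-on-write, so the 3 GB is loaded exactly once.
import multiprocessing as mp
import dataProcessor  # the module from the question

def worker(i):
    # nlp was created before the fork, so this reads the parent's copy
    # instead of loading another 3 GB.
    doc = nlp('some text')
    return (i, len(doc))

if __name__ == '__main__':
    mp.set_start_method('fork')      # explicit; fork is the default on Linux
    nlp = dataProcessor.load_data()  # hypothetical helper: load the data once
    with mp.Pool(3) as p:
        print(p.map(worker, range(3)))

Note that CPython's reference counting writes to the objects it touches, so some of the shared pages may gradually get copied anyway, but the bulk of a read-only data set typically stays shared.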