I will break down this question into two parts.
I have code similar to this
for data in data_list:
    # pseudo database query just to give an example; this single query
    # usually takes around 30-50 seconds
    rslt = query in database where data == 10
    if rslt.property == 'some condition here':
        return rslt
The conditions here are:
- We have to return the first element of data_list that matches the condition after the query.
- Each database query for each element takes around 30-40 seconds.
- data_list is usually very big, around 15-20k elements.
- Unfortunately we cannot do a single database query for the whole data_list. We have to do this in a loop, one element at a time.
Now my questions are:
- How can I optimize this process? Currently the whole process takes around 3-4 hours.
- I read about Python threading and multiprocessing, but I am confused about which one would be appropriate in this case.
1 Answer
You could consider using a multiprocessing Pool. You can then use map to send chunks of your iterable to the Pool's workers, which process them with a given function. So, let's say your query is a function, say, query(data):
def query(data):
    rslt = query in database where data == 10   # same pseudo query as above
    if rslt.property == 'some condition here':
        return rslt
We will use the pool like so:
from multiprocessing import Pool

with Pool() as pool:
    results = pool.map(query, data_list)
Now, per your requirement, we will find the first one:
print(next(filter(None, results)))
Note that using the function query this way means that results will be a list of rslt objects and Nones, and we are looking for the first non-None result.
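To make that filter/next step concrete, here is a tiny sketch on made-up data (the dicts just stand in for whatever rslt objects your queries return):

results = [None, None, {'id': 7}, None, {'id': 9}]

first_match = next(filter(None, results))          # picks the first truthy entry, {'id': 7}
first_or_none = next(filter(None, results), None)  # a default avoids StopIteration when nothing matches

Since pool.map preserves the order of data_list, the first non-None entry really is the first matching element, which is what the question asks for.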
A few notes:
- Note that the Pool constructor's first argument is processes, which allows you to choose how many processes the pool will hold: "If processes is None then the number returned by os.cpu_count() is used."
- Note that map also has a chunksize argument, which defaults to 1 and lets you choose the size of the chunks passed to the workers: "This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer."
Continuing with map, the docs recommend using imap with an explicit chunksize for large iterables, for better efficiency: "Note that it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency."
And from the imap docs: "The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1."

So we could actually be more efficient and do:

chunksize = 100
processes = 10

with Pool(processes=processes) as pool:
    print(next(filter(None, pool.imap(query, data_list, chunksize=chunksize))))

And here you could play with chunksize and even processes (back from the Pool) and see what combination produces the best results.

If you are interested, you could easily switch to threads instead of processes by simply changing your import statement to:
from multiprocessing.dummy import Pool

As the docs say:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
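Since each item spends most of its time waiting on the database rather than using the CPU, a thread-backed pool is a reasonable fit here as well. As a sketch (the processes and chunksize values are only starting points to tune, and query is the same function defined above):

from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but backed by threads

with Pool(processes=20) as pool:
    first_match = next(filter(None, pool.imap(query, data_list, chunksize=100)), None)

print(first_match)

For I/O-bound work like this you can usually afford more worker threads than CPU cores, because each worker mostly sleeps while the database responds.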
Hope this helps in some way.
Comments:
- rslt is not None. If I get any result then I don't want to continue the loop.
- imap is a lazy version of map. In my example we wrap it with filter, which is also an iterator, and by calling next on that we only work on the minimal number of chunks that will produce the first result. And also, by the way, there is no actual loop.
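To see the early-exit behaviour the comment describes, here is a small self-contained demo with made-up names (slow_check and the 0.1 second sleep are stand-ins for the real 30-50 second query):

import time
from multiprocessing.dummy import Pool

def slow_check(data):
    time.sleep(0.1)                      # stand-in for the slow database call
    return data if data == 5 else None   # only one element "matches"

data_list = list(range(1_000))

with Pool(processes=10) as pool:
    # next() consumes results in order and stops at the first truthy one,
    # so we do not wait for all 1,000 items to be processed
    found = next(filter(None, pool.imap(slow_check, data_list, chunksize=10)), None)

print(found)  # 5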