
Despite all the seemingly similar questions and answers, here goes:

I have a fairly large 2D numpy array and would like to process it row by row using multiprocessing. For each row I need to find specific (numeric) values and use them to set values in a second 2D numpy array. A small example (the real use case is an array with approx. 10000x10000 cells):

import numpy as np
inarray = np.array([(1.5,2,3), (4,5.1,6), (2.7, 4.8, 4.3)])
outarray = np.array([(0.0,0.0,0.0), (0.0,0.0,0.0), (0.0,0.0,0.0)])

I would now like to process inarray row by row using multiprocessing, to find all the cells in inarray that are greater than 5 (e.g. inarray[1,1] and inarray[1,2]), and set the cells in outarray whose index locations are one smaller in both dimensions (e.g. outarray[0,0] and outarray[0,1]) to 1.
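For concreteness, here is a plain single-process loop that produces the result I'm after on the small example above (just to pin down the mapping, not what I want to run on the full array):

```python
import numpy as np

inarray = np.array([(1.5, 2, 3), (4, 5.1, 6), (2.7, 4.8, 4.3)])
outarray = np.zeros((3, 3))

# inarray[1, 1] = 5.1 and inarray[1, 2] = 6 exceed 5, so outarray[0, 0]
# and outarray[0, 1] should become 1; everything else stays 0.
for i in range(1, inarray.shape[0]):
    for j in range(1, inarray.shape[1]):
        if inarray[i, j] > 5:
            outarray[i - 1, j - 1] = 1
```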

After looking here and here and here I'm sad to say I still don't know how to do it. Help!

asked May 13, 2014 at 22:42
  • So if I find an index in helper = inarray[1:,1:], it will be the same index in outarray... right? Commented May 13, 2014 at 22:57

2 Answers


If you can use the latest numpy development version, then you can use multithreading instead of multiprocessing. Since this PR was merged a couple of months ago, numpy releases the GIL when indexing, so you can do something like:

import numpy as np
import threading

def target(in_, out):
    # Mark positions in `out` where the corresponding `in_` cell exceeds
    # the threshold; the slices passed in below handle the one-cell offset.
    out[in_ > .5] = 1

def multi_threaded(a, thread_count=3):
    b = np.zeros_like(a)
    chunk = len(a) // thread_count
    threads = []
    for j in xrange(thread_count):
        # Rows 1.. of `a`, split into one chunk per thread (the last
        # thread takes any leftover rows)...
        sl_a = slice(1 + chunk*j,
                     a.shape[0] if j == thread_count-1 else 1 + chunk*(j+1),
                     None)
        # ...and the matching rows of `b`, shifted up by one.
        sl_b = slice(sl_a.start-1, sl_a.stop-1, None)
        threads.append(threading.Thread(target=target,
                                        args=(a[sl_a, 1:], b[sl_b, :-1])))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b

And now do things like:

In [32]: a = np.random.rand(100, 100000)
In [33]: %timeit multi_threaded(a, 1)
1 loops, best of 3: 121 ms per loop
In [34]: %timeit multi_threaded(a, 2)
10 loops, best of 3: 86.6 ms per loop
In [35]: %timeit multi_threaded(a, 3)
10 loops, best of 3: 79.4 ms per loop
answered May 14, 2014 at 1:15


I don't think multiprocessing is the right call here, because you would have multiple processes modifying the same object, which is not a good idea. I get that it would be nice to find the indexes via multiple processes, but as far as I know, sending the data to another process means the object gets pickled internally.

Please try this and tell us if it is very slow:

outarray[:-1, :-1][inarray[1:, 1:] > 5] = 1
outarray
array([[ 1.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
answered May 13, 2014 at 23:10
