Problem description:
I have a numpy array cps
from which I want to extract the maximum value in each of the
subsets of the array (a subset is defined as non-zero values which fall between zeros). The input and how the output should look like are illustrated in the following figure.
Note: cps
only contains positive values and zeros.
The input and the desired output
for the moment, I am doing this as below:
## cps is a numpy array which contains values as observed in the figure
idx = np.nonzero(cps)
subsets = np.array_split(idx[0],np.where(np.diff(idx[0])!=1)[0]+1)
lst = np.full(len(cps), 0)
for subset in subsets:
min = subset[0] ## index of the start of subset (the smallest index number)
max = subset[-1] ## index of the end of subset (the largest index number)
index_of_change_point_with_max_value = min + np.argmax(cps[min:max+1])
lst[index_of_change_point_with_max_value] = cps[index_of_change_point_with_max_value]
However, I was wondering if there is anyway to do this in a vectorized manner and getting rid of the for loop?
2 Answers 2
I found a neat way to do the trick by using the argrelmax() function from scipy.signal
module. It seems that it does not increase the performance in my code, however, the code becomes more readable.
The solution:
import scipy.signal as sig
lst = np.full(len(cps), 0)
indices = sig.argrelmax(cps)
lst[indices] = cps[indices]
-
\$\begingroup\$ May not work for subarrays that contain only negative numbers, as 0 is greater than negative numbers. \$\endgroup\$wei2912– wei29122014年10月02日 12:54:40 +00:00Commented Oct 2, 2014 at 12:54
-
\$\begingroup\$ True. But luckily, the
cps
array in my case only contains positive values! \$\endgroup\$Dataman– Dataman2014年10月02日 13:00:42 +00:00Commented Oct 2, 2014 at 13:00 -
\$\begingroup\$ Ahh, next time pls state that in your question. :) The subset being defined as "non-zero values that fall between zeros" was misleading. \$\endgroup\$wei2912– wei29122014年10月02日 13:03:14 +00:00Commented Oct 2, 2014 at 13:03
-
\$\begingroup\$ Sure! I was thinking that it was obvious from the illustration! \$\endgroup\$Dataman– Dataman2014年10月02日 13:13:38 +00:00Commented Oct 2, 2014 at 13:13
Single pass algorithm (~2n)
You can do it with a single pass over the array (which I shall denote as lst
).
- Maintain a maximum value (I shall denote this as
max_val
) once you hit a non-zero value. When we reach a zero, add(i, max_val)
to the list of maximum values and reset the maximum value. - Build the new list using the list of maximum values.
This should give you pretty good performance (~2n since you iterate 2 times). However, if you need a parallel version...
Parallel algorithm
This is the first parallel algorithm I've ever designed, so pardon me if I screwed up somewhere. We use the MapReduce model which is quite simple to write.
Let k
be the number of processes which you want to divide the work among and i
be the number of each process where 0 <= i < k
.
In the Map()
procedure:
- Each process starts at
n/k * i
indice and iterates till it hitsn/k * (i+1)
. - Follow the algorithm as described in step 1 of the single pass algorithm.
- If the first or last value is non-zero, it indicates that the values we have obtained may not be the maximum values of the subset as we may have only monitored a portion of the subset. Hence, we add it to the portions list.
When we flush, we mark the start and end indice that we have scanned and the maximum value, so as to indicate that we have not finished scanning that subarray.
For k = 3
and our array being [0, 0, 0, 0, 1] + [2, 3, 4, 0, 0] + [0, 2, 1, 0, 2]
with a length of 15 (split into 3 segments for convenience):
i = 0: lst[0] = 0
lst[1] = 0
lst[2] = 0
lst[3] = 0
lst[4] = 1 # is_portion = True, max_val = 1
# add (4, 5, (4, 1)) to the list of portions
return [] and [(4, 5, (4, 1))]
i = 1: lst[5] = 2 # is_portion = True, max_val = 2
lst[6] = 3 # max_val = 3
lst[7] = 4 # max_val = 4
lst[8] = 0 # add (5, 8, (8, 4)) to the list of portions
lst[9] = 0
return [] and [(5, 8, (8, 4))]
i = 2: lst[10] = 0
lst[11] = 2 # max_val = 2
lst[12] = 1
lst[13] = 0 # add (11, 2) to the list of maximum values
lst[14] = 2 # is_portion = True, max_val = 2
# add (14, 15, (14, 2)) to the list of portions
return [(11, 2)] and [(14, 15, (14, 2))]
In the Reduce()
procedure, we merge the arrays we have together to form two lists:
1. Maximum values: [(11, 2)]
2. Portions not yet finished scanning: [(4, 5, (4, 1)), (5, 8, (8, 4)), (14, 15, (14, 2))]
We merge the tuples (4, 5, (4, 1)) and (5, 8, (8, 4)) together and compare the maximum value found. (8, 4) > (4, 1), so we get a maximum value (8, 4).
The last tuple, (14, 15, (14, 2)), ends at the boundary of the array, so we immediately return the maximum value (14, 2).
We now get [(11, 2), (8, 4), (14, 2)]
. We build the list using step 2 and we're done!
The time complexity should be around ~2n if you choose a good k
value.
-
\$\begingroup\$ I am going to implement your first idea and do some benchmarking and see which one is faster. I was considering it even before but I was ruling it out since I was thinking it is implemented by a for loop, it would never be any faster! \$\endgroup\$Dataman– Dataman2014年10月02日 10:58:08 +00:00Commented Oct 2, 2014 at 10:58
-
\$\begingroup\$ @Dataman My guess is that your code will be less performant as it involves the copying of data and has the same (or greater) time complexity. I may be mistaken, though. \$\endgroup\$wei2912– wei29122014年10月02日 11:06:17 +00:00Commented Oct 2, 2014 at 11:06
subsets
and there's nofull
function innumpy
, apparently. \$\endgroup\$np.full()
is available in version 1.8 and above. I am using version 1.8.1. Also, I checked the syntax; it should work! There is no missing brackets! (Works just fine on my system!) \$\endgroup\$cps
array. \$\endgroup\$