This function takes a list/array of booleans and converts it to an array that counts how many consecutive True or False values appear next to each other.
I'd like to see this optimized for performance. It's not too slow, but I do use multiple loops with embedded if-else statements, and I'm wondering whether they're absolutely necessary.
import numpy as np
x = np.random.uniform(1,100,100)
b = x > x.mean()
#function start, input is b
endarray = []
count = 0
instance = True
while True:
    subarray = 0
    while True:
        if count >= len(b):
            endarray.append(subarray)
            break
        if b[count] == instance:
            subarray += 1
            count += 1
        else:
            endarray.append(subarray)
            instance = not instance
            break
    if count >= len(b):
        break
if len(endarray) % 2 != 0:
    endarray = np.append(endarray, 0)
else:
    endarray = np.asarray(endarray)
endarray = endarray.reshape(-1,2)
The output is an Nx2 array, where the left-hand values are always a count of consecutive Trues and the right-hand values are always a count of consecutive Falses.
Once a run of False values is broken (a True value pops up), the next count of True values begins, and vice versa.
Example input
b
Out[31]:
array([ True, True, True, False, True, True, True, True, False,
False, True, False, False, True, False, False, False, False,
True, False, False, False, True, True, True, True, False,
False, True, False, False, False, False, False, False, True,
True, False, True, True, False, False, True, False, False,
True, False, False, True, False, True, False, True, False,
True, True, True, False, True, False, True, True, True,
True, False, False, True, False, True, True, True, True,
True, True, False, True, True, False, True, True, False,
False, True, False, True, False, False, True, True, True,
True, False, False, False, False, False, True, True, True,
True])
Example output
endarray
Out[32]:
array([[3, 1],
[4, 2],
[1, 2],
[1, 4],
[1, 3],
[4, 2],
[1, 6],
[2, 1],
[2, 2],
[1, 2],
[1, 2],
[1, 1],
[1, 1],
[1, 1],
[3, 1],
[1, 1],
[4, 2],
[1, 1],
[6, 1],
[2, 1],
[2, 2],
[1, 1],
[1, 2],
[4, 5],
[4, 0]])
Edit: I wanted to add an updated version of this code. The one in the answer below is not technically correct in all regards, but this was entirely derived from it:
m = np.append(b[0], np.diff(b))
_, c = np.unique(m.cumsum(), return_index=True)
out = np.diff(np.append(c, len(b)))
if b[0] == False:
    out = np.append(0, out)
if len(out) % 2:
    out = np.append(out, 0)
out = out.reshape(-1, 2)
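For reference, here is that same snippet wrapped into a function so it's easier to reuse and to compare against the loop version. This is just a sketch; the function name and the empty-input guard are my own additions.
def runs_true_false(b):
    """Return an Nx2 array of [True-run length, False-run length] pairs."""
    b = np.asarray(b, dtype=bool)
    if b.size == 0:  # my own guard: nothing to count
        return np.empty((0, 2), dtype=int)
    m = np.append(b[0], np.diff(b))      # True at the start of every run
    _, c = np.unique(m.cumsum(), return_index=True)
    out = np.diff(np.append(c, len(b)))  # run lengths
    if not b[0]:                         # leading False run: pad a 0 True-count
        out = np.append(0, out)
    if len(out) % 2:                     # trailing True run: pad a 0 False-count
        out = np.append(out, 0)
    return out.reshape(-1, 2)
print(runs_true_false(b))  # should match the endarray produced by the loop version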
1 Answer
Using itertools.groupby
What you are looking for is itertools.groupby. When there is an odd number of groups, we handle it with a try-except block.
from itertools import groupby

get_grp_len = lambda grp: len([*grp])

def transform(b):
    if len(b) == 0:  # `if not b` wouldn't work since your `b` is an ndarray
        return []
    it = groupby(b)
    out = []
    for _, grp in it:
        try:
            t_size = get_grp_len(grp)
            f_size = get_grp_len(next(it)[1])
            out.append([t_size, f_size])
        except StopIteration:
            out.append([t_size, 0])
    return out
print(transform(b)) # `b` taken from the question itself.
# same output as expected output posted in question.
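If groupby is new to you, here is a tiny illustration of how it collapses consecutive equal values into (key, group) pairs. The toy input below is my own, not the question's b:
from itertools import groupby
demo = [True, True, False, True]
print([(key, len(list(grp))) for key, grp in groupby(demo)])
# [(True, 2), (False, 1), (True, 1)]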
NumPy has a lot of vectorized operations
NumPy has many vectorized operations; you just have to find the right ones. I'm not an expert, but the approach below should do well.
The idea here is to find the index of the first value of each group and then take the differences.
- Check whether each element differs from the previous one; this marks the start of every group.
- Use np.ndarray.cumsum to give each group a unique sequential number.
- Use np.unique to get the index of the first value of each group.
- Take the difference between consecutive first indices to get the size of each group; np.diff does this.
m = b != (np.r_[np.nan, b[:-1]])
_, c = np.unique(m.cumsum(), return_index=True)
# print(c)
# array([ 0,  3,  4,  8, 10, 11, 13, 14, 18, 19, 22, 26, 28, 29, 35, 37, 38,
#        40, 42, 43, 45, 46, 48, 49, 50, 51, 52, 53, 54, 57, 58, 59, 60, 64,
#        66, 67, 68, 74, 75, 77, 78, 80, 82, 83, 84, 85, 87, 91, 96])
# np.unique gives the index of the first occurrence; our `b` has length 100
# and `c` stops at 96, so we need to append the last index + 1 to it.
out = np.diff(np.r_[c, len(b)])
# Now reshape the array; if it has odd length, add a 0 at the end.
if len(out) % 2:
    out = np.r_[out, 0].reshape(-1, 2)
out = out.reshape(-1, 2)
print(out)
# same output as mentioned in the question.
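To make the intermediate arrays concrete, here is the same pipeline run on a toy input (my own, not the question's b):
demo = np.array([True, True, False, True])
m = demo != np.r_[np.nan, demo[:-1]]      # [ True, False,  True,  True] -> run starts
ids = m.cumsum()                          # [1, 1, 2, 3] -> one id per run
_, c = np.unique(ids, return_index=True)  # c = [0, 2, 3] -> first index of each run
print(np.diff(np.r_[c, len(demo)]))       # [2 1 1] -> run lengths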
Using Pandas's GroupBy
A lot of the time pandas and NumPy are used together, so I might as well post pandas code too.
import pandas as pd

s = pd.Series(b)
g = s.ne(s.shift()).cumsum()
out = s.groupby(g).size().to_numpy()
# Repeat the same step as in the solution above: add a 0 if the length is odd.
if len(out) % 2:
    out = np.r_[out, 0].reshape(-1, 2)
out = out.reshape(-1, 2)
print(out)
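For intuition, s.ne(s.shift()).cumsum() labels every run of equal values with its own integer. A toy example (my own, not the question's b):
demo = pd.Series([True, True, False, True])
labels = demo.ne(demo.shift()).cumsum()
print(labels.tolist())                       # [1, 1, 2, 3] -> one label per run
print(demo.groupby(labels).size().tolist())  # [2, 1, 1] -> run lengths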
Code Review
Everything's good as far as I can see: well-named variables, proper indentation, but too many ifs :p
- Nicely done! The numpy code is ~3x faster than what I made. The if statement in the numpy code also needs to be reversed. – Estif, Nov 18, 2020 at 20:57
- @Estif np.r_ is written in pure Python; to make it a little faster, replace np.r_ with np.hstack. – Ch3steR, Nov 19, 2020 at 4:28
- I believe np.append is actually a bit faster in this case. – Estif, Nov 19, 2020 at 21:02
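If you want to check micro-differences like np.r_ vs np.hstack vs np.append yourself, here is a quick timing sketch; results depend on your machine and array size, so treat it as an illustration rather than a benchmark:
import numpy as np
from timeit import timeit

out = np.arange(49)  # roughly the size of the run-length array from the question
print(timeit(lambda: np.r_[out, 0], number=10_000))
print(timeit(lambda: np.hstack([out, 0]), number=10_000))
print(timeit(lambda: np.append(out, 0), number=10_000))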