
I have a DataFrame and, for each row, need to count how many leading columns (starting from the first) have the same sign as the first column, then multiply that count by the sign of the first column.
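To make the rule concrete, here is a minimal illustration on one made-up row: the run of leading sign matches is counted, then multiplied by the first column's sign.

import numpy as np

# Hypothetical row: the first value is positive and the next three
# values share its sign before the run breaks at -0.3.
row = np.sign(np.array([0.5, 0.2, 0.9, 0.1, -0.3]))
count = 1
for s in row[1:]:
    if s != row[0]:
        break
    count += 1
print(count * row[0])  # 4.0: four leading columns match, times sign +1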

What I need to speed up is the calc_df function, which runs like this on my computer:

%timeit calc_df(df)
6.38 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The output of my code is:

 a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 a_9
0 0.097627 0.430379 0.205527 0.089766 -0.152690 0.291788 -0.124826 0.783546 0.927326 -0.233117
1 0.583450 0.057790 0.136089 0.851193 -0.857928 -0.825741 -0.959563 0.665240 0.556314 0.740024
2 0.957237 0.598317 -0.077041 0.561058 -0.763451 0.279842 -0.713293 0.889338 0.043697 -0.170676
3 -0.470889 0.548467 -0.087699 0.136868 -0.962420 0.235271 0.224191 0.233868 0.887496 0.363641
4 -0.280984 -0.125936 0.395262 -0.879549 0.333533 0.341276 -0.579235 -0.742147 -0.369143 -0.272578
0 4.0
1 4.0
2 2.0
3 -1.0
4 -2.0

My code is as follows; the generate_data function generates demo data matching my actual data volume.

import numpy as np
import pandas as pd
from numba import njit

np.random.seed(0)
pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)


# This function generates demo data.
def generate_data() -> pd.DataFrame:
    col = [f'a_{x}' for x in range(10)]
    df = pd.DataFrame(data=np.random.uniform(-1, 1, [280000, 10]), columns=col)
    return df


@njit
def calc_numba(s: np.ndarray) -> float:
    # Count the leading run of entries matching the first entry's sign,
    # then multiply the run length by that sign.
    a = s[0]
    b = 1
    for sign in s[1:]:
        if sign == a:
            b += 1
        else:
            break
    b *= a
    return b


def calc_series(s: pd.Series) -> float:
    return calc_numba(s.to_numpy())


def calc_df(df: pd.DataFrame) -> pd.DataFrame:
    df1 = np.sign(df)
    df['count'] = df1.apply(calc_series, axis=1)
    return df


def main() -> None:
    df = generate_data()
    print(df.head(5))
    df = calc_df(df)
    print(df['count'].head(5))
    return


if __name__ == '__main__':
    main()
asked Aug 25, 2022 at 5:19

2 Answers


Just avoiding switching between numpy arrays and pandas DataFrames so much can grant you an easy 4x speedup. For reference, your original calc_df runs like this on my machine:

def calc_df(df: pd.DataFrame) -> pd.DataFrame:
    df1 = np.sign(df)
    df['count'] = df1.apply(calc_series, axis=1)
    return df

%%timeit df = generate_data()
calc_df(df)

1.6 s ± 75.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With less casting from pandas to numpy:

def optimized_calc_df(df: pd.DataFrame) -> pd.DataFrame:
    array = np.sign(df.to_numpy())
    df['count'] = np.apply_along_axis(calc_numba, 1, array)
    return df

%%timeit odf = generate_data()
optimized_calc_df(odf)

415 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
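Note that np.apply_along_axis still iterates over rows at Python speed. If you want to keep numba, a further step (not part of the original answer, just a sketch) is to move the row loop inside the jitted function so the whole 2D array is processed in one compiled pass:

from numba import njit
import numpy as np

@njit
def calc_all_rows(signs: np.ndarray) -> np.ndarray:
    # signs: 2D array of -1/0/+1 values; one jitted pass over all rows.
    out = np.empty(signs.shape[0])
    for i in range(signs.shape[0]):
        a = signs[i, 0]
        b = 1
        for j in range(1, signs.shape[1]):
            if signs[i, j] != a:
                break
            b += 1
        out[i] = b * a
    return out

# Usage: df['count'] = calc_all_rows(np.sign(df.to_numpy()))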

For comparison, here is the vectorised mask approach from the other answer, first in pandas and then translated to pure numpy:

import numpy as np
import pandas as pd
from numba import njit

np.random.seed(0)
pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)


# This function generates demo data.
def generate_data() -> pd.DataFrame:
    col = [f'a_{x}' for x in range(10)]
    np.random.seed(0)
    df = pd.DataFrame(data=np.random.uniform(-1, 1, [280000, 10]), columns=col)
    return df


# This function generates the same demo data as a plain numpy array.
def generate_data_array() -> np.ndarray:
    np.random.seed(0)
    return np.random.uniform(-1, 1, [280000, 10])
%%timeit df = generate_data()
df1 = np.sign(df)
m = df1.eq(df1.iloc[:, 0], axis=0).cummin(1)
out_df = m.sum(1) * df1.iloc[:, 0]

76.6 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
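To see why this works: eq marks the columns whose sign matches column 0, and cummin along axis 1 turns the first False into a permanent False, so only the leading run survives. A minimal sketch on made-up values:

import numpy as np
import pandas as pd

toy = pd.DataFrame([[ 1.0,  1.0, -1.0,  1.0],
                    [-1.0, -1.0, -1.0,  1.0]])
m = toy.eq(toy.iloc[:, 0], axis=0)       # True where sign matches column 0
run = m.cummin(axis=1)                   # first mismatch zeroes everything after it
print(run.sum(axis=1) * toy.iloc[:, 0])  # 2.0 and -3.0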
%%timeit array = generate_data_array()
array2 = np.sign(array)
array3 = np.minimum.accumulate(np.equal(array2, np.expand_dims(array2[:, 0], axis=1)), 1)
out_array = array3.sum(axis=1) * array2[:, 0]

44.3 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Aug 25, 2022 at 7:37

Someone answered me this way:

You can use vectorised code here.

For example with a mask:

df1 = np.sign(df)
m = df1.eq(df1.iloc[:, 0], axis=0).cummin(1)  # mask of the leading run of matching signs
out = df1.where(m).sum(1)                     # summing the masked signs gives count * sign

Output (first 5 rows):

0 4.0
1 4.0
2 2.0
3 -1.0
4 -2.0
dtype: float64

Time to run on whole data:

269 ms ± 37.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Faster alternative:

df1 = np.sign(df)
m = df1.eq(df1.iloc[:, 0], axis=0).cummin(1)
out = m.sum(1) * df1.iloc[:, 0]

148 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And you can probably do even better with pure numpy (you have to write a cummin equivalent).
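For what it's worth, np.minimum.accumulate can play the role of that cummin equivalent, as the other answer demonstrates. A minimal self-contained sketch:

import numpy as np

rng = np.random.default_rng(0)
signs = np.sign(rng.uniform(-1, 1, (5, 10)))
# np.minimum.accumulate along axis 1 is the numpy counterpart of pandas' cummin:
# once a comparison is False, everything after it stays False.
mask = np.minimum.accumulate(signs == signs[:, :1], axis=1)
out = mask.sum(axis=1) * signs[:, 0]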

answered Aug 25, 2022 at 7:46