
I have a DataFrame and, for each row, need to count how many leading columns (starting from the first) have the same sign as the first column, then multiply that count by the sign of the first column.
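To make the rule concrete, here is a minimal illustration on one made-up row: the run of leading sign matches is counted, then multiplied by the first column's sign.

import numpy as np

# Hypothetical row: the first value is positive and the next three
# values share its sign before the run breaks at -0.3.
row = np.sign(np.array([0.5, 0.2, 0.9, 0.1, -0.3]))
count = 1
for s in row[1:]:
    if s != row[0]:
        break
    count += 1
print(count * row[0])  # 4.0: four leading columns match, times sign +1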

What I need to speed up is the calc_df function, which runs like this on my computer:

%timeit calc_df(df)
6.38 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The output of my code is:

 a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 a_9
0 0.097627 0.430379 0.205527 0.089766 -0.152690 0.291788 -0.124826 0.783546 0.927326 -0.233117
1 0.583450 0.057790 0.136089 0.851193 -0.857928 -0.825741 -0.959563 0.665240 0.556314 0.740024
2 0.957237 0.598317 -0.077041 0.561058 -0.763451 0.279842 -0.713293 0.889338 0.043697 -0.170676
3 -0.470889 0.548467 -0.087699 0.136868 -0.962420 0.235271 0.224191 0.233868 0.887496 0.363641
4 -0.280984 -0.125936 0.395262 -0.879549 0.333533 0.341276 -0.579235 -0.742147 -0.369143 -0.272578
0 4.0
1 4.0
2 2.0
3 -1.0
4 -2.0

My code is as follows; the generate_data function generates demo data matching my actual data volume.

import numpy as np
import pandas as pd
from numba import njit

np.random.seed(0)
pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)


# This function generates demo data.
def generate_data() -> pd.DataFrame:
    col = [f'a_{x}' for x in range(10)]
    df = pd.DataFrame(data=np.random.uniform(-1, 1, [280000, 10]), columns=col)
    return df


@njit
def calc_numba(s: np.ndarray) -> float:
    # Count the leading run of entries matching the first entry's sign,
    # then multiply the run length by that sign.
    a = s[0]
    b = 1
    for sign in s[1:]:
        if sign == a:
            b += 1
        else:
            break
    b *= a
    return b


def calc_series(s: pd.Series) -> float:
    return calc_numba(s.to_numpy())


def calc_df(df: pd.DataFrame) -> pd.DataFrame:
    df1 = np.sign(df)
    df['count'] = df1.apply(calc_series, axis=1)
    return df


def main() -> None:
    df = generate_data()
    print(df.head(5))
    df = calc_df(df)
    print(df['count'].head(5))
    return


if __name__ == '__main__':
    main()
asked Aug 25, 2022 at 5:19

2 Answers


Just avoiding switching between numpy arrays and pandas DataFrames so much can grant you an easy 4x speedup. For reference, your original calc_df runs like this on my machine:

def calc_df(df: pd.DataFrame) -> pd.DataFrame:
    df1 = np.sign(df)
    df['count'] = df1.apply(calc_series, axis=1)
    return df

%%timeit df = generate_data()
calc_df(df)

1.6 s ± 75.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With less casting from pandas to numpy:

def optimized_calc_df(df: pd.DataFrame) -> pd.DataFrame:
    array = np.sign(df.to_numpy())
    df['count'] = np.apply_along_axis(calc_numba, 1, array)
    return df

%%timeit odf = generate_data()
optimized_calc_df(odf)

415 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
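Note that np.apply_along_axis still iterates over rows at Python speed. If you want to keep numba, a further step (not part of the original answer, just a sketch) is to move the row loop inside the jitted function so the whole 2D array is processed in one compiled pass:

from numba import njit
import numpy as np

@njit
def calc_all_rows(signs: np.ndarray) -> np.ndarray:
    # signs: 2D array of -1/0/+1 values; one jitted pass over all rows.
    out = np.empty(signs.shape[0])
    for i in range(signs.shape[0]):
        a = signs[i, 0]
        b = 1
        for j in range(1, signs.shape[1]):
            if signs[i, j] != a:
                break
            b += 1
        out[i] = b * a
    return out

# Usage: df['count'] = calc_all_rows(np.sign(df.to_numpy()))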

For comparison, here is the vectorised mask approach from the other answer, first in pandas and then translated to pure numpy:

import numpy as np
import pandas as pd
from numba import njit

np.random.seed(0)
pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)


# This function generates demo data.
def generate_data() -> pd.DataFrame:
    col = [f'a_{x}' for x in range(10)]
    np.random.seed(0)
    df = pd.DataFrame(data=np.random.uniform(-1, 1, [280000, 10]), columns=col)
    return df


# This function generates the same demo data as a plain numpy array.
def generate_data_array() -> np.ndarray:
    np.random.seed(0)
    return np.random.uniform(-1, 1, [280000, 10])
%%timeit df = generate_data()
df1 = np.sign(df)
m = df1.eq(df1.iloc[:, 0], axis=0).cummin(1)
out_df = m.sum(1) * df1.iloc[:, 0]

76.6 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
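To see why this works: eq marks the columns whose sign matches column 0, and cummin along axis 1 turns the first False into a permanent False, so only the leading run survives. A minimal sketch on made-up values:

import numpy as np
import pandas as pd

toy = pd.DataFrame([[ 1.0,  1.0, -1.0,  1.0],
                    [-1.0, -1.0, -1.0,  1.0]])
m = toy.eq(toy.iloc[:, 0], axis=0)       # True where sign matches column 0
run = m.cummin(axis=1)                   # first mismatch zeroes everything after it
print(run.sum(axis=1) * toy.iloc[:, 0])  # 2.0 and -3.0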
%%timeit array = generate_data_array()
array2 = np.sign(array)
array3 = np.minimum.accumulate(np.equal(array2, np.expand_dims(array2[:, 0], axis=1)), 1)
out_array = array3.sum(axis=1) * array2[:, 0]

44.3 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Aug 25, 2022 at 7:37

Someone answered me this way:

You can use vectorised code here.

For example with a mask:

df1 = np.sign(df)
m = df1.eq(df1.iloc[:, 0], axis=0).cummin(1)  # mask of the leading run of matching signs
out = df1.where(m).sum(1)                     # summing the masked signs gives count * sign

Output (first 5 rows):

0 4.0
1 4.0
2 2.0
3 -1.0
4 -2.0
dtype: float64

Time to run on whole data:

269 ms ± 37.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Faster alternative:

df1 = np.sign(df)
m = df1.eq(df1.iloc[:, 0], axis=0).cummin(1)
out = m.sum(1) * df1.iloc[:, 0]

148 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And you can probably do even better with pure numpy (you have to write a cummin equivalent).
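For what it's worth, np.minimum.accumulate can play the role of that cummin equivalent, as the other answer demonstrates. A minimal self-contained sketch:

import numpy as np

rng = np.random.default_rng(0)
signs = np.sign(rng.uniform(-1, 1, (5, 10)))
# np.minimum.accumulate along axis 1 is the numpy counterpart of pandas' cummin:
# once a comparison is False, everything after it stays False.
mask = np.minimum.accumulate(signs == signs[:, :1], axis=1)
out = mask.sum(axis=1) * signs[:, 0]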

answered Aug 25, 2022 at 7:46