Groupby, apply custom function to data, return results in new columns

Question 1

I want to group by id, apply a custom function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.

Example

import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})

def my_func(data):
    # Difference from the previous row within the group; the first row of each group gets 0.
    data['diff'] = (data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0]))
    return data

dat.groupby('id').apply(my_func)

Output

  id   x  diff
0  a   4     0
1  a   8     4
2  a  12     4
3  b  25     0
4  b  30     5
5  b  50    20

Is there a more efficient way to do this?

aghast · Answer 1 · 2019-03-15 03:32:14Z

I tried a few variations on your code. I was surprised at how performant the groupby approach really was!

I enlarged your test data so that the fixed per-call overhead is amortized over more rows. Surprisingly, that overhead accounts for much of the difference: when I pushed the array length much higher, the gaps between the groupby-based alternatives became very small.

That said, there are some things you can do to speed things up. Timings are total seconds for 1000 calls on a 10,000-row frame:

original: 18.022180362
org_prefill: 14.969489811999996
unique_keys: 23.526526202000007
groupby_return: 15.557421341999998
groupby_prefill: 15.846651952999991
shifty: 9.605120361000004

I tried moving away from groupby by iterating over the distinct key values, but that didn't pay off: performance got worse (unique_keys). Next I tried returning only the computed Series from the groupby function instead of the whole group, hoping to eliminate some duplicated effort; that is groupby_return. For small sizes, where overhead is more of a factor, I got a tiny speed boost by pre-filling the result column before running the groupby. That's groupby_prefill, and org_prefill is the same trick back-ported to the original code. You can see that pre-filling pays off versus the original code, but not against groupby_return.

Finally, I eliminated the groupby entirely, by figuring out how to detect the start of a group using .shift(). Then I computed a shifted-by-one series and did the subtract operation as a single expression. That's shifty and it's the most performant by a bunch. W00t!

import timeit

import numpy as np
import pandas as pd


def make_df():
    n = 10_000
    df = pd.DataFrame({'id': ['a']*(n//2) + ['b']*(n//2),
                       'x': np.random.randn(n)})
    return df


def original(df):
    # Baseline: the question's approach, returning the whole modified group.
    def my_func(group):
        group['diff'] = (group['x'] - group['x'].shift(
            1, fill_value=group['x'].iat[0]))
        return group

    df.groupby('id').apply(my_func)


def org_prefill(df):
    # Same as original, but the 'diff' column exists before the groupby runs.
    def my_func(group):
        group['diff'] = (group['x'] - group['x'].shift(
            1, fill_value=group['x'].iat[0]))
        return group

    df['diff'] = df['x']
    df.groupby('id').apply(my_func)


def unique_keys(df):
    # No groupby: loop over the distinct ids and fill 'diff' with a boolean mask.
    df['diff'] = 0
    for key in df.id.unique():
        matches = (df.id == key)
        df_lmx = df.loc[matches, 'x']
        df.loc[matches, 'diff'] = df_lmx - df_lmx.shift(1, fill_value=df_lmx.iat[0])


def groupby_iter(df):
    # The result is assigned to a per-group copy and never written back to df;
    # this variant is not timed below.
    for key, subset in df.groupby('id'):
        subset['diff'] = subset.x - subset.x.shift(1,
                                                   fill_value=subset.x.iat[0])


def groupby_return(df):
    # Return only the computed Series from each group, then assign it once.
    def my_func2(group):
        gx = group['x']
        result = gx - gx.shift(1, fill_value=gx.iat[0])
        return result

    res = df.groupby('id').apply(my_func2)
    # Positional assignment: relies on the rows already being ordered by group.
    df['diff'] = res.values


def groupby_prefill(df):
    # groupby_return plus a pre-filled 'diff' column.
    def my_func2(group):
        gx = group['x']
        result = gx - gx.shift(1, fill_value=gx.iat[0])
        return result

    df['diff'] = df['x']
    res = df.groupby('id').apply(my_func2)
    df['diff'] = res.values


def shifty(df):
    # No groupby at all: shift the whole column by one row ...
    shifted = df['x'].shift(fill_value=df['x'].iat[0])
    # ... and wherever the id changes (start of a group), subtract the row from itself.
    shifted.loc[(df.id != df.id.shift())] = df['x']
    df['diff'] = df['x'] - shifted


if __name__ == '__main__':
    kwargs = {
        'globals': globals(),
        'number': 1000,
        'setup': 'df = make_df()',
    }

    print("original:", timeit.timeit('original(df)', **kwargs))
    print("org_prefill:", timeit.timeit('org_prefill(df)', **kwargs))
    print("unique_keys:", timeit.timeit('unique_keys(df)', **kwargs))
    #print("groupby_iter:", timeit.timeit('groupby_iter(df)', **kwargs))
    print("groupby_return:", timeit.timeit('groupby_return(df)', **kwargs))
    print("groupby_prefill:", timeit.timeit('groupby_prefill(df)', **kwargs))
    print("shifty:", timeit.timeit('shifty(df)', **kwargs))
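
Before trusting a faster variant, it is worth confirming that it produces the same numbers as the grouped version. The sketch below is an illustration rather than part of the benchmark above: the function names are hypothetical, it returns new Series instead of mutating the frame, and it adds pandas' built-in grouped diff() as a third option that the timings above do not cover. Note that the shift-based version assumes the rows of each id are adjacent, as they are in the test data.

```python
import pandas as pd


def diff_by_apply(df):
    # Reference result: per-group difference via groupby + apply,
    # with the first row of each group set to 0.
    def per_group(gx):
        return gx - gx.shift(1, fill_value=gx.iat[0])

    out = df.groupby('id', group_keys=False)['x'].apply(per_group)
    return out.sort_index()


def diff_by_shift(df):
    # Shift-based variant: subtract the previous row everywhere, then
    # overwrite the first row of each group with 0.
    # Assumes rows belonging to one id are contiguous.
    prev = df['x'].shift()
    new_group = df['id'] != df['id'].shift()
    return (df['x'] - prev).mask(new_group, 0.0)


def diff_builtin(df):
    # Built-in grouped diff: the first row of each group is NaN, so fill with 0.
    return df.groupby('id')['x'].diff().fillna(0)


# The question's example data.
dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [4, 8, 12, 25, 30, 50]})

for func in (diff_by_apply, diff_by_shift, diff_builtin):
    print(func.__name__, func(dat).tolist())
```

All three should agree on the question's data. If the ids are not sorted, only the two groupby-based functions remain correct.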

run-out · Answer 2 · 2019-03-16 12:40:17Z

You may wish to try numba: convert the DataFrame columns to NumPy arrays and loop over them in a compiled function. I couldn't get it working with string ids, so this version uses numeric ids (run in Jupyter).

import numpy as np
import pandas as pd
from numba import jit

n = 1000
id_arr = np.concatenate((np.tile(1, n//2), np.tile(2, n//2)), axis=None)
df = pd.DataFrame({'id': id_arr,
                   'x': np.random.randn(n)})


@jit(nopython=True)
def calculator_nb(id, x):
    res = np.empty(x.shape)
    res[0] = 0
    for i in range(1, res.shape[0]):
        # Same id as the previous row: take the difference.
        if id[i] == id[i-1]:
            res[i] = x[i] - x[i-1]
        # Otherwise a new group starts, so the difference is 0.
        else:
            res[i] = 0
    return res


%timeit calculator_nb(*df[['id', 'x']].values.T)
459 μs ± 1.85 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
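
For string ids, one possible workaround (not tested with numba here) is to convert the labels to integer codes with pd.factorize before handing the arrays to a numeric routine. The sketch below uses plain NumPy in place of the compiled loop so that it runs without numba installed; grouped_diff_np is a hypothetical name, and it makes the same assumption as the loop above, namely that rows of one id are adjacent.

```python
import numpy as np
import pandas as pd


def grouped_diff_np(ids, x):
    # ids: 1-D integer codes; x: 1-D float array of the same length.
    res = np.zeros(x.shape[0], dtype=np.float64)
    # True where a row has the same id as the row before it.
    same = ids[1:] == ids[:-1]
    # Difference to the previous row inside a group, 0 at each group start.
    res[1:] = np.where(same, x[1:] - x[:-1], 0.0)
    return res


df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'x': [4, 8, 12, 25, 30, 50]})

# pd.factorize returns one integer code per row plus the array of distinct labels.
codes, uniques = pd.factorize(df['id'])
df['diff'] = grouped_diff_np(codes, df['x'].to_numpy(dtype=np.float64))
print(df)
```

The integer codes array could in principle be passed to a function like calculator_nb in place of the numeric id column.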
