
I want to group by id, apply a custom function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.

Example

import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [4, 8, 12, 25, 30, 50]})

def my_func(data):
    data['diff'] = data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0])
    return data

dat.groupby('id').apply(my_func)

Output

   id   x  diff
0   a   4     0
1   a   8     4
2   a  12     4
3   b  25     0
4   b  30     5
5   b  50    20

Is there a more efficient way to do this?

asked Mar 15, 2019 at 0:24

2 Answers


I tried a few variations on your code. I was surprised at how performant the groupby approach really was!

I changed your test data to use more values -- this amortizes the overhead a bit more. Surprisingly, the overhead is a lot of the difference. When I pushed the array length too high, the differences got very small for the groupby-based alternatives.

That said, there are some things you can do to speed things up:

original: 18.022180362
org_prefill: 14.969489811999996
unique_keys: 23.526526202000007
groupby_return: 15.557421341999998
groupby_prefill: 15.846651952999991
shifty: 9.605120361000004

I tried moving away from groupby by iterating over the distinct key values, but that didn't pay off: the performance got worse (unique_keys). I tried playing games with the return value from groupby, hoping to eliminate some duplicated effort. I eventually got that in groupby_return. For small sizes, where the overhead is more of a factor, I got a tiny speed boost by pre-filling the result column before running the groupby. That's groupby_prefill and then org_prefill where I back-ported it. You can see that it pays off versus the original code, but not against the groupby_return code.

Finally, I eliminated the groupby entirely, by figuring out how to detect the start of a group using .shift(). Then I computed a shifted-by-one series and did the subtract operation as a single expression. That's shifty and it's the most performant by a bunch. W00t!
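For illustration, the shifty technique can be run standalone on the question's sample data (a minimal sketch of the same logic as the shifty function in the benchmark code):

```python
import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [4, 8, 12, 25, 30, 50]})

# Shift x down by one row; wherever the id changes (a new group starts),
# replace the shifted value with the row's own x so the difference is 0 there
shifted = dat['x'].shift(fill_value=dat['x'].iat[0])
shifted.loc[dat.id != dat.id.shift()] = dat['x']
dat['diff'] = dat['x'] - shifted
print(dat)  # diff column: 0, 4, 4, 0, 5, 20
```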

import sys
import timeit

import numpy as np
import pandas as pd


def make_df():
    n = 10_000
    df = pd.DataFrame({'id': ['a'] * (n // 2) + ['b'] * (n // 2),
                       'x': np.random.randn(n)})
    return df

def original(df):
    def my_func(group):
        group['diff'] = (group['x'] - group['x'].shift(
            1, fill_value=group['x'].iat[0]))
        return group
    df.groupby('id').apply(my_func)

def org_prefill(df):
    def my_func(group):
        group['diff'] = (group['x'] - group['x'].shift(
            1, fill_value=group['x'].iat[0]))
        return group
    df['diff'] = df['x']
    df.groupby('id').apply(my_func)

def unique_keys(df):
    #print("DF:\n", df)
    df['diff'] = 0
    for key in df.id.unique():
        matches = (df.id == key)
        #df.loc[matches, 'diff'] = df.loc[matches, 'x'] - df.loc[matches, 'x'].shift(1, fill_value=df.loc[matches, 'x'].iat[0])
        df_lmx = df.loc[matches, 'x']
        df.loc[matches, 'diff'] = df_lmx - df_lmx.shift(1, fill_value=df_lmx.iat[0])

def groupby_iter(df):
    for key, subset in df.groupby('id'):
        subset['diff'] = subset.x - subset.x.shift(1,
                                                   fill_value=subset.x.iat[0])

def groupby_return(df):
    def my_func2(group):
        gx = group['x']
        result = gx - gx.shift(1, fill_value=gx.iat[0])
        return result
    res = df.groupby('id').apply(my_func2)
    df['diff'] = res.values

def groupby_prefill(df):
    def my_func2(group):
        gx = group['x']
        result = gx - gx.shift(1, fill_value=gx.iat[0])
        return result
    df['diff'] = df['x']
    res = df.groupby('id').apply(my_func2)
    df['diff'] = res.values

def shifty(df):
    shifted = df['x'].shift(fill_value=df['x'].iat[0])
    shifted.loc[(df.id != df.id.shift())] = df['x']
    df['diff'] = df['x'] - shifted

if __name__ == '__main__':
    kwargs = {
        'globals': globals(),
        'number': 1000,
        'setup': 'df = make_df()',
    }
    print("original:", timeit.timeit('original(df)', **kwargs))
    print("org_prefill:", timeit.timeit('org_prefill(df)', **kwargs))
    print("unique_keys:", timeit.timeit('unique_keys(df)', **kwargs))
    #print("groupby_iter:", timeit.timeit('groupby_iter(df)', **kwargs))
    print("groupby_return:", timeit.timeit('groupby_return(df)', **kwargs))
    print("groupby_prefill:", timeit.timeit('groupby_prefill(df)', **kwargs))
    print("shifty:", timeit.timeit('shifty(df)', **kwargs))
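As an aside (not one of the variants benchmarked above), pandas also ships a built-in per-group diff that expresses the whole computation as a one-liner; it would be worth timing alongside these. A minimal sketch on the question's sample data:

```python
import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [4, 8, 12, 25, 30, 50]})

# GroupBy.diff leaves NaN at each group start, so fill those with 0
dat['diff'] = dat.groupby('id')['x'].diff().fillna(0)
print(dat)  # diff column: 0, 4, 4, 0, 5, 20
```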
answered Mar 15, 2019 at 3:32

You may wish to try numba: turn the DataFrame columns into NumPy arrays and jit-compile a plain loop over them. I couldn't get it working with string ids, though, so here it is with numeric ids (run in Jupyter).

import numpy as np
import pandas as pd
from numba import jit

n = 1000
id_arr = np.concatenate((np.tile(1, n // 2), np.tile(2, n // 2)), axis=None)
df = pd.DataFrame({'id': id_arr,
                   'x': np.random.randn(n)})

@jit(nopython=True)
def calculator_nb(id, x):
    res = np.empty(x.shape)
    res[0] = 0
    for i in range(1, res.shape[0]):
        if id[i] == id[i-1]:
            res[i] = x[i] - x[i-1]
        else:
            res[i] = 0
    return res

%timeit calculator_nb(*df[['id', 'x']].values.T)
459 μs ± 1.85 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
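If numba isn't available, the same loop also runs (slowly) in plain Python, which makes it easy to sanity-check the logic against pandas' built-in groupby diff. A sketch, with a small numeric-id frame of my own invention:

```python
import numpy as np
import pandas as pd

# Un-jitted copy of calculator_nb: identical loop, no numba required,
# so the result can be verified against pandas
def calculator(id_arr, x):
    res = np.empty(x.shape)
    res[0] = 0
    for i in range(1, res.shape[0]):
        res[i] = x[i] - x[i - 1] if id_arr[i] == id_arr[i - 1] else 0
    return res

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'x': [4.0, 8.0, 12.0, 25.0, 30.0, 50.0]})
expected = df.groupby('id')['x'].diff().fillna(0).to_numpy()
assert np.allclose(calculator(df['id'].to_numpy(), df['x'].to_numpy()), expected)
```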
answered Mar 16, 2019 at 12:40