I want to group by id, apply a custom function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.
Example
import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})

def my_func(data):
    # Difference from the previous row; the first row of each group gets 0.
    data['diff'] = (data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0]))
    return data

dat.groupby('id').apply(my_func)
Output
  id   x  diff
0  a   4     0
1  a   8     4
2  a  12     4
3  b  25     0
4  b  30     5
5  b  50    20
Is there a more efficient way to do this?
2 Answers
I tried a few variations on your code. I was surprised at how performant the groupby approach really was!
I changed your test data to use more values -- this amortizes the overhead a bit more. Surprisingly, the overhead is a lot of the difference. When I pushed the array length too high, the differences got very small for the groupby-based alternatives.
That said, there are some things you can do to speed things up:
original: 18.022180362
org_prefill: 14.969489811999996
unique_keys: 23.526526202000007
groupby_return: 15.557421341999998
groupby_prefill: 15.846651952999991
shifty: 9.605120361000004
I tried moving away from groupby by iterating over the distinct key values, but that didn't pay off: the performance got worse (unique_keys). I tried playing games with the return value from groupby, hoping to eliminate some duplicated effort. I eventually got that in groupby_return. For small sizes, where the overhead is more of a factor, I got a tiny speed boost by pre-filling the result column before running the groupby. That's groupby_prefill, and then org_prefill, where I back-ported it. You can see that it pays off versus the original code, but not against the groupby_return code.
Finally, I eliminated the groupby entirely, by figuring out how to detect the start of a group using .shift(). Then I computed a shifted-by-one series and did the subtract operation as a single expression. That's shifty, and it's the most performant by a bunch. W00t!
import sys
import timeit

import numpy as np
import pandas as pd

def make_df():
    # Two equal-sized groups, 'a' then 'b', with random x values.
    n = 10_000
    df = pd.DataFrame({'id': ['a']*(n//2) + ['b']*(n//2),
                       'x': np.random.randn(n)})
    return df

def original(df):
    # The approach from the question.
    def my_func(group):
        group['diff'] = (group['x'] - group['x'].shift(
            1, fill_value=group['x'].iat[0]))
        return group

    df.groupby('id').apply(my_func)

def org_prefill(df):
    # Same as original, but with the result column created up front.
    def my_func(group):
        group['diff'] = (group['x'] - group['x'].shift(
            1, fill_value=group['x'].iat[0]))
        return group

    df['diff'] = df['x']
    df.groupby('id').apply(my_func)

def unique_keys(df):
    # No groupby: loop over the distinct ids with a boolean mask.
    #print("DF:\n", df)
    df['diff'] = 0
    for key in df.id.unique():
        matches = (df.id == key)
        #df.loc[matches, 'diff'] = df.loc[matches, 'x'] - df.loc[matches, 'x'].shift(1, fill_value=df.loc[matches, 'x'].iat[0])
        df_lmx = df.loc[matches, 'x']
        df.loc[matches, 'diff'] = df_lmx - df_lmx.shift(1, fill_value=df_lmx.iat[0])

def groupby_iter(df):
    for key, subset in df.groupby('id'):
        subset['diff'] = subset.x - subset.x.shift(1,
                fill_value=subset.x.iat[0])

def groupby_return(df):
    # Return only the computed Series from each group, assign once at the end.
    def my_func2(group):
        gx = group['x']
        result = gx - gx.shift(1, fill_value=gx.iat[0])
        return result

    res = df.groupby('id').apply(my_func2)
    df['diff'] = res.values

def groupby_prefill(df):
    # groupby_return with the result column created up front.
    def my_func2(group):
        gx = group['x']
        result = gx - gx.shift(1, fill_value=gx.iat[0])
        return result

    df['diff'] = df['x']
    res = df.groupby('id').apply(my_func2)
    df['diff'] = res.values

def shifty(df):
    # No groupby: a row starts a group when its id differs from the previous
    # row's id. This relies on the rows of each id being contiguous.
    shifted = df['x'].shift(fill_value=df['x'].iat[0])
    shifted.loc[(df.id != df.id.shift())] = df['x']
    df['diff'] = df['x'] - shifted

if __name__ == '__main__':
    kwargs = {
        'globals': globals(),
        'number': 1000,
        'setup': 'df = make_df()',
    }

    print("original:", timeit.timeit('original(df)', **kwargs))
    print("org_prefill:", timeit.timeit('org_prefill(df)', **kwargs))
    print("unique_keys:", timeit.timeit('unique_keys(df)', **kwargs))
    #print("groupby_iter:", timeit.timeit('groupby_iter(df)', **kwargs))
    print("groupby_return:", timeit.timeit('groupby_return(df)', **kwargs))
    print("groupby_prefill:", timeit.timeit('groupby_prefill(df)', **kwargs))
    print("shifty:", timeit.timeit('shifty(df)', **kwargs))
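As a small sanity check of the idea behind shifty, here is a sketch run on the six-row frame from the question. The names group_start, prev_x, and builtin_diff are mine, not from the code above, and the sketch assumes the rows for each id are contiguous. For comparison it also uses pandas' built-in per-group diff, which was not part of the timings above, so I make no claim about how fast it is.

```python
import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [4, 8, 12, 25, 30, 50]})

# True on the first row of each run of equal ids
# (assumes rows with the same id are contiguous).
group_start = dat['id'] != dat['id'].shift()

# Previous row's x; at a group start use the row's own x,
# so the difference there comes out as 0.
prev_x = dat['x'].shift().where(~group_start, dat['x'])

dat['diff'] = (dat['x'] - prev_x).astype(int)
print(dat)

# Alternative (not benchmarked here): per-group diff, which leaves NaN
# on the first row of each group, so fill those with 0.
builtin_diff = dat.groupby('id')['x'].diff().fillna(0).astype(int)
print(builtin_diff.tolist())
```

Both versions should reproduce the diff column from the question. The groupby-based diff does not need the ids to be contiguous, while the shift-based version does.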
You may wish to try numba. Turn the DataFrame columns into NumPy arrays first. I couldn't get it working with letter ids, so here it is with numeric ids (run in Jupyter).
import sys
import timeit

import numpy as np
import pandas as pd
from numba import jit

n = 1000
id_arr = np.concatenate((np.tile(1, n//2), np.tile(2, n//2)), axis=None)
df = pd.DataFrame({'id': id_arr,
                   'x': np.random.randn(n)})

@jit(nopython=True)
def calculator_nb(id, x):
    res = np.empty(x.shape)
    res[0] = 0
    for i in range(1, res.shape[0]):
        if id[i] == id[i-1]:
            res[i] = x[i] - x[i-1]
        else:
            res[i] = 0

    return res
%timeit calculator_nb(*df[['id', 'x']].values.T)
459 μs ± 1.85 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
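One possible way around the letter-id limitation is to encode the ids as integers before handing them to the compiled loop. The sketch below is untimed and makes a few assumptions of its own: it uses pd.factorize for the encoding, the function name group_diff is hypothetical, and the try/except fallback is only there so the code still runs (without any speed-up) when numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Fallback (assumption): plain Python, no compilation, if numba is missing.
    def njit(func):
        return func

@njit
def group_diff(codes, x):
    # Same loop as calculator_nb, but on integer codes instead of raw ids.
    res = np.empty(x.shape[0])
    res[0] = 0.0
    for i in range(1, x.shape[0]):
        if codes[i] == codes[i - 1]:
            res[i] = x[i] - x[i - 1]
        else:
            res[i] = 0.0
    return res

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [4, 8, 12, 25, 30, 50]})

# factorize maps each distinct id to an integer code: 'a' -> 0, 'b' -> 1.
codes, uniques = pd.factorize(dat['id'])
dat['diff'] = group_diff(codes, dat['x'].to_numpy(dtype=np.float64))
print(dat)
```

Passing the two columns as separate arrays also keeps the id codes as integers, rather than stacking them into one float array with x.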