I am trying to apply a function to each group in a pandas dataframe where the function requires access to the entire group (as opposed to just one row). For this I am iterating over each group in the groupby object. Is this the best way to achieve this?
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'value': [70, 10, 20, 100, 50, 5, 33],
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})

def clean_df(df, v_col, other_col):
    '''This function is just a made-up example and might
    get more complex in real life. ;)
    '''
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]

grouped = df.groupby('id')
pd.concat([clean_df(group, 'value', 'other_value') for _, group in grouped])
The original dataframe is:
   id  other_value  value
0   1          2.3     70
1   1          3.3     10
2   1          7.4     20
3   1          1.1    100
4   2          5.0     50
5   2         10.3      5
6   2         12.0     33
The code will reduce it to:
   id  other_value  value
0   1          2.3     70
1   1          3.3     10
4   2          5.0     50
Answer
You can call apply directly on the grouped dataframe, and your function will be passed each whole group in turn:
def clean_df(df, v_col='value', other_col='other_value'):
    '''This function is just a made-up example and might
    get more complex in real life. ;)
    '''
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]

df.groupby('id').apply(clean_df).reset_index(level=0, drop=True)
#    id  other_value  value
# 0   1          2.3     70
# 1   1          3.3     10
# 4   2          5.0     50
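As a quick sanity check, the apply version can be compared against the explicit loop from the question. This is only a sketch: the names looped, applied and same_rows are made up for illustration, and the comparison is done on the row index because newer pandas versions may warn about, or exclude, the grouping column inside apply.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'value': [70, 10, 20, 100, 50, 5, 33],
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})

def clean_df(group, v_col='value', other_col='other_value'):
    # Same made-up filter as above; other_col is unused here, as in the original.
    prev_points = group[v_col].shift(1)
    next_points = group[v_col].shift(-1)
    return group[(prev_points > 50) | (next_points < 20)]

# Explicit loop over the groups, as in the question.
looped = pd.concat([clean_df(group) for _, group in df.groupby('id')])

# apply-based version; reset_index drops the group key that apply adds as index level 0.
applied = df.groupby('id').apply(clean_df).reset_index(level=0, drop=True)

# Compare which rows survived rather than the full frames, since the set of
# columns returned by apply can differ between pandas versions.
same_rows = looped.index.equals(applied.index)
print(same_rows)
```

If both approaches keep the same rows, the choice between them is mostly about readability.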
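Note that I gave the other arguments default values so that the function can be called with the group as its only argument. (Strictly speaking, apply also forwards extra positional and keyword arguments to the applied function, so defaults are not the only option.) Another way around this is to make a function that returns the function: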
def clean_df(v_col, other_col):
    '''This function is just a made-up example and might
    get more complex in real life. ;)
    '''
    def wrapper(df):
        prev_points = df[v_col].shift(1)
        next_points = df[v_col].shift(-1)
        return df[(prev_points > 50) | (next_points < 20)]
    return wrapper
Which you can use like this:
df.groupby('id').apply(clean_df('value', 'other_value')).reset_index(level=0, drop=True)
Or you can use functools.partial with your original three-argument clean_df from the question:
from functools import partial

df.groupby('id') \
    .apply(partial(clean_df, v_col='value', other_col='other_value')) \
    .reset_index(level=0, drop=True)
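One more option that might be worth knowing: groupby's apply forwards any extra positional and keyword arguments to the function it calls, so the three-argument clean_df from the question can be passed as is, without defaults, a wrapper, or partial. A minimal sketch, where result is just an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'value': [70, 10, 20, 100, 50, 5, 33],
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})

def clean_df(group, v_col, other_col):
    # Three-argument version from the question; other_col is unused there too.
    prev_points = group[v_col].shift(1)
    next_points = group[v_col].shift(-1)
    return group[(prev_points > 50) | (next_points < 20)]

# Keyword arguments after the function are passed through to clean_df
# on every call, alongside the group itself.
result = (df.groupby('id')
            .apply(clean_df, v_col='value', other_col='other_value')
            .reset_index(level=0, drop=True))
print(result)
```

Depending on the pandas version, the 'id' column may or may not appear in the result, because newer releases treat the grouping column differently inside apply.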