\$\begingroup\$

I am trying to apply a function to each group in a pandas dataframe where the function requires access to the entire group (as opposed to just one row). For this I am iterating over each group in the groupby object. Is this the best way to achieve this?

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'value': [70, 10, 20, 100, 50, 5, 33],
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})

def clean_df(df, v_col, other_col):
    '''This function is just a made up example and might
    get more complex in real life. ;)
    '''
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]

grouped = df.groupby('id')
pd.concat([clean_df(group, 'value', 'other_value') for _, group in grouped])

The original dataframe is

   id  other_value  value
0   1          2.3     70
1   1          3.3     10
2   1          7.4     20
3   1          1.1    100
4   2          5.0     50
5   2         10.3      5
6   2         12.0     33

The code will reduce it to

   id  other_value  value
0   1          2.3     70
1   1          3.3     10
4   2          5.0     50
asked Apr 9, 2019 at 9:44
\$\endgroup\$

1 Answer

\$\begingroup\$

You can call apply directly on the grouped dataframe; the function you pass receives each whole group:

def clean_df(df, v_col='value', other_col='other_value'):
    '''This function is just a made up example and might
    get more complex in real life. ;)
    '''
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]

df.groupby('id').apply(clean_df).reset_index(level=0, drop=True)
#    id  other_value  value
# 0   1          2.3     70
# 1   1          3.3     10
# 4   2          5.0     50
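
As a possible alternative to the trailing reset_index call, the group key can be kept out of the result index from the start by passing group_keys=False to groupby. The following is a minimal sketch using the sample data from the question; note that recent pandas releases have deprecated passing the grouping column to the applied function, so whether the id column appears in the result may depend on the installed version.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'value': [70, 10, 20, 100, 50, 5, 33],
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})


def clean_df(df, v_col='value', other_col='other_value'):
    # Same made-up filter as above; other_col is unused in this example.
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]


# group_keys=False keeps the original row labels instead of adding
# 'id' as an extra index level, so no reset_index is needed afterwards.
result = df.groupby('id', group_keys=False).apply(clean_df)
print(result)
```

The selected rows keep their original labels, which matches what the reset_index version produces.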

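Note that I gave the other arguments default values so that the function can be called with the group as its only argument. (This is not strictly required, since apply also forwards any extra positional and keyword arguments to the function.) Another way around this is to make a function that returns the function: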

def clean_df(v_col, other_col):
    '''This function is just a made up example and might
    get more complex in real life. ;)
    '''
    def wrapper(df):
        prev_points = df[v_col].shift(1)
        next_points = df[v_col].shift(-1)
        return df[(prev_points > 50) | (next_points < 20)]
    return wrapper

Which you can use like this:

df.groupby('id').apply(clean_df('value', 'other_value')).reset_index(level=0, drop=True)

Or you can use functools.partial with your clean_df:

from functools import partial
df.groupby('id') \
    .apply(partial(clean_df, v_col='value', other_col='other_value')) \
    .reset_index(level=0, drop=True)
answered Apr 9, 2019 at 11:54
\$\endgroup\$
