I have a function that synchronizes the columns of a dataframe according to some specifications.
The function works; nevertheless I was wondering how to:
- Improve performance
- Make it more pythonic
Please feel free to leave any suggestions.
Function specifications
Inputs:
- df: a dataframe with columns [a0, ..., aN, agent, date]. The a0 to aN names can be any valid string and those columns contain numeric values; agent and date are fixed names, agent contains numeric values and date contains datetime values.
- sync_with: the columns to synchronize with (a string or a list of strings contained in [a0, ..., aN], or, by default, an empty list to synchronize with all of [a0, ..., aN]).
Synchronization:
- Do a forward fillna grouped by agent values.
- Drop the rows where all the values of the columns to synchronize with are empty.
- Do a forward fillna grouped by agent values again on the remaining rows.
Returns: the synchronized dataframe.
Here is my function:
import pandas as pd
import numpy as np
def synchronize(df,sync_with=[]):
    _df = df.copy()
    if not isinstance(sync_with,list):
        sync_with = [sync_with]
    _fixed_cols = ['date','agent']
    _fixed_cols.extend(sync_with)
    _colset = [c for c in _df.columns if c not in _fixed_cols]
    for ag in _df.agent.unique():
        _df.loc[_df.agent==ag,_colset] = _df.loc[_df.agent==ag,_colset].fillna(method='ffill')
        if _sync_with:
            _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].dropna(how='all', subset=_sync_with)
        _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].fillna(method='ffill')
    return _df.dropna(how='all', subset=_sync_with)
Sample
foo = pd.DataFrame(dict(date=pd.to_datetime(['2010', '2011', '2012', '2013', '2010', '2013', '2015', '2016']),
                        agent=[1, 1, 1, 1, 2, 2, 2, 2],
                        _a=[1, np.nan, np.nan, 4, 5, np.nan, 7, 8],
                        _b=[11, 22, np.nan, np.nan, 55, np.nan, 77, np.nan],
                        _c=[111, np.nan, 333, np.nan, np.nan, 666, 777, np.nan]))
Results
# 1. default (13 ms per loop)
print(synchronize(foo))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
2  1.0  22.0  333.0      1 2012-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
# 2. sync with one column (35 ms per loop)
print(synchronize(foo,'_c'))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
2  1.0  22.0  333.0      1 2012-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
# 3. sync with two columns (35 ms per loop)
print(synchronize(foo,['_a','_b']))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
1 Answer
A few ways to make it more pythonic
'private' variables
Variables within a function are limited to the scope of that function, so prepending them with a _ is unnecessary.
lists
sync_with could also be another container, like a tuple. Only when it is a str do you need to convert it.
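For example, a str is itself an iterable of characters, so converting it directly with list() would split it into single characters rather than keep it as one column name:

list('_c')          # ['_', 'c']   -- not the column name we want
list(('_a', '_b'))  # ['_a', '_b'] -- other containers convert as expected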
Also be aware of what can go wrong with mutable default arguments: they are instantiated at the moment of function definition. In this case the default is not appended to in the function or returned, so it should not pose any problems, but I try to avoid them anyway.
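As a small sketch of that pitfall (append_to is just a made-up example function):

def append_to(item, bucket=[]):  # the default list is created once, at definition time
    bucket.append(item)
    return bucket

append_to(1)  # [1]
append_to(2)  # [1, 2] -- the same list object is shared between calls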
fixed_cols can then be assembled like this:
fixed_cols = ['agent', 'date'] + list(sync_with)
The only case where this can go wrong is when sync_with is a generator (list() consumes it), so I would do something like this:
def synchronize(df, sync_with=None):
    if sync_with is None:
        sync_with = []
    elif isinstance(sync_with, str):
        sync_with = [sync_with, ]
    else:
        sync_with = list(sync_with)
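To show what I mean by the generator case (a toy example): a generator can only be iterated once, so if building fixed_cols consumed it, nothing would be left for the later dropna(subset=sync_with) call. Converting sync_with to a list up front, as above, avoids that:

sync_with = (c for c in ['_a', '_b'])              # a generator, not a list
fixed_cols = ['agent', 'date'] + list(sync_with)   # this consumes the generator
list(sync_with)                                    # [] -- nothing left for later use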
performance
- Making a copy initially is unnecessary, unless you will perform operations inplace on the df.
- You can use DataFrame.groupby.transform, and then select the rows which had any non-null value in the sync_with columns of the original df.
You would get something like this:
def synchronize3(df, sync_with=[]):
    if isinstance(sync_with, str):
        sync_with = [sync_with, ]
    else:
        sync_with = list(sync_with)
    result = df.groupby('agent').transform(lambda x: x.fillna(method='ffill'))
    if sync_with:
        result = result.loc[pd.notnull(df[sync_with]).any(axis=1), :]
    result = result.assign(agent=df['agent']).reindex(columns=df.columns)
    return result
Results:
%timeit synchronize(foo)
9.48 ms ± 527 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit synchronize(foo,'_c')
29 ms ± 817 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit synchronize(foo,['_a','_b'])
27.8 ms ± 608 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
assert synchronize(foo).equals(synchronize3(foo))
%timeit synchronize3(foo)
5.71 ms ± 935 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
2  1.0  22.0  333.0      1 2012-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
assert synchronize(foo,'_c').equals(synchronize3(foo,'_c'))
%timeit synchronize3(foo,'_c')
6.41 ms ± 131 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
2  1.0  22.0  333.0      1 2012-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
assert synchronize(foo,['_a','_b']).equals(synchronize3(foo,['_a','_b']))
%timeit synchronize3(foo,['_a','_b'])
7.33 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
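As an aside, and only as a sketch I have not timed here (synchronize4 is just a placeholder name): pandas also provides DataFrameGroupBy.ffill, which does the grouped forward fill without a Python-level lambda and should give the same result:

def synchronize4(df, sync_with=None):
    # normalize sync_with as above
    if sync_with is None:
        sync_with = []
    elif isinstance(sync_with, str):
        sync_with = [sync_with]
    else:
        sync_with = list(sync_with)
    # forward fill within each agent group, without the lambda in transform
    result = df.groupby('agent').ffill()
    if sync_with:
        # keep only rows that originally had at least one non-null value in sync_with
        result = result.loc[df[sync_with].notnull().any(axis=1), :]
    # restore the agent column and the original column order
    return result.assign(agent=df['agent']).reindex(columns=df.columns)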
bugfixes
The version of your algorithm I used was this (it uses sync_with instead of the undefined _sync_with, and only applies the final dropna when sync_with is non-empty):
def synchronize(df,sync_with=[]):
    _df = df.copy()
    if not isinstance(sync_with,list):
        sync_with = [sync_with]
    _fixed_cols = ['date','agent']
    _fixed_cols.extend(sync_with)
    _colset = [c for c in _df.columns if c not in _fixed_cols]
    for ag in _df.agent.unique():
        _df.loc[_df.agent==ag,_colset] = _df.loc[_df.agent==ag,_colset].fillna(method='ffill')
        if sync_with:
            _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].dropna(how='all', subset=sync_with)
        _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].fillna(method='ffill')
    if sync_with:
        _df = _df.dropna(how='all', subset=sync_with)
    return _df.astype({'agent': 'int64'})  # else the `dtype` is different from the one in my method, throwing off the assertions
Comment from David Leon (Mar 21, 2018): Thanks for the effort, will have a look as soon as possible and let you know. Just for the first part and the use of _ for private: I knew, but it is used as a convention to ease the reading of my code.