I have a function that synchronizes the columns of a dataframe according to some specifications.
The function works; nevertheless I was wondering how to:
- Improve performance
- Make it more pythonic
Please feel free to leave any suggestions.
Function specifications
Inputs:
- df: a dataframe with columns [a0, ..., aN, agent, date]. The a0 to aN names can be any valid string and those columns contain numeric values; agent and date are fixed names, agent contains numeric values and date contains datetime values.
- sync_with: the columns to synchronize with (a string or a list of strings contained in [a0, ..., aN], or, by default, an empty list to synchronize with all of [a0, ..., aN]).
Synchronization:
- Do a forward fillna grouped by agent values.
- Drop the rows where all the values of the columns to synchronize with are empty.
- Do a forward fillna grouped by agent values again on the remaining rows.
Returns: the synchronized dataframe.
Here is my function:
import pandas as pd
import numpy as np
def synchronize(df,sync_with=[]):
    _df = df.copy()
    if not isinstance(sync_with,list):
        sync_with = [sync_with]
    _fixed_cols = ['date','agent']
    _fixed_cols.extend(sync_with)
    _colset = [c for c in _df.columns if c not in _fixed_cols]
    for ag in _df.agent.unique():
        _df.loc[_df.agent==ag,_colset] = _df.loc[_df.agent==ag,_colset].fillna(method='ffill')
        if _sync_with:
            _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].dropna(how='all', subset=_sync_with)
        _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].fillna(method='ffill')
    return _df.dropna(how='all', subset=_sync_with)
Sample
foo = pd.DataFrame(dict(date=pd.to_datetime(['2010', '2011', '2012', '2013', '2010', '2013', '2015', '2016']),
                        agent=[1, 1, 1, 1, 2, 2, 2, 2],
                        _a=[1, np.nan, np.nan, 4, 5, np.nan, 7, 8],
                        _b=[11, 22, np.nan, np.nan, 55, np.nan, 77, np.nan],
                        _c=[111, np.nan, 333, np.nan, np.nan, 666, 777, np.nan]))
Results
# 1. default (13 ms per loop)
print(synchronize(foo))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
2  1.0  22.0  333.0      1 2012-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
# 2. sync with one column (35 ms per loop)
print(synchronize(foo,'_c'))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
2  1.0  22.0  333.0      1 2012-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
# 3. sync with two columns (35 ms per loop)
print(synchronize(foo,['_a','_b']))
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
1 Answer
A few ways to make it more pythonic
'private' variables
Variables within a function are limited to the scope of that function, so prepending them with a _ is unnecessary.
lists
sync_with could also be another container, like a tuple. Only when it is a str do you need to convert it.
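For example, a str is itself an iterable of characters, so converting it directly with list() would split it into single characters rather than keep it as one column name:

list('_c')          # ['_', 'c']   -- not the column name we want
list(('_a', '_b'))  # ['_a', '_b'] -- other containers convert as expected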
Also be aware of what can go wrong with mutable default arguments: they are instantiated at the moment of function definition. In this case the default is not appended to in the function or returned, so it should not pose any problems, but I try to avoid them anyway.
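As a small sketch of that pitfall (append_to is just a made-up example function):

def append_to(item, bucket=[]):  # the default list is created once, at definition time
    bucket.append(item)
    return bucket

append_to(1)  # [1]
append_to(2)  # [1, 2] -- the same list object is shared between calls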
fixed_cols can then be assembled like this:
fixed_cols = ['agent', 'date'] + list(sync_with)
The only case where this can go wrong is when sync_with is a generator (list() consumes it), so I would do something like this:
def synchronize(df, sync_with=None):
    if sync_with is None:
        sync_with = []
    elif isinstance(sync_with, str):
        sync_with = [sync_with, ]
    else:
        sync_with = list(sync_with)
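To show what I mean by the generator case (a toy example): a generator can only be iterated once, so if building fixed_cols consumed it, nothing would be left for the later dropna(subset=sync_with) call. Converting sync_with to a list up front, as above, avoids that:

sync_with = (c for c in ['_a', '_b'])              # a generator, not a list
fixed_cols = ['agent', 'date'] + list(sync_with)   # this consumes the generator
list(sync_with)                                    # [] -- nothing left for later use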
performance
- Making a copy initially is unnecessary, unless you will perform operations inplace on the df.
- You can use DataFrame.groupby.transform, and then select the rows which had any non-null value in the sync_with columns of the original df.
You would get something like this:
def synchronize3(df, sync_with=[]):
    if isinstance(sync_with, str):
        sync_with = [sync_with, ]
    else:
        sync_with = list(sync_with)
    result = df.groupby('agent').transform(lambda x: x.fillna(method='ffill'))
    if sync_with:
        result = result.loc[pd.notnull(df[sync_with]).any(axis=1), :]
    result = result.assign(agent=df['agent']).reindex(columns=df.columns)
    return result
Results:
%timeit synchronize(foo)
9.48 ms ± 527 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit synchronize(foo,'_c')
29 ms ± 817 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit synchronize(foo,['_a','_b'])
27.8 ms ± 608 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
assert synchronize(foo).equals(synchronize3(foo))
%timeit synchronize3(foo)
5.71 ms ± 935 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
2  1.0  22.0  333.0      1 2012-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
assert synchronize(foo,'_c').equals(synchronize3(foo,'_c'))
%timeit synchronize3(foo,'_c')
6.41 ms ± 131 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
2  1.0  22.0  333.0      1 2012-01-01
5  5.0  55.0  666.0      2 2013-01-01
6  7.0  77.0  777.0      2 2015-01-01
assert synchronize(foo,['_a','_b']).equals(synchronize3(foo,['_a','_b']))
%timeit synchronize3(foo,['_a','_b'])
7.33 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    _a    _b     _c  agent       date
0  1.0  11.0  111.0      1 2010-01-01
1  1.0  22.0  111.0      1 2011-01-01
3  4.0  22.0  333.0      1 2013-01-01
4  5.0  55.0    NaN      2 2010-01-01
6  7.0  77.0  777.0      2 2015-01-01
7  8.0  77.0  777.0      2 2016-01-01
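As an aside, and only as a sketch I have not timed here (synchronize4 is just a placeholder name): pandas also provides DataFrameGroupBy.ffill, which does the grouped forward fill without a Python-level lambda and should give the same result:

def synchronize4(df, sync_with=None):
    # normalize sync_with as above
    if sync_with is None:
        sync_with = []
    elif isinstance(sync_with, str):
        sync_with = [sync_with]
    else:
        sync_with = list(sync_with)
    # forward fill within each agent group, without the lambda in transform
    result = df.groupby('agent').ffill()
    if sync_with:
        # keep only rows that originally had at least one non-null value in sync_with
        result = result.loc[df[sync_with].notnull().any(axis=1), :]
    # restore the agent column and the original column order
    return result.assign(agent=df['agent']).reindex(columns=df.columns)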
bugfixes
The version of your algorithm I used was this (it uses sync_with instead of the undefined _sync_with, and only applies the final dropna when sync_with is non-empty):
def synchronize(df,sync_with=[]):
    _df = df.copy()
    if not isinstance(sync_with,list):
        sync_with = [sync_with]
    _fixed_cols = ['date','agent']
    _fixed_cols.extend(sync_with)
    _colset = [c for c in _df.columns if c not in _fixed_cols]
    for ag in _df.agent.unique():
        _df.loc[_df.agent==ag,_colset] = _df.loc[_df.agent==ag,_colset].fillna(method='ffill')
        if sync_with:
            _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].dropna(how='all', subset=sync_with)
        _df.loc[_df.agent==ag,:] = _df.loc[_df.agent==ag,:].fillna(method='ffill')
    if sync_with:
        _df = _df.dropna(how='all', subset=sync_with)
    return _df.astype({'agent': 'int64'})  # else the `dtype` is different from the one in my method, throwing off the assertions
Comment from David Leon (Mar 21, 2018): Thanks for the effort, will have a look as soon as possible and let you know. Just for the first part and the use of _ for private: I knew, but it is used as a convention to ease the reading of my code.