I implemented the following code to calculate the year-to-date (YTD) sum in Pandas:
```python
import pandas as pd


def calculateYTDSum(df: pd.DataFrame) -> pd.DataFrame:
    '''Calculates the YTD sum of numeric values in a dataframe.

    This assumes the input dataframe contains a "quarter" column of type "Quarter".
    '''
    ans = (df
           .sort_values(by='quarter', ascending=True)
           .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
           .groupby('_year')
           .apply(lambda x: x
                  .set_index('quarter')
                  .cumsum()
                  )
           .drop(columns=['_year'])
           .reset_index()
           .drop(columns=['_year'])
           .sort_values(by='quarter', ascending=False)
           )
    return ans
```
To see it in action, consider the following:
```python
from dataclasses import dataclass


@dataclass
class Quarter:  # This class is used elsewhere in the codebase
    year: int
    quarter: int

    def __repr__(self):
        return f'{self.year} Q{self.quarter}'

    def __hash__(self) -> int:
        return self.year * 4 + self.quarter

    def __lt__(self, other):
        return hash(self) < hash(other)


df = pd.DataFrame({
    'quarter': [Quarter(2020, 4),
                Quarter(2020, 3),
                Quarter(2020, 2),
                Quarter(2020, 1),
                Quarter(2019, 4),
                Quarter(2019, 3),
                Quarter(2019, 2),
                Quarter(2019, 1)],
    'quantity1': [1, 1, 1, 1, 1, 1, 1, 1],
    'quantity2': [2, 2, 2, 2, 3, 3, 3, 3],
})
```
Then you have:
`df` =

|   | quarter | quantity1 | quantity2 |
|---|---------|-----------|-----------|
| 0 | 2020 Q4 | 1 | 2 |
| 1 | 2020 Q3 | 1 | 2 |
| 2 | 2020 Q2 | 1 | 2 |
| 3 | 2020 Q1 | 1 | 2 |
| 4 | 2019 Q4 | 1 | 3 |
| 5 | 2019 Q3 | 1 | 3 |
| 6 | 2019 Q2 | 1 | 3 |
| 7 | 2019 Q1 | 1 | 3 |
and `df.pipe(calculateYTDSum)` =

|   | quarter | quantity1 | quantity2 |
|---|---------|-----------|-----------|
| 4 | 2020 Q4 | 4 | 8 |
| 5 | 2020 Q3 | 3 | 6 |
| 6 | 2020 Q2 | 2 | 4 |
| 7 | 2020 Q1 | 1 | 2 |
| 0 | 2019 Q4 | 4 | 12 |
| 1 | 2019 Q3 | 3 | 9 |
| 2 | 2019 Q2 | 2 | 6 |
| 3 | 2019 Q1 | 1 | 3 |
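As an aside, `sort_values(by='quarter')` orders rows chronologically because `Quarter.__lt__` compares hashes and `year * 4 + quarter` is strictly increasing in (year, quarter). A quick check (my own illustration, not part of the example above):

```python
# hash(Quarter(2019, 4)) == 8080 < 8081 == hash(Quarter(2020, 1))
assert Quarter(2019, 4) < Quarter(2020, 1)
# sorted() uses __lt__, so quarters sort chronologically
assert sorted([Quarter(2020, 1), Quarter(2019, 4)]) == [Quarter(2019, 4), Quarter(2020, 1)]
```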
However, even for a small sample like the above, the calculation takes ~4 ms, and, to be honest, it looks unmaintainable.
I welcome any recommendations on Python tooling, libraries, Pandas extensions, or code changes that would improve the performance and/or simplicity of the code.
1 Answer
TL;DR
The current `groupby.apply` code computes an extra cumsum (over the helper `_year` column) and requires a lot of extra index manipulation (set + drop + reset + drop). Instead, use `groupby.cumsum`, which is more idiomatic and ~20x faster for larger dataframes.
Issues
This `groupby.apply` adds a lot of overhead:

```python
...groupby('_year').apply(lambda x: x.set_index('quarter').cumsum())
```

- It sets an index.
- It computes an extra cumsum over `_year`.
- It later requires dropping both the `_year` index level and the `_year` column.
We can see this intermediate state by stopping the chain early:
```python
(df.sort_values(by='quarter', ascending=True)
   .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
   .groupby('_year').apply(lambda g: g.set_index('quarter').cumsum())
)
# Note that the helper `_year` column itself gets cumulatively summed
# (2019 + 2019 = 4038, and so on):
#
#                quantity1  quantity2  _year
# _year quarter
# 2019  2019 Q1          1          3   2019
#       2019 Q2          2          6   4038
#       2019 Q3          3          9   6057
#       2019 Q4          4         12   8076
# 2020  2020 Q1          1          2   2020
#       2020 Q2          2          4   4040
#       2020 Q3          3          6   6060
#       2020 Q4          4          8   8080
```
Suggestions
`groupby.cumsum` is fast and idiomatic, but we lose the `quarter` column:
```python
(df.sort_values(by='quarter', ascending=True)
   .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
   .groupby('_year').cumsum()
)
#    quantity1  quantity2
# 7          1          3
# 6          2          6
# 5          3          9
# 4          4         12
# 3          1          2
# 2          2          4
# 1          3          6
# 0          4          8
```
So we can just `join` this `groupby.cumsum` result back to `df[['quarter']]`:
```python
df[['quarter']].join(
    df.sort_values(by='quarter', ascending=True)
      .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
      .groupby('_year').cumsum()
)
#    quarter  quantity1  quantity2
# 0  2020 Q4          4          8
# 1  2020 Q3          3          6
# 2  2020 Q2          2          4
# 3  2020 Q1          1          2
# 4  2019 Q4          4         12
# 5  2019 Q3          3          9
# 6  2019 Q2          2          6
# 7  2019 Q1          1          3
```
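Wrapped back into the question's signature, the whole function reduces to something like the sketch below (assuming, as in the run above, a pandas version that silently drops the non-numeric `quarter` column inside `cumsum`):

```python
def calculateYTDSum(df: pd.DataFrame) -> pd.DataFrame:
    '''Calculates the YTD sum of numeric values in a dataframe.

    Assumes a sortable "quarter" column of Quarter objects, as in the original.
    '''
    return df[['quarter']].join(
        df.sort_values(by='quarter', ascending=True)
          .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
          # Older pandas drops non-numeric columns in cumsum; newer pandas
          # may need an explicit selection, e.g.
          # .groupby('_year')[['quantity1', 'quantity2']].cumsum()
          .groupby('_year').cumsum()
    )
```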
Timings
At 10K rows, the `groupby.cumsum` approach is ~21x faster than `groupby.apply`:
```python
%%timeit
df[['quarter']].join(
    df.sort_values(by='quarter', ascending=True)
      .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
      .groupby('_year').cumsum()
)
# 74 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
```python
%%timeit
(df.sort_values(by='quarter', ascending=True)
   .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
   .groupby('_year').apply(lambda g: g.set_index('quarter').cumsum())
   .drop(columns=['_year'])
   .reset_index()
   .drop(columns=['_year'])
   .sort_values(by='quarter', ascending=False)
)
# 1.58 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Testing data for reference:
```python
import numpy as np

rng = np.random.default_rng(123)
n = 2500
df = pd.DataFrame({
    'quarter': [Quarter(y, q) for y in range(1000, 1000 + n) for q in (4, 3, 2, 1)],
    'quantity1': rng.integers(5, size=n * 4),
    'quantity2': rng.integers(10, size=n * 4),
})
#       quarter  quantity1  quantity2
# 0     1000 Q4          0          8
# 1     1000 Q3          3          5
# ...       ...        ...        ...
# 9998  3499 Q2          1          7
# 9999  3499 Q1          0          4
#
# [10000 rows x 3 columns]
```
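As a quick sanity check (not part of the timings above), the two approaches can be verified to agree once row order and index are aligned:

```python
# Sanity check, assuming the sample df and calculateYTDSum from the
# question are in scope. Both frames are sorted and reindexed so that
# only the values are compared.
fast = df[['quarter']].join(
    df.sort_values(by='quarter', ascending=True)
      .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
      .groupby('_year').cumsum()
)
slow = df.pipe(calculateYTDSum)

pd.testing.assert_frame_equal(
    fast.sort_values(by='quarter').reset_index(drop=True),
    slow.sort_values(by='quarter').reset_index(drop=True),
)
```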
(A follow-up comment from the asker notes that the `_year` column from the `groupby` is retained in the `apply` result; in fact the first `drop` removes `_year` so that the index can then be reset.)