
I implemented the following code to calculate the YTD sum in Pandas:


import pandas as pd


def calculateYTDSum(df: pd.DataFrame) -> pd.DataFrame:
    '''Calculates the YTD sum of numeric values in a dataframe.

    This assumes the input dataframe contains a "quarter" column of type "Quarter".
    '''
    ans = (df
           .sort_values(by='quarter', ascending=True)
           .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
           .groupby('_year')
           .apply(lambda g: g
                  .set_index('quarter')
                  .cumsum()
                  )
           .drop(columns=['_year'])  # drop the (cumsum'd) _year column
           .reset_index()
           .drop(columns=['_year'])  # drop the _year index level restored as a column
           .sort_values(by='quarter', ascending=False)
           )

    return ans

To see it in action, consider the following:

from dataclasses import dataclass


@dataclass
class Quarter:  # This class is used elsewhere in the codebase
    year: int
    quarter: int

    def __repr__(self):
        return f'{self.year} Q{self.quarter}'

    def __hash__(self) -> int:
        return self.year * 4 + self.quarter

    def __lt__(self, other):
        return hash(self) < hash(other)


df = pd.DataFrame({
    'quarter': [Quarter(2020, 4),
                Quarter(2020, 3),
                Quarter(2020, 2),
                Quarter(2020, 1),
                Quarter(2019, 4),
                Quarter(2019, 3),
                Quarter(2019, 2),
                Quarter(2019, 1)],
    'quantity1': [1, 1, 1, 1, 1, 1, 1, 1],
    'quantity2': [2, 2, 2, 2, 3, 3, 3, 3],
})

Then you have:

df =

   quarter  quantity1  quantity2
0  2020 Q4          1          2
1  2020 Q3          1          2
2  2020 Q2          1          2
3  2020 Q1          1          2
4  2019 Q4          1          3
5  2019 Q3          1          3
6  2019 Q2          1          3
7  2019 Q1          1          3

and df.pipe(calculateYTDSum) =

   quarter  quantity1  quantity2
4  2020 Q4          4          8
5  2020 Q3          3          6
6  2020 Q2          2          4
7  2020 Q1          1          2
0  2019 Q4          4         12
1  2019 Q3          3          9
2  2019 Q2          2          6
3  2019 Q1          1          3

However, even for a small sample like the one above, the calculation takes ~4 ms, and to be honest the code looks unmaintainable.

I welcome any recommendations on Python tooling, libraries, Pandas extensions, or code changes that would improve the performance and/or simplicity of the code.

asked Jan 14, 2022 at 16:59
  • Do you need the second drop after the reset_index? Commented Jan 14, 2022 at 18:32
  • Oddly I do, else the _year column from the groupby is retained (in fact I do the first drop to remove _year so that I can reset the index). Commented Jan 14, 2022 at 19:13

1 Answer


TL;DR

The current groupby.apply code computes an extra cumsum over the helper _year column and requires a lot of extra index manipulation (set + drop + reset + drop).

Instead, use groupby.cumsum, which is more idiomatic and ~20x faster for larger dataframes.


Issues

This groupby.apply adds a lot of overhead:

...groupby('_year').apply(lambda g: g.set_index('quarter').cumsum())
  • Sets an index
  • Computes an extra cumsum over _year
  • Later requires dropping both the _year index level and the _year column

We can see this intermediate state by stopping the chain early:

(df.sort_values(by='quarter', ascending=True)
 .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
 .groupby('_year').apply(lambda g: g.set_index('quarter').cumsum())
)
#                quantity1  quantity2  _year
# _year quarter
# 2019  2019 Q1          1          3   2019
#       2019 Q2          2          6   4038
#       2019 Q3          3          9   6057
#       2019 Q4          4         12   8076
# 2020  2020 Q1          1          2   2020
#       2020 Q2          2          4   4040
#       2020 Q3          3          6   6060
#       2020 Q4          4          8   8080
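Note that the _year column itself has been cumsum'd (4038 = 2019 × 2, and so on). As an aside, selecting just the quantity columns (names taken from the example data) and passing group_keys=False would avoid both the extra cumsum and the index cleanup even within the apply approach, as in this sketch, though the per-group apply overhead remains:

(df.sort_values(by='quarter', ascending=True)
 .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
 .groupby('_year', group_keys=False)[['quantity1', 'quantity2']]
 .apply(lambda g: g.cumsum())  # group_keys=False keeps the original row index
)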

Suggestions

groupby.cumsum is fast and idiomatic, but it returns only the cumulated columns, so we lose the quarter column:

(df.sort_values(by='quarter', ascending=True)
 .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
 .groupby('_year').cumsum()
)
#    quantity1  quantity2
# 7          1          3
# 6          2          6
# 5          3          9
# 4          4         12
# 3          1          2
# 2          2          4
# 1          3          6
# 0          4          8

So we can just join this groupby.cumsum result back to df[['quarter']]:

df[['quarter']].join(
    df.sort_values(by='quarter', ascending=True)
      .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
      .groupby('_year')
      .cumsum()
)
#    quarter  quantity1  quantity2
# 0  2020 Q4          4          8
# 1  2020 Q3          3          6
# 2  2020 Q2          2          4
# 3  2020 Q1          1          2
# 4  2019 Q4          4         12
# 5  2019 Q3          3          9
# 6  2019 Q2          2          6
# 7  2019 Q1          1          3
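Packaged as a drop-in replacement for the original function, this could look like the following sketch. Note the row index differs from the original function's output because no reset_index is involved, and the rows keep the input order instead of being re-sorted descending (which happens to match the example, since its input was already newest-first):

def calculateYTDSum(df: pd.DataFrame) -> pd.DataFrame:
    '''Calculates the YTD sum of numeric values in a dataframe.

    Assumes a sortable "quarter" column whose values expose a .year attribute.
    '''
    return df[['quarter']].join(
        df.sort_values(by='quarter', ascending=True)
          .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
          .groupby('_year')
          .cumsum()
    )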

Timings

At 10K rows, the groupby.cumsum approach is ~21x faster than groupby.apply:

%%timeit
df[['quarter']].join(
    df.sort_values(by='quarter', ascending=True)
      .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
      .groupby('_year')
      .cumsum()
)
# 74 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
(df.sort_values(by='quarter', ascending=True)
 .assign(_year=lambda x: x['quarter'].apply(lambda q: q.year))
 .groupby('_year').apply(lambda g: g.set_index('quarter').cumsum())
 .drop(columns=['_year'])
 .reset_index()
 .drop(columns=['_year'])
 .sort_values(by='quarter', ascending=False)
)
# 1.58 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing data for reference:

import numpy as np

rng = np.random.default_rng(123)
n = 2500
df = pd.DataFrame({
    'quarter': [Quarter(y, q) for y in range(1000, 1000 + n) for q in (4, 3, 2, 1)],
    'quantity1': rng.integers(5, size=n * 4),
    'quantity2': rng.integers(10, size=n * 4),
})
#       quarter  quantity1  quantity2
# 0     1000 Q4          0          8
# 1     1000 Q3          3          5
# ...       ...        ...        ...
# 9998  3499 Q2          1          7
# 9999  3499 Q1          0          4
#
# [10000 rows x 3 columns]
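Finally, most of the remaining cost is the Python-level .apply extracting years. If changing the dtype of the quarter column were acceptable (the question notes the Quarter class is used elsewhere, so this is only a hypothetical), storing the quarters as pandas Period values would make year extraction vectorized. A sketch under that assumption:

# Hypothetical: convert Quarter objects to a quarterly PeriodIndex once,
# then use the vectorized PeriodIndex.year accessor instead of .apply.
pq = pd.PeriodIndex([f'{q.year}Q{q.quarter}' for q in df['quarter']], freq='Q')
ytd = df[['quarter']].join(
    df.assign(_year=pq.year)
      .sort_values(by='quarter', ascending=True)
      .groupby('_year')
      .cumsum()
)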
answered Jan 16, 2022 at 13:32
