This was inspired by the Aggregate loans report without using Python standard aggregate or group functions question, but I've decided to approach it using pandas.
To recap, sample input:
MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671
Desired output:
Network,Product,Month\Year,Currency,Count
Network 1,Loan Product 1,03-16,1000,1
Network 2,Loan Product 1,03-16,1122,1
Network 2,Loan Product 1,04-16,5671,1
Network 3,Loan Product 2,03-16,5182,2
In other words, the task is to group the data from the input.csv file by Network, Product and the month+year of the Date column, then sum the Amount values into a Currency column while keeping track of the row count in each group.
I've solved it by first creating a separate Month\Year column, parsing the Date values into datetime objects and formatting them as month-year strings, then grouping by the desired columns using .groupby(), aggregating with sum and count, and finally renaming the columns to the desired names:
from datetime import datetime
import pandas as pd
df = pd.read_csv('input.csv')
df['Month\Year'] = df['Date'].apply(lambda s: datetime.strptime(s, "%d-%b-%Y").strftime('%m-%y'))
grouped = df.groupby(['Network', 'Product', 'Month\Year'])['Amount']
df = grouped.agg(['sum', 'count']).rename(columns={'sum': 'Currency', 'count': 'Count'}).reset_index()
df.to_csv('output.csv', index=False)
Is this the most optimal and readable pandas-based solution? Can it be further improved?
I am particularly not quite happy with renaming the columns after aggregation - there should be a more straightforward way to aggregate into custom-named columns.
1 Answer
Your approach is already vectorised. That's good! But it could be better.
I've solved it via creating a separate Month\Year column first
Don't do that; there are more efficient options.
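For instance, here is a minimal sketch of a vectorised alternative to the per-row strptime in the question's apply (it still builds the extra column, which the full code below avoids entirely):

import pandas as pd

df = pd.read_csv('input.csv')

# Parse all Date strings at once instead of calling strptime per row,
# then format them back into the 'mm-yy' strings the question asks for.
dates = pd.to_datetime(df['Date'], format='%d-%b-%Y')
df['Month\\Year'] = dates.dt.strftime('%m-%y')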
Is this the most optimal and readable pandas-based solution?
Not really.
I am particularly not quite happy with renaming columns after aggregation
That can indeed be solved through keyword-argument-style agg calls (named aggregation), but it isn't the biggest issue. The biggest issue is that you should be doing a single groupby with one of the group levels being a time grouper (essentially the same effect as a resample):
import io

import pandas as pd

with io.StringIO(
'''MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671
''') as f:
    df = pd.read_csv(f, parse_dates=['Date'], date_format='%d-%b-%Y')

out = (
    df.groupby([
        'Network', 'Product',
        pd.Grouper(key='Date', freq='MS'),  # month-start resampling
    ])['Amount']
    .agg(Currency='sum', Count='count')
)
print(out)
                                     Currency  Count
Network   Product        Date
Network 1 Loan Product 1 2016-03-01      1000      1
Network 2 Loan Product 1 2016-03-01      1122      1
                         2016-04-01      5671      1
Network 3 Loan Product 2 2016-03-01      5182      2
Comments:
If the Date column is a datetime, you could skip the new column and use Grouper, but that leaves you with datetime objects instead of strs. I don't really know a way to avoid the renaming at the end.
If Date is not a datetime, you can use pd.to_datetime() instead of the lambda and strptime.
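If the mm-yy strings from the desired output are needed rather than datetimes, one option (a sketch continuing from the out frame computed in the answer; the column order is taken from the desired output) is to format the Date level after aggregating:

# Continue from the aggregated out frame above: flatten the index,
# turn the month-start dates into 'mm-yy' strings and write the CSV.
result = out.reset_index()
result['Month\\Year'] = result['Date'].dt.strftime('%m-%y')
result = result[['Network', 'Product', 'Month\\Year', 'Currency', 'Count']]
result.to_csv('output.csv', index=False)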