This was inspired by the Aggregate loans report without using Python standard aggregate or group functions question, but I've decided to approach it using pandas.
To recap, sample input:
MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671
Desired output:
Network,Product,Month\Year,Currency,Count
Network 1,Loan Product 1,03-16,1000,1
Network 2,Loan Product 1,03-16,1122,1
Network 2,Loan Product 1,04-16,5671,1
Network 3,Loan Product 2,03-16,5182,2
In other words, the task is to group the data from the input.csv file by Network, Product and the month+year of the Date column, then sum the Amount values into a Currency column while keeping track of the row count in each group.
I've solved it by first creating a separate Month\Year column, parsing the Date values into datetime objects and formatting them as month-year strings, then grouping by the desired columns using .groupby(), aggregating with sum and count, and finally renaming the columns to the desired names:
from datetime import datetime
import pandas as pd
df = pd.read_csv('input.csv')
df['Month\Year'] = df['Date'].apply(lambda s: datetime.strptime(s, "%d-%b-%Y").strftime('%m-%y'))
grouped = df.groupby(['Network', 'Product', 'Month\Year'])['Amount']
df = grouped.agg(['sum', 'count']).rename(columns={'sum': 'Currency', 'count': 'Count'}).reset_index()
df.to_csv('output.csv', index=False)
Is this the most optimal and readable pandas-based solution? Can it be further improved?
I am particularly not quite happy with renaming the columns after aggregation - there should be a more straightforward way to aggregate into custom-named columns.
1 Answer
Your approach is already vectorised. That's good! But it could be better.
I've solved it via creating a separate Month\Year column first
Don't do that; there are more efficient options.
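For instance, here is a minimal sketch of a vectorised alternative to the per-row strptime in the question's apply (it still builds the extra column, which the full code below avoids entirely):

import pandas as pd

df = pd.read_csv('input.csv')

# Parse all Date strings at once instead of calling strptime per row,
# then format them back into the 'mm-yy' strings the question asks for.
dates = pd.to_datetime(df['Date'], format='%d-%b-%Y')
df['Month\\Year'] = dates.dt.strftime('%m-%y')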
Is this the most optimal and readable pandas-based solution?
Not really.
I am particularly not quite happy with renaming columns after aggregation
That can indeed be solved through keyword-argument-style agg calls (named aggregation), but it isn't the biggest issue. The biggest issue is that you should be doing a single groupby with one of the group levels being a time grouper (essentially the same effect as a resample):
import io

import pandas as pd

with io.StringIO(
'''MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671
''') as f:
    df = pd.read_csv(f, parse_dates=['Date'], date_format='%d-%b-%Y')

out = (
    df.groupby([
        'Network', 'Product',
        pd.Grouper(key='Date', freq='MS'),  # month-start resampling
    ])['Amount']
    .agg(Currency='sum', Count='count')
)
print(out)
                                     Currency  Count
Network   Product        Date
Network 1 Loan Product 1 2016-03-01      1000      1
Network 2 Loan Product 1 2016-03-01      1122      1
                         2016-04-01      5671      1
Network 3 Loan Product 2 2016-03-01      5182      2
Comments:
If the Date column is a datetime, you could skip the new column and use Grouper, but that leaves you with datetime objects instead of strs. I don't really know a way to avoid the renaming at the end.
If Date is not a datetime, you can use pd.to_datetime() instead of the lambda and strptime.
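If the mm-yy strings from the desired output are needed rather than datetimes, one option (a sketch continuing from the out frame computed in the answer; the column order is taken from the desired output) is to format the Date level after aggregating:

# Continue from the aggregated out frame above: flatten the index,
# turn the month-start dates into 'mm-yy' strings and write the CSV.
result = out.reset_index()
result['Month\\Year'] = result['Date'].dt.strftime('%m-%y')
result = result[['Network', 'Product', 'Month\\Year', 'Currency', 'Count']]
result.to_csv('output.csv', index=False)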