Grouping and aggregating using Pandas

This was inspired by the Aggregate loans report without using Python standard aggregate or group functions question, but I've decided to approach it using pandas.
To recap, sample input:
MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671
Desired output:
Network,Product,Month\Year,Currency,Count
Network 1,Loan Product 1,03-16,1000,1
Network 2,Loan Product 1,03-16,1122,1
Network 2,Loan Product 1,04-16,5671,1
Network 3,Loan Product 2,03-16,5182,2
In other words, the task is to group the data from the input.csv file by Network, Product and the month+year of the Date column, then calculate the sum of the Amount column (reported as Currency) while keeping track of the count in each group.
I've solved it by first creating a separate Month\Year column (parsing the Date values into datetime objects and formatting them as month-year), then grouping by the desired columns with .groupby(), aggregating with sum and count, and finally renaming the columns to the desired names:
from datetime import datetime
import pandas as pd

df = pd.read_csv('input.csv')
# Parse the "dd-Mon-yyyy" dates and reformat them as "mm-yy";
# a raw string keeps the backslash in the column name from being read as an escape sequence.
df[r'Month\Year'] = df['Date'].apply(lambda s: datetime.strptime(s, '%d-%b-%Y').strftime('%m-%y'))
# Group, aggregate, and rename the aggregate columns to the desired output names.
grouped = df.groupby(['Network', 'Product', r'Month\Year'])['Amount']
df = grouped.agg(['sum', 'count']).rename(columns={'sum': 'Currency', 'count': 'Count'}).reset_index()
df.to_csv('output.csv', index=False)
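As a side note, I suspect the per-row apply() with strptime could be replaced by pandas' own date handling; a rough, untested sketch of what I have in mind (assuming pd.to_datetime accepts an explicit format string, which I believe it does):

# Vectorised variant of the date conversion: parse the whole column at once,
# then render it back as "mm-yy" strings.
df[r'Month\Year'] = pd.to_datetime(df['Date'], format='%d-%b-%Y').dt.strftime('%m-%y')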
Is this the most efficient and readable pandas-based solution? Can it be further improved?
In particular, I'm not quite happy with renaming the columns after aggregation; there should be a more straightforward way to aggregate into custom-named columns.
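For reference, this is roughly the kind of "aggregate straight into named columns" I'm hoping exists (I believe newer pandas releases support passing the output names as keyword arguments to agg, but I'm not sure whether that is considered idiomatic):

# Named aggregation: the output columns get their final names directly,
# so no rename() step is needed afterwards.
grouped = df.groupby(['Network', 'Product', r'Month\Year'])['Amount']
result = grouped.agg(Currency='sum', Count='count').reset_index()
result.to_csv('output.csv', index=False)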