Grouping and aggregating using Pandas

This was inspired by the Aggregate loans report without using Python standard aggregate or group functions question, but I've decided to approach it using pandas.

To recap, sample input:

MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671

Desired output:

Network,Product,Month\Year,Currency,Count
Network 1,Loan Product 1,03-16,1000,1
Network 2,Loan Product 1,03-16,1122,1
Network 2,Loan Product 1,04-16,5671,1
Network 3,Loan Product 2,03-16,5182,2

In other words, the task is to group the data from the input.csv file by Network, Product and the month+year of the Date column, then calculate the sum of the Amount column (written out as Currency) while keeping track of the count of rows in each group.

I've solved it by first creating a separate Month\Year column (parsing the Date values into datetime objects and formatting them as month-year), then grouping by the desired columns using .groupby(), aggregating with sum and count, and finally renaming the result columns to the desired names:

from datetime import datetime
import pandas as pd

df = pd.read_csv('input.csv')

# Reformat "12-Mar-2016"-style dates as "03-16"; raw strings keep the literal
# backslash in the column name without an invalid "\Y" escape warning
df[r'Month\Year'] = df['Date'].apply(
    lambda s: datetime.strptime(s, '%d-%b-%Y').strftime('%m-%y'))

grouped = df.groupby(['Network', 'Product', r'Month\Year'])['Amount']
df = grouped.agg(['sum', 'count']).rename(columns={'sum': 'Currency', 'count': 'Count'}).reset_index()
df.to_csv('output.csv', index=False)
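
As a side note on the date handling: a vectorized alternative to the per-row apply, using pd.to_datetime with an explicit format, might look like this (a sketch against the same input.csv):

# Possible vectorized replacement for the apply: parse the whole Date column
# at once, then format it back to "mm-yy" strings
df[r'Month\Year'] = pd.to_datetime(df['Date'], format='%d-%b-%Y').dt.strftime('%m-%y')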

Is this an optimal and readable pandas-based solution? Can it be further improved?

In particular, I am not quite happy with renaming the columns after aggregation; there should be a more straightforward way to aggregate directly into custom-named columns.
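
For what it's worth, newer pandas releases (0.25+) offer named aggregation, which looks like the kind of API I'm after: keyword arguments to .agg() name the output columns directly, so the rename step disappears. A sketch:

# Named aggregation (pandas 0.25+): the keyword names become the output
# columns, so no rename() is needed
df = (df.groupby(['Network', 'Product', r'Month\Year'])['Amount']
        .agg(Currency='sum', Count='count')
        .reset_index())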
