Grouping and aggregating using Pandas

This was inspired by the Aggregate loans report without using Python standard aggregate or group functions question, but I've decided to approach it using pandas.
To recap, sample input:
MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671
Desired output:
Network,Product,Month\Year,Currency,Count
Network 1,Loan Product 1,03-16,1000,1
Network 2,Loan Product 1,03-16,1122,1
Network 2,Loan Product 1,04-16,5671,1
Network 3,Loan Product 2,03-16,5182,2
In other words, the task is to group the data from the input.csv file by Network, Product and the month+year of the Date column, then calculate the sum of the Amount column (reported as Currency) while keeping track of the count in each group.
I've solved it by first creating a separate Month\Year column (parsing the Date values into datetime objects and formatting them as month-year), then grouping by the desired columns with .groupby(), aggregating with sum and count, and finally renaming the columns to the desired names:
from datetime import datetime
import pandas as pd

df = pd.read_csv('input.csv')
# Parse the "dd-Mon-yyyy" dates and reformat them as "mm-yy";
# a raw string keeps the backslash in the column name from being read as an escape sequence.
df[r'Month\Year'] = df['Date'].apply(lambda s: datetime.strptime(s, '%d-%b-%Y').strftime('%m-%y'))
# Group, aggregate, and rename the aggregate columns to the desired output names.
grouped = df.groupby(['Network', 'Product', r'Month\Year'])['Amount']
df = grouped.agg(['sum', 'count']).rename(columns={'sum': 'Currency', 'count': 'Count'}).reset_index()
df.to_csv('output.csv', index=False)
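As a side note, I suspect the per-row apply() with strptime could be replaced by pandas' own date handling; a rough, untested sketch of what I have in mind (assuming pd.to_datetime accepts an explicit format string, which I believe it does):

# Vectorised variant of the date conversion: parse the whole column at once,
# then render it back as "mm-yy" strings.
df[r'Month\Year'] = pd.to_datetime(df['Date'], format='%d-%b-%Y').dt.strftime('%m-%y')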
Is this the most efficient and readable pandas-based solution? Can it be further improved?
In particular, I'm not quite happy with renaming the columns after aggregation; there should be a more straightforward way to aggregate into custom-named columns.
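For reference, this is roughly the kind of "aggregate straight into named columns" I'm hoping exists (I believe newer pandas releases support passing the output names as keyword arguments to agg, but I'm not sure whether that is considered idiomatic):

# Named aggregation: the output columns get their final names directly,
# so no rename() step is needed afterwards.
grouped = df.groupby(['Network', 'Product', r'Month\Year'])['Amount']
result = grouped.agg(Currency='sum', Count='count').reset_index()
result.to_csv('output.csv', index=False)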