groupby in pandas and plot

Question 1

I have a csv file that looks like this:

,age,department,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
0,28,HR,Postgraduate,Referral,5,2.0,1,0,78075.0,Male,2019年02月01日,1
1,50,Technology,Postgraduate,Recruitment Agency,3,5.0,2,1,38177.1,Male,2017年01月17日,0
2,43,Technology,Undergraduate,Referral,4,1.0,2,0,59143.5,Female,2012年08月27日,1
3,44,Sales,Postgraduate,On-Campus,2,3.0,0,0,26824.5,Female,2017年07月25日,1
4,33,HR,Undergraduate,Recruitment Agency,2,1.0,5,0,26824.5,Male,2019年05月17日,1
5,40,Purchasing,Undergraduate,Walk-in,3,3.0,7,1,38177.1,Male,2004年04月22日,1
6,26,Purchasing,Undergraduate,Referral,5,5.0,2,0,78075.0,Male,2019年12月10日,1
7,25,Technology,Undergraduate,Recruitment Agency,1,1.0,4,0,21668.4,Female,2017年03月18日,0
8,35,HR,Postgraduate,Referral,3,4.0,0,0,38177.1,Female,2015年04月02日,1
9,45,Technology,Postgraduate,Referral,3,3.0,9,0,38177.1,Female,2004年03月19日,0
10,31,Marketing,Undergraduate,Walk-in,4,4.0,6,0,59143.5,Male,2009年01月24日,1
11,43,Technology,Postgraduate,Recruitment Agency,2,1.0,9,1,26824.5,Male,2016年03月10日,1
12,28,Technology,Undergraduate,On-Campus,3,4.0,0,0,38177.1,Female,2013年04月24日,0
13,48,Purchasing,Postgraduate,Referral,3,4.0,8,0,38177.1,Male,2010年07月25日,1
14,52,Purchasing,Postgraduate,Recruitment Agency,5,1.0,7,0,78075.0,Male,2018年02月07日,1
15,50,Purchasing,Undergraduate,Recruitment Agency,5,5.0,6,0,78075.0,Male,2014年04月24日,1
16,34,Marketing,Postgraduate,On-Campus,1,4.0,9,0,21668.4,Male,2014年12月10日,0
17,24,Purchasing,Undergraduate,Recruitment Agency,4,4.0,6,0,59143.5,Female,2018年02月18日,1
18,54,HR,Postgraduate,On-Campus,1,5.0,4,0,21668.4,Female,2014年05月07日,1
19,25,Sales,Undergraduate,Recruitment Agency,5,4.0,4,0,78075.0,Male,2012年02月15日,1
20,35,HR,Undergraduate,On-Campus,2,4.0,4,0,26824.5,Female,2008年01月15日,1
21,50,HR,Postgraduate,Referral,5,4.0,0,0,78075.0,Male,2015年04月13日,1
22,34,Purchasing,Postgraduate,Referral,4,2.0,7,1,59143.5,Male,2013年07月02日,1
23,37,Sales,Undergraduate,Recruitment Agency,5,5.0,0,1,78075.0,Male,2016年03月22日,1
24,31,Sales,Postgraduate,Walk-in,4,4.0,3,1,59143.5,Female,2006年09月05日,1
25,53,Sales,Postgraduate,Walk-in,4,5.0,8,1,59143.5,Female,2005年10月08日,1
26,45,Marketing,Undergraduate,Walk-in,4,3.0,8,0,59143.5,Male,2008年01月08日,1
27,40,Purchasing,Undergraduate,Walk-in,4,3.0,4,1,59143.5,Female,2005年11月19日,0

The question that should be answered is how many people are recruited per department as a function of time. This should be shown in a line chart.

This was my solution:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("employees_satisfaction_transformed.csv", index_col=0)
recruitment_groups = df.groupby("recruitment_type")
campus = recruitment_groups.get_group("On-Campus")["entry_date"]
walk_in = recruitment_groups.get_group("Walk-in")["entry_date"]
referral = recruitment_groups.get_group("Referral")["entry_date"]
agency = recruitment_groups.get_group("Recruitment Agency")["entry_date"]
campus = campus.sort_values().reset_index()
campus['index'] = campus.index
walk_in = walk_in.sort_values().reset_index()
walk_in['index'] = walk_in.index
referral = referral.sort_values().reset_index()
referral['index'] = referral.index
agency = agency.sort_values().reset_index()
agency['index'] = agency.index
plt.plot(campus['entry_date'], campus['index'], label="campus")
plt.plot(walk_in['entry_date'], walk_in['index'], label="walk_in")
plt.plot(referral['entry_date'], referral['index'], label="referral")
plt.plot(agency['entry_date'], agency['index'], label="agency")
plt.legend(loc='best')
plt.show()

I'm sort of new to pandas so any critique is welcome.

Question 2

J_H's advice about DRY is good for Python in general.

However for Pandas in particular, we should almost never iterate. Instead we chain methods that operate on the calling object.

Here your repeated grouping/sorting/counting/plotting calls can be reduced to two groupby methods:

So this is a more idiomatic DRY version in Pandas:

# format the dates properly using `parse_dates`
df = pd.read_csv('employees_satisfaction_transformed.csv', parse_dates=['entry_date'], index_col=0)
# sort once
df = df.sort_values('entry_date')
# count recruits by type using `groupby.cumcount`
df['count'] = df.groupby('recruitment_type').cumcount()
# plot count vs date using `groupby.plot`
df.set_index('entry_date').groupby('recruitment_type')['count'].plot(legend=True)

output of revised code

Question 3

break out helpers

You serially assign these four variables:

campus
walk_in
referral
agency

This code is just crying out to you to define a helper method and then iterate over those four columns. Which would include the whole sort / reset thing.

There's an opportunity for a for loop to do some plotting, but that's a separate item.

define function

You created a bunch of top-level global variables. To reduce coupling define them within a function, perhaps def main():, so they go out of scope once the function exits and then they won't pollute the global namespace.

tdy tdytdy 2,2661 gold badge10 silver badges21 bronze badges · Accepted Answer · 2023-07-19 18:38:49Z

J_H's advice about DRY is good for Python in general.

However for Pandas in particular, we should almost never iterate. Instead we chain methods that operate on the calling object.

Here your repeated grouping/sorting/counting/plotting calls can be reduced to two groupby methods:

So this is a more idiomatic DRY version in Pandas:

# format the dates properly using `parse_dates`
df = pd.read_csv('employees_satisfaction_transformed.csv', parse_dates=['entry_date'], index_col=0)
# sort once
df = df.sort_values('entry_date')
# count recruits by type using `groupby.cumcount`
df['count'] = df.groupby('recruitment_type').cumcount()
# plot count vs date using `groupby.plot`
df.set_index('entry_date').groupby('recruitment_type')['count'].plot(legend=True)

output of revised code

Stack Exchange Network

groupby in pandas and plot

2 Answers 2

break out helpers

define function

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

groupby in pandas and plot

2 Answers 2

break out helpers

define function

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions