I have a csv file that looks like this:
,age,department,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
0,28,HR,Postgraduate,Referral,5,2.0,1,0,78075.0,Male,2019年02月01日,1
1,50,Technology,Postgraduate,Recruitment Agency,3,5.0,2,1,38177.1,Male,2017年01月17日,0
2,43,Technology,Undergraduate,Referral,4,1.0,2,0,59143.5,Female,2012年08月27日,1
3,44,Sales,Postgraduate,On-Campus,2,3.0,0,0,26824.5,Female,2017年07月25日,1
4,33,HR,Undergraduate,Recruitment Agency,2,1.0,5,0,26824.5,Male,2019年05月17日,1
5,40,Purchasing,Undergraduate,Walk-in,3,3.0,7,1,38177.1,Male,2004年04月22日,1
6,26,Purchasing,Undergraduate,Referral,5,5.0,2,0,78075.0,Male,2019年12月10日,1
7,25,Technology,Undergraduate,Recruitment Agency,1,1.0,4,0,21668.4,Female,2017年03月18日,0
8,35,HR,Postgraduate,Referral,3,4.0,0,0,38177.1,Female,2015年04月02日,1
9,45,Technology,Postgraduate,Referral,3,3.0,9,0,38177.1,Female,2004年03月19日,0
10,31,Marketing,Undergraduate,Walk-in,4,4.0,6,0,59143.5,Male,2009年01月24日,1
11,43,Technology,Postgraduate,Recruitment Agency,2,1.0,9,1,26824.5,Male,2016年03月10日,1
12,28,Technology,Undergraduate,On-Campus,3,4.0,0,0,38177.1,Female,2013年04月24日,0
13,48,Purchasing,Postgraduate,Referral,3,4.0,8,0,38177.1,Male,2010年07月25日,1
14,52,Purchasing,Postgraduate,Recruitment Agency,5,1.0,7,0,78075.0,Male,2018年02月07日,1
15,50,Purchasing,Undergraduate,Recruitment Agency,5,5.0,6,0,78075.0,Male,2014年04月24日,1
16,34,Marketing,Postgraduate,On-Campus,1,4.0,9,0,21668.4,Male,2014年12月10日,0
17,24,Purchasing,Undergraduate,Recruitment Agency,4,4.0,6,0,59143.5,Female,2018年02月18日,1
18,54,HR,Postgraduate,On-Campus,1,5.0,4,0,21668.4,Female,2014年05月07日,1
19,25,Sales,Undergraduate,Recruitment Agency,5,4.0,4,0,78075.0,Male,2012年02月15日,1
20,35,HR,Undergraduate,On-Campus,2,4.0,4,0,26824.5,Female,2008年01月15日,1
21,50,HR,Postgraduate,Referral,5,4.0,0,0,78075.0,Male,2015年04月13日,1
22,34,Purchasing,Postgraduate,Referral,4,2.0,7,1,59143.5,Male,2013年07月02日,1
23,37,Sales,Undergraduate,Recruitment Agency,5,5.0,0,1,78075.0,Male,2016年03月22日,1
24,31,Sales,Postgraduate,Walk-in,4,4.0,3,1,59143.5,Female,2006年09月05日,1
25,53,Sales,Postgraduate,Walk-in,4,5.0,8,1,59143.5,Female,2005年10月08日,1
26,45,Marketing,Undergraduate,Walk-in,4,3.0,8,0,59143.5,Male,2008年01月08日,1
27,40,Purchasing,Undergraduate,Walk-in,4,3.0,4,1,59143.5,Female,2005年11月19日,0
The question that should be answered is how many people are recruited per department as a function of time. This should be shown in a line chart.
This was my solution:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("employees_satisfaction_transformed.csv", index_col=0)
recruitment_groups = df.groupby("recruitment_type")
campus = recruitment_groups.get_group("On-Campus")["entry_date"]
walk_in = recruitment_groups.get_group("Walk-in")["entry_date"]
referral = recruitment_groups.get_group("Referral")["entry_date"]
agency = recruitment_groups.get_group("Recruitment Agency")["entry_date"]
campus = campus.sort_values().reset_index()
campus['index'] = campus.index
walk_in = walk_in.sort_values().reset_index()
walk_in['index'] = walk_in.index
referral = referral.sort_values().reset_index()
referral['index'] = referral.index
agency = agency.sort_values().reset_index()
agency['index'] = agency.index
plt.plot(campus['entry_date'], campus['index'], label="campus")
plt.plot(walk_in['entry_date'], walk_in['index'], label="walk_in")
plt.plot(referral['entry_date'], referral['index'], label="referral")
plt.plot(agency['entry_date'], agency['index'], label="agency")
plt.legend(loc='best')
plt.show()
I'm sort of new to pandas so any critique is welcome.
2 Answers 2
J_H's advice about DRY is good for Python in general.
However for Pandas in particular, we should almost never iterate. Instead we chain methods that operate on the calling object.
Here your repeated grouping/sorting/counting/plotting calls can be reduced to two groupby
methods:
So this is a more idiomatic DRY version in Pandas:
# format the dates properly using `parse_dates`
df = pd.read_csv('employees_satisfaction_transformed.csv', parse_dates=['entry_date'], index_col=0)
# sort once
df = df.sort_values('entry_date')
# count recruits by type using `groupby.cumcount`
df['count'] = df.groupby('recruitment_type').cumcount()
# plot count vs date using `groupby.plot`
df.set_index('entry_date').groupby('recruitment_type')['count'].plot(legend=True)
break out helpers
You serially assign these four variables:
- campus
- walk_in
- referral
- agency
This code is just crying out to you to define a helper method and then iterate over those four columns. Which would include the whole sort / reset thing.
There's an opportunity for a for
loop
to do some plotting, but that's a separate item.
define function
You created a bunch of top-level global variables. To reduce
coupling
define them within a function,
perhaps def main():
,
so they go out of scope once the function exits
and then they won't pollute the global namespace.