7
\$\begingroup\$

I have a csv file that looks like this:

,age,department,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
0,28,HR,Postgraduate,Referral,5,2.0,1,0,78075.0,Male,2019年02月01日,1
1,50,Technology,Postgraduate,Recruitment Agency,3,5.0,2,1,38177.1,Male,2017年01月17日,0
2,43,Technology,Undergraduate,Referral,4,1.0,2,0,59143.5,Female,2012年08月27日,1
3,44,Sales,Postgraduate,On-Campus,2,3.0,0,0,26824.5,Female,2017年07月25日,1
4,33,HR,Undergraduate,Recruitment Agency,2,1.0,5,0,26824.5,Male,2019年05月17日,1
5,40,Purchasing,Undergraduate,Walk-in,3,3.0,7,1,38177.1,Male,2004年04月22日,1
6,26,Purchasing,Undergraduate,Referral,5,5.0,2,0,78075.0,Male,2019年12月10日,1
7,25,Technology,Undergraduate,Recruitment Agency,1,1.0,4,0,21668.4,Female,2017年03月18日,0
8,35,HR,Postgraduate,Referral,3,4.0,0,0,38177.1,Female,2015年04月02日,1
9,45,Technology,Postgraduate,Referral,3,3.0,9,0,38177.1,Female,2004年03月19日,0
10,31,Marketing,Undergraduate,Walk-in,4,4.0,6,0,59143.5,Male,2009年01月24日,1
11,43,Technology,Postgraduate,Recruitment Agency,2,1.0,9,1,26824.5,Male,2016年03月10日,1
12,28,Technology,Undergraduate,On-Campus,3,4.0,0,0,38177.1,Female,2013年04月24日,0
13,48,Purchasing,Postgraduate,Referral,3,4.0,8,0,38177.1,Male,2010年07月25日,1
14,52,Purchasing,Postgraduate,Recruitment Agency,5,1.0,7,0,78075.0,Male,2018年02月07日,1
15,50,Purchasing,Undergraduate,Recruitment Agency,5,5.0,6,0,78075.0,Male,2014年04月24日,1
16,34,Marketing,Postgraduate,On-Campus,1,4.0,9,0,21668.4,Male,2014年12月10日,0
17,24,Purchasing,Undergraduate,Recruitment Agency,4,4.0,6,0,59143.5,Female,2018年02月18日,1
18,54,HR,Postgraduate,On-Campus,1,5.0,4,0,21668.4,Female,2014年05月07日,1
19,25,Sales,Undergraduate,Recruitment Agency,5,4.0,4,0,78075.0,Male,2012年02月15日,1
20,35,HR,Undergraduate,On-Campus,2,4.0,4,0,26824.5,Female,2008年01月15日,1
21,50,HR,Postgraduate,Referral,5,4.0,0,0,78075.0,Male,2015年04月13日,1
22,34,Purchasing,Postgraduate,Referral,4,2.0,7,1,59143.5,Male,2013年07月02日,1
23,37,Sales,Undergraduate,Recruitment Agency,5,5.0,0,1,78075.0,Male,2016年03月22日,1
24,31,Sales,Postgraduate,Walk-in,4,4.0,3,1,59143.5,Female,2006年09月05日,1
25,53,Sales,Postgraduate,Walk-in,4,5.0,8,1,59143.5,Female,2005年10月08日,1
26,45,Marketing,Undergraduate,Walk-in,4,3.0,8,0,59143.5,Male,2008年01月08日,1
27,40,Purchasing,Undergraduate,Walk-in,4,3.0,4,1,59143.5,Female,2005年11月19日,0

The question that should be answered is how many people are recruited per department as a function of time. This should be shown in a line chart.

This was my solution:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("employees_satisfaction_transformed.csv", index_col=0)
recruitment_groups = df.groupby("recruitment_type")
campus = recruitment_groups.get_group("On-Campus")["entry_date"]
walk_in = recruitment_groups.get_group("Walk-in")["entry_date"]
referral = recruitment_groups.get_group("Referral")["entry_date"]
agency = recruitment_groups.get_group("Recruitment Agency")["entry_date"]
campus = campus.sort_values().reset_index()
campus['index'] = campus.index
walk_in = walk_in.sort_values().reset_index()
walk_in['index'] = walk_in.index
referral = referral.sort_values().reset_index()
referral['index'] = referral.index
agency = agency.sort_values().reset_index()
agency['index'] = agency.index
plt.plot(campus['entry_date'], campus['index'], label="campus")
plt.plot(walk_in['entry_date'], walk_in['index'], label="walk_in")
plt.plot(referral['entry_date'], referral['index'], label="referral")
plt.plot(agency['entry_date'], agency['index'], label="agency")
plt.legend(loc='best')
plt.show()

I'm sort of new to pandas so any critique is welcome.

asked May 28, 2023 at 23:12
\$\endgroup\$

2 Answers 2

5
\$\begingroup\$

J_H's advice about DRY is good for Python in general.

However for Pandas in particular, we should almost never iterate. Instead we chain methods that operate on the calling object.

Here your repeated grouping/sorting/counting/plotting calls can be reduced to two groupby methods:

So this is a more idiomatic DRY version in Pandas:

# format the dates properly using `parse_dates`
df = pd.read_csv('employees_satisfaction_transformed.csv', parse_dates=['entry_date'], index_col=0)
# sort once
df = df.sort_values('entry_date')
# count recruits by type using `groupby.cumcount`
df['count'] = df.groupby('recruitment_type').cumcount()
# plot count vs date using `groupby.plot`
df.set_index('entry_date').groupby('recruitment_type')['count'].plot(legend=True)

output of revised code

answered Jul 19, 2023 at 18:38
\$\endgroup\$
3
\$\begingroup\$

break out helpers

You serially assign these four variables:

  • campus
  • walk_in
  • referral
  • agency

This code is just crying out to you to define a helper method and then iterate over those four columns. Which would include the whole sort / reset thing.

There's an opportunity for a for loop to do some plotting, but that's a separate item.


define function

You created a bunch of top-level global variables. To reduce coupling define them within a function, perhaps def main():, so they go out of scope once the function exits and then they won't pollute the global namespace.

answered May 29, 2023 at 4:18
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.