I have the data frame below, and I'm trying to create a new data frame that shows the names of the departments with the highest and the lowest employee turnover. The code I wrote is correct, but it is too long, so I am wondering how I can simplify it. Thanks.
My data:
df = data.frame(
  department = c("admin", "engineering", "finance", "IT", "logistics", "marketing", "operations", "retail", "sales", "support", "admin", "engineering", "finance", "IT", "logistics", "marketing", "admin", "retail", "admin", "engineering"),
  promoted = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1),
  review = c(0.4, 0.5, 0.4, 0.8, 0.4, 0.1, 0.9, 0.2, 0.1, 0.1, 0.1, 0.7, 0.1, 0.55, 0.4, 0.33, 0.11, 0.1, 0.11, 0.1),
  projects = c(1, 2, 1, 3, 4, 1, 5, 0, 1, 1, 2, 1, 3, 4, 1, 5, 0, 1, 0, 1),
  salary = c("low", "medium", "high", "low", "medium", "low", "medium", "low", "low", "low", "medium", "high", "low", "medium", "low", "medium", "low", "low", "low", "medium"),
  tenure = c(1, 2, 1, 3, 4, 1, 5, 0, 1, 1, 2, 1, 3, 4, 1, 5, 0, 1, 0, 1),
  satisfaction = c(0.4, 0.5, 0.4, 0.8, 0.4, 0.1, 0.9, 0.2, 0.1, 0.1, 0.1, 0.7, 0.1, 0.55, 0.4, 0.33, 0.11, 0.1, 0.11, 0.1),
  bonus = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1),
  left = c("yes", "no", "yes", "no", "no", "no", "yes", "yes", "no", "yes", "no", "yes", "no", "no", "no", "yes", "yes", "no", "yes", "no"))
My code:
library(tidyverse)
df1 <- df%>%
  count(department)
colnames(df1) <- c('department','Total')
df2<- df%>%
  filter(left == "yes")%>%
  count(department)
colnames(df2) <- c('department','Yes')
df2$Yes<-as.numeric(df2$Yes)
df1$Total<-as.numeric(df1$Total)
df3 <- inner_join(df1, df2)
head(df3, 10)
df3Max <-df3%>%
  mutate( turnover = Yes/Total ) %>%
  arrange(desc(turnover))
df3Max <- head(df3Max, 1)
df3Min <-df3%>%
  mutate( turnover = Yes/Total ) %>%
  arrange(turnover)
df3Min <- head(df3Min, 1)
Turnover <- rbind(df3Max, df3Min)
- At the moment your code removes any departments with no turnover, so it's returning the department with the smallest non-zero turnover. Is this the intended behavior? (josliber, Feb 9, 2022 at 16:57)
- The current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How do I ask a good question?. (BCdotWEB, Feb 10, 2022 at 6:52)
1 Answer
Right now your code separately computes the total number in each department (df1) and the number who left (df2) and then joins these results to get department summaries (df3). However, this would be more efficient as a grouped operation:
df %>%
  group_by(department) %>%
  summarize(Total=n(), Yes=sum(left=="yes"))
# # A tibble: 10 × 3
# department Total Yes
# <fct> <int> <int>
# 1 admin 4 3
# 2 engineering 3 1
# 3 finance 2 1
# 4 IT 2 0
# 5 logistics 2 0
# 6 marketing 2 1
# 7 operations 1 1
# 8 retail 2 1
# 9 sales 1 0
# 10 support 1 1
Beyond being more compact code, this helps you not have to think about various details (e.g. inner join versus outer join when combining df1 and df2).
Now you basically just need to create your turnover variable, order by turnover, and grab the top and bottom. It turns out grabbing the top and bottom can be efficiently handled with slice (see more here), which prevents you from needing to separately grab the top (df3Max) and bottom (df3Min) and then combine:
df %>%
  group_by(department) %>%
  summarize(Total=n(), Yes=sum(left=="yes")) %>%
  ungroup() %>%
  mutate(turnover=Yes/Total) %>%
  arrange(turnover) %>%
  slice(c(1, n()))
# # A tibble: 2 × 4
# department Total Yes turnover
# <fct> <int> <int> <dbl>
# 1 IT 2 0 0
# 2 support 1 1 1
Note that grabbing the top and bottom in this way also avoids code repetition in defining turnover and in sorting.
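One thing to keep in mind is that this data has ties: in the department summary above, three departments have a turnover of 0 and two have a turnover of 1, so taking the first and last rows returns just one department from each tied group. If you would rather see every tied department, a possible variation (my assumption about what you might want, not something your question asks for) uses slice_min() and slice_max(), which are available in dplyr 1.0.0 and later and keep ties by default. The sketch below rebuilds the summary table by hand so that it runs on its own; dept, lowest, highest and extremes are names I made up for illustration:

```r
library(dplyr)

# Department summary copied from the grouped output shown earlier
dept <- data.frame(
  department = c("admin", "engineering", "finance", "IT", "logistics",
                 "marketing", "operations", "retail", "sales", "support"),
  Total = c(4, 3, 2, 2, 2, 2, 1, 2, 1, 1),
  Yes   = c(3, 1, 1, 0, 0, 1, 1, 1, 0, 1)
) %>%
  mutate(turnover = Yes / Total)

# with_ties = TRUE is the default, so every department tied for the
# extreme value is returned, not just one of them
lowest  <- slice_min(dept, turnover, n = 1)
highest <- slice_max(dept, turnover, n = 1)

extremes <- bind_rows(lowest, highest)
```

With this data, lowest holds IT, logistics and sales, and highest holds operations and support. Passing with_ties = FALSE would go back to one row per extreme.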
Your code in the question removes any department with no turnover due to its use of an inner join between df1 and df2. If that's the desired behavior, then you can just add in a filter(Yes > 0) to replicate that behavior.
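As a sketch of where that filter could go, the pipeline below places it right after the summary step. To keep the example self-contained it rebuilds only the department and left columns from your data, and the result name turnover_nonzero is my own:

```r
library(dplyr)

# Only the two columns the calculation needs, copied from the question's data
df <- data.frame(
  department = c("admin", "engineering", "finance", "IT", "logistics",
                 "marketing", "operations", "retail", "sales", "support",
                 "admin", "engineering", "finance", "IT", "logistics",
                 "marketing", "admin", "retail", "admin", "engineering"),
  left = c("yes", "no", "yes", "no", "no", "no", "yes", "yes", "no", "yes",
           "no", "yes", "no", "no", "no", "yes", "yes", "no", "yes", "no")
)

turnover_nonzero <- df %>%
  group_by(department) %>%
  summarize(Total = n(), Yes = sum(left == "yes"), .groups = "drop") %>%
  filter(Yes > 0) %>%                 # drop departments where nobody left
  mutate(turnover = Yes / Total) %>%
  arrange(turnover) %>%
  slice(c(1, n()))                    # first row = lowest, last row = highest
```

With the filter in place, the lowest remaining turnover belongs to engineering (1 of 3 employees left). The highest is still 1, a value shared by operations and support, so which of those two appears depends on how the tie is ordered.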