I have a data frame with columns for three predictors that try to predict the target variable 'change' with values 'yes' and 'no'. The data frame is grouped by id, and for every id, I have 4 years.
Per id number, I want to summarize the combination of the predictors and the change-variable according to the following rules:
1. Keep rows where there is any prediction from any predictor (like any(!is.na(c(pred1,pred2,pred3)))).
2. Keep rows where change == 'yes'.
3. Keep rows with both conditions 1 and 2.
4. If an id meets conditions 1, 2, or 3, delete all rows where all predictors are NA and change == 'no'.
5. Keep only 1 row when all predictors are NA and change == 'no'.
The current code needs two steps to accomplish the goal because the first step keeps rows with no predictions and no change if they belong to the year 2015. This guarantees condition 5 but requires an additional round of removal to satisfy condition 4.
I've recently shifted my programming style from base R to tidyverse, so I'm a beginner with the package. I was hoping there would be a shorter way to go about this using dplyr.
library(dplyr)
# generate input data. No review needed here, as it is not the original data
set.seed(1)
inputDF <- data.frame(
  id = rep(1001:1005, each = 4),
  year = rep(2015:2018, times = 5),
  change = c(rep('no', 12), 'yes', rep('no', 2), rep(c('yes', 'no'), 2), 'yes'),
  pred1 = c(rep(NA, 10), rep(c(NA, NA, 1), 2), rep(NA, 3), 1),
  pred2 = c(rep(NA, 8), rep(c(NA, 1, NA), 2), rep(1, 3), rep(NA, 3)),
  pred3 = c(rep(NA, 8), rep(c(NA, 1, 1), 2), rep(NA, 6))) %>%
  mutate(across(contains('pred'),
                ~ if_else(!is.na(.x), rnorm(n(), mean = row_number()), .x))) %>%
  mutate(rownum = row_number())
#' first round of deleting rows summarizes ids
#' with only NA predictions and no change
deleterows1 <- inputDF %>%
  group_by(id) %>%
  filter(change == 'no' &
           is.na(pred1) &
           is.na(pred2) &
           is.na(pred3) &
           year != 2015) %>%
  select(c(rownum, id)) %>%
  as.data.frame()
filteredDF <- inputDF %>%
  rows_delete(., deleterows1, by = c('rownum', 'id'))
#' The second round of deleting rows ensures that
#' rows with no predictions and no change are deleted
#' in id groups with predictions and/or change
deleterows2 <- filteredDF %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  filter(change == 'no' &
           is.na(pred1) &
           is.na(pred2) &
           is.na(pred3) &
           year == 2015) %>%
  select(c(rownum, id)) %>%
  as.data.frame()
filteredDF <- filteredDF %>%
  rows_delete(., deleterows2, by = c('rownum', 'id')) %>%
  select(-rownum)
1 Answer
Fundamentally, you are trying to keep a row if:
- It meets the conditions that make it a "good" row (it has a change or at least one prediction), or
- It's the first row for an id that has no "good" rows.
To me, then, the way to accomplish this would be to compute which rows are good rows and to filter accordingly. It might look something like this:
filtered2 <- inputDF %>%
  mutate(good = (change == "yes" | !is.na(pred1) | !is.na(pred2) | !is.na(pred3))) %>%
  group_by(id) %>%
  filter(good | (row_number() == 1 & sum(good) == 0)) %>%
  select(-good, -rownum)
We can confirm this gives the same results as your code:
identical(as.data.frame(filtered2), as.data.frame(filteredDF))
# [1] TRUE
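If the two results ever disagreed, identical() would only report FALSE without saying where. One possible way to locate the differing rows is anti_join() in both directions. This is a minimal sketch; resA and resB are hypothetical stand-ins for two filtered results, not objects from the question:

```r
library(dplyr)

# Hypothetical stand-ins for two filtered results to compare.
resA <- data.frame(id = c(1, 2, 3), year = c(2015, 2016, 2017))
resB <- data.frame(id = c(1, 2),    year = c(2015, 2016))

# Rows that appear in one result but not in the other.
onlyInA <- anti_join(resA, resB, by = c("id", "year"))
onlyInB <- anti_join(resB, resA, by = c("id", "year"))
```

Both results being empty would suggest the two data frames contain the same key combinations, although this check ignores row order and duplicate rows.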
Note that by defining the good variable, I avoid needing to repeat the long condition that defines whether a row is good or not. Further, note that I can perform the filtering in one shot by keeping a row either if it's good or if none of the rows are good for an id (sum(good) == 0) and it's the first row for the id (row_number() == 1).
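If the number of predictor columns grows, listing each is.na() call by hand becomes tedious. A possible variation, assuming dplyr 1.0.4 or later (where if_any() is available) and that all predictor columns share the prefix "pred", is sketched below on a small hypothetical data frame toyDF:

```r
library(dplyr)

# Hypothetical toy data: three ids, two years each.
toyDF <- data.frame(
  id     = rep(1:3, each = 2),
  year   = rep(2015:2016, times = 3),
  change = c("no", "no", "no", "yes", "no", "no"),
  pred1  = c(NA, NA, NA, NA, 0.5, NA),
  pred2  = c(NA, NA, NA, NA, NA, NA)
)

keptDF <- toyDF %>%
  # A row is "good" if it has a change or at least one non-missing prediction.
  mutate(good = change == "yes" | if_any(starts_with("pred"), ~ !is.na(.x))) %>%
  group_by(id) %>%
  # Keep good rows; for ids without any good row, keep only the first row.
  filter(good | (row_number() == 1 & !any(good))) %>%
  ungroup() %>%
  select(-good)
```

Here !any(good) plays the same role as sum(good) == 0, and ungroup() returns an ungrouped result so later steps are not affected by the grouping.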
A final comment: a nice aspect of using row_number instead of filtering on the year being 2015 (the first year in your dataset) is that this will continue working if your data changes (e.g. to year range 2016-2019 instead of 2015-2018).
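To illustrate, here is a small sketch on hypothetical data (shiftedDF, with a single predictor) whose first year is 2016. The arrange() call is an addition of mine, on the assumption that the row kept for an id without good rows should be the earliest year:

```r
library(dplyr)

# Hypothetical data whose first year is 2016 rather than 2015.
shiftedDF <- data.frame(
  id     = rep(1:2, each = 3),
  year   = rep(2016:2018, times = 2),
  change = c("no", "no", "no", "no", "yes", "no"),
  pred1  = c(NA, NA, NA, NA, NA, 1.2)
)

resultDF <- shiftedDF %>%
  arrange(id, year) %>%  # make "first row" mean "earliest year" within each id
  mutate(good = change == "yes" | !is.na(pred1)) %>%
  group_by(id) %>%
  filter(good | (row_number() == 1 & !any(good))) %>%
  ungroup() %>%
  select(-good)
```

No year is hard-coded, so id 1 (no good rows) keeps its 2016 row, while id 2 keeps only its two good rows.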
Great solution! Easy to interpret, scalable, and efficient. Just what I needed. – saQuist, Jan 12, 2022 at 15:32