Conditionally drop rows in grouped data.frame using dplyr in R

Question

I have a data frame with columns for three predictors that try to predict the target variable 'change' with values 'yes' and 'no'. The data frame is grouped by id, and for every id, I have 4 years.

Per id number, I want to summarize the combination of the predictors and the change-variable according to the following rules:

1. Keep rows where at least one predictor has a prediction (like any(!is.na(c(pred1,pred2,pred3)))).
2. Keep rows where change == 'yes'.
3. Keep rows that meet both conditions 1 and 2.
4. If an id has any row meeting conditions 1, 2, or 3, delete all of its rows where all predictors are NA and change == 'no'.
5. Otherwise, keep only 1 row for the id when all predictors are NA and change == 'no'.
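One detail about the shorthand in rule 1: any() collapses everything it is given into a single TRUE/FALSE, so inside filter() it would answer "does this table (or group) have any prediction?" rather than "does this row?". A minimal sketch of the difference, using a hypothetical toyDF that is not the data from this question:

```r
library(dplyr)

# Hypothetical toy data: row 1 has no prediction, rows 2 and 3 have one each
toyDF <- data.frame(
  pred1 = c(NA, 1, NA),
  pred2 = c(NA, NA, NA),
  pred3 = c(NA, NA, 2)
)

# any() over the concatenated columns yields one value for the whole table
wholeTable <- with(toyDF, any(!is.na(c(pred1, pred2, pred3))))

# Combining the is.na() checks with | yields one value per row
perRow <- toyDF %>%
  mutate(hasPred = !is.na(pred1) | !is.na(pred2) | !is.na(pred3)) %>%
  pull(hasPred)
```

Here wholeTable is a single TRUE, while perRow flags only the second and third rows, which is the per-row check that rule 1 describes.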

My current code needs two steps to accomplish this. The first step deletes the rows with no predictions and no change unless they belong to the year 2015, which guarantees condition 5. A second round of deletion is then required to satisfy condition 4.

I've recently shifted my programming style from base R to tidyverse, so I'm a beginner with the package. I was hoping there would be a shorter way to go about this using dplyr.

library(dplyr)
# generate input data. No review needed here, as it is not the original data 
set.seed(1)
inputDF <- data.frame(
 id = rep(1001:1005, each = 4) ,
 year = rep(2015:2018, times = 5) ,
 change = c(rep('no',12),'yes', rep('no',2), rep(c('yes', 'no'), 2), 'yes'),
 pred1 = c(rep(NA, 10), rep(c(NA,NA,1),2),rep(NA,3),1),
 pred2 = c(rep(NA, 8), rep(c(NA,1,NA),2),rep(1,3),rep(NA,3)),
 pred3 = c(rep(NA, 8), rep(c(NA,1,1),2),rep(NA,6))) %>% 
 mutate(across(contains('pred'), 
 ~ if_else(!is.na(.x), rnorm(n(), mean=row_number()), .x) )) %>% 
 mutate(rownum = row_number())
#' first round of deleting rows summarizes ids 
#' with only NA predictions and no change 
deleterows1 <- inputDF %>%
 group_by(id) %>% 
 filter(change == 'no' & 
 is.na(pred1) &
 is.na(pred2) &
 is.na(pred3) &
 year != 2015) %>% 
 select(c(rownum,id)) %>% as.data.frame()
filteredDF <- inputDF %>% 
 rows_delete(., deleterows1, by = c('rownum', 'id'))
#' The second round of deleting rows ensures that 
#' rows with no predictions and no change are deleted
#' in id groups with predictions and/or change
deleterows2 <- filteredDF %>% 
 group_by(id) %>% 
 filter(n() > 1) %>% 
 filter(change == 'no' & 
 is.na(pred1) &
 is.na(pred2) &
 is.na(pred3) &
 year == 2015) %>% 
 select(c(rownum,id)) %>% as.data.frame()
filteredDF <- filteredDF %>%
 rows_delete(., deleterows2, by = c('rownum', 'id')) %>% 
 select(-rownum)

Answer

Fundamentally, you are trying to keep a row if:

It meets the conditions that make it a "good" row (change is "yes" or at least one predictor is not NA), or
It's the first row for an id that has no "good" rows.

To me, then, the way to accomplish this would be to compute which rows are good rows and to filter accordingly. It might look something like this:

filtered2 <- inputDF %>%
 mutate(good=(change=="yes" | !is.na(pred1) | !is.na(pred2) | !is.na(pred3))) %>%
 group_by(id) %>%
 filter(good | (row_number() == 1 & sum(good) == 0)) %>%
 select(-good, -rownum)

We can confirm this gives the same results as your code:

identical(as.data.frame(filtered2), as.data.frame(filteredDF))
# [1] TRUE

Note that by defining the good variable, I avoid repeating the long condition that decides whether a row is good. The filtering then happens in one shot: a row is kept if it's good, or if it's the first row for its id (row_number() == 1) and none of that id's rows are good (sum(good) == 0).
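If the number of predictor columns grows, the good condition could also be written without naming each column. The following is only a sketch, assuming dplyr 1.0.4 or later (for if_any()) and that every predictor column name starts with "pred"; toyDF and filteredToy are hypothetical names, not part of the original data:

```r
library(dplyr)

# Hypothetical toy data: id 1 has no good rows, id 2 has a change,
# id 3 has predictions in two different rows
toyDF <- data.frame(
  id     = rep(1:3, each = 3),
  year   = rep(2015:2017, times = 3),
  change = c("no", "no", "no",  "no", "yes", "no",  "no", "no", "no"),
  pred1  = c(NA, NA, NA,  NA, NA, NA,  NA, 0.5, NA),
  pred2  = c(NA, NA, NA,  NA, NA, NA,  NA, NA, 1.2)
)

filteredToy <- toyDF %>%
  # if_any() is TRUE for a row when any selected column passes the check
  mutate(good = change == "yes" | if_any(starts_with("pred"), ~ !is.na(.x))) %>%
  group_by(id) %>%
  # keep good rows, or the first row of an id that has no good rows
  filter(good | (row_number() == 1 & !any(good))) %>%
  ungroup() %>%
  select(-good)
```

With this toy input, id 1 keeps only its first row, id 2 keeps the row with the change, and id 3 keeps both rows that carry a prediction. Using !any(good) instead of sum(good) == 0 is purely a matter of taste.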

A final comment: a nice aspect of using row_number instead of filtering on the year being 2015 (the first year in your dataset) is that this will continue working if your data changes (e.g. to year range 2016-2019 instead of 2015-2018).
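One caveat: row_number() == 1 picks whichever row happens to come first within each id, so it only means "earliest year" when the rows are already in year order. If that is not guaranteed, sorting first is a cheap safeguard. A small sketch with a hypothetical, deliberately shuffled shuffledDF:

```r
library(dplyr)

# Hypothetical toy data whose rows are out of year order within each id
shuffledDF <- data.frame(
  id     = c(1, 1, 1, 2, 2),
  year   = c(2018, 2016, 2017, 2017, 2016),
  change = c("no", "no", "no", "yes", "no"),
  pred1  = c(NA, NA, NA, NA, NA)
)

keptDF <- shuffledDF %>%
  # sort so that the first row of each id is its earliest year
  arrange(id, year) %>%
  mutate(good = change == "yes" | !is.na(pred1)) %>%
  group_by(id) %>%
  filter(good | (row_number() == 1 & !any(good))) %>%
  ungroup() %>%
  select(-good)
```

Here id 1 has no good rows, so its 2016 row is kept after sorting; without the arrange() call the 2018 row would have been kept instead.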

Comment

Great solution! Easy to interpret, scalable, and efficient. Just what I needed

josliber · Accepted Answer · 2022-01-12 14:59:59Z

