Function to calculate Persistence Rate with optional group_by variable and logical arguments

Question 1

The Function

"Persistence" is sometimes also referred to as "retention". The persistence rate is the share of units (IDs) in a given term/period that are also found in the subsequent term/period. So, if I have 10 customers in period 1, and 3 of those customers return in period 2, my persistence rate is 30%.
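The 10-customer example above can be checked with a few lines of base R. This is only a sketch of the definition, not the function under review; the customer IDs are made up for illustration.

```r
# Hypothetical customer IDs for two consecutive periods (illustrative only)
period_1 <- paste0("cust_", 1:10)
period_2 <- c("cust_2", "cust_5", "cust_9", "cust_11")  # cust_11 is a new customer

# Share of period-1 IDs that are also present in period 2
persistence_rate <- mean(period_1 %in% period_2)
persistence_rate
```

New IDs that first appear in period 2 do not affect the rate, because the denominator is the period-1 cohort.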

I have written a function that will either:

Calculate the persistence rate for each period's cohort of IDs if calculate = TRUE.
Create an indicator variable on the original dataframe that identifies whether the ID persisted (1) or not (0), if calculate = FALSE.

Furthermore, if overall = TRUE when calculate = TRUE, it will include the persistence rate over all of the terms.

The Arguments

Here is a brief description about each of the arguments:

df (REQUIRED): This is the dataframe argument, and a dataframe should be passed to this argument.
id (REQUIRED): This is the unique identification of the observational unit of interest. (Customer ID, Product ID, Student ID, etc.)
rank (REQUIRED): This is the numeric or ordered factor argument that defines the sequence of periods.
period (OPTIONAL): This is the "label" or more interpretative version of rank. Essentially just makes output pretty, if desired. (e.g., "October" is the period, 10 is the ranking number for October)
... (OPTIONAL): Variables to group_by in case a comparison of persistence rates across groups is desired.
overall (REQUIRED w/ DEFAULT): Logical variable to decide whether or not to include an "overall" persistence rate calculation.
calculate (REQUIRED w/ DEFAULT): Logical variable to decide whether to summarize the data into persistence rates, or to create an indicator variable denoting persistence.

Perceived Improvement Areas

Of course, any and all suggestions for ways to improve this function are greatly appreciated. I do, however, have some areas that I think could be improved; I'm just not sure how.

Grouping the Optional period Argument: In the section that describes what to do if calculate == TRUE, I had to create an if statement to group the variables differently depending on whether the period argument was supplied. Before, there was only one group_by argument, and if I explicitly called all of the arguments, the function would work great. But when I only called the first 3 required arguments, I would get an error. The current version works fine, but is there a better way to conditionally group optional variables?
Conditional overall Argument: In order to calculate the overall persistence, it seems like I have to repeat a lot of code, which could be computationally expensive, and is a little less easy to read than one continuous dplyr chain would be. Is there a more code-efficient way to calculate the overall rate?

What I've Already Tried

I tried to make things a little more efficient by creating the indicator variable first, whether or not calculate == TRUE. Then I just summarised the persistence_indicator by group. But when I used system.time() to compare performance before and after, my current function was more efficient in almost every combination of arguments. In retrospect, this makes sense: why create that variable if I don't need it when calculate == TRUE?

I also tried posting an earlier version of my function here on Code Review, just to be completely transparent. It didn't get much attention, which is probably fine since the function has changed so much. But I am still interested in general best practices for improving code, especially as it relates to conditionals.

Sample Data

dataFrame <- data.frame(id = as.character(c(1, 2, 3, 4, 1, 2, 3, 1, 2)), 
 period = c("A", "A", "A", "A", "B", "B", "B", "C", "C"), 
 rank = c(1, 1, 1, 1, 2, 2, 2, 3, 3), 
 group = c(1, 2, 1, 2, 1, 2, 1, 1, 2), 
 stringsAsFactors = FALSE)

The Function Code

persistence <- function(df, id, rank, period, ..., overall = TRUE, calculate = TRUE){
  stopifnot(!missing(df), !missing(id), !missing(rank))
  period_missing <- missing(period)

  enq_id <- enquo(id)
  enq_rank <- enquo(rank)
  enq_period <- enquo(period)
  enq_group_var <- quos(...)

  valid_rank_type <- is.numeric(rlang::eval_tidy(enq_rank, df)) | is.ordered(rlang::eval_tidy(enq_rank, df))


  if(!valid_rank_type){
    stop("Argument \"rank\" must be numeric or ordered factor")
  }

  if(is.logical(calculate)){
    calculate <- calculate
  } else {
    stop("Argument \"calculate\" must be logical (TRUE/FALSE)")
  }

  if(is.logical(overall)){
    overall <- overall
  } else{
    stop("Argument \"overall\" must be logical (TRUE/FALSE)")
  }
  df <- df %>%
    ungroup() %>%
    mutate(denseRank = dense_rank(UQ(enq_rank)))%>%
    group_by(UQ(enq_id))%>%
    arrange(denseRank)%>%
    mutate(nextrank = lead(denseRank))

  if(calculate == FALSE){

    out <- df %>%
      mutate(persistence_indicator = case_when(nextrank == denseRank + 1 ~ 1,
                                               TRUE ~ 0))%>%
      ungroup()%>%
      select(-nextrank, -denseRank)

    return(out)

  } else if (calculate == TRUE) {

    if(period_missing){
      out <- df %>%
        group_by(UQS(enq_group_var), UQ(enq_rank), denseRank)
    } else if(!period_missing){
      out <- df %>%
        group_by(UQS(enq_group_var), UQ(enq_rank), UQ(enq_period), denseRank)
    }


    out <- out %>%
      summarize(persistence_rate = sum(nextrank == (denseRank+1), na.rm = TRUE)/n(),
                count = n()) %>%
      ungroup()%>%
      filter(denseRank != max(denseRank))%>%
      arrange(denseRank) %>%
      select(-denseRank)

    if(overall == TRUE){
      total <- df %>%
        ungroup()%>%
        filter(denseRank != max(denseRank))%>%
        summarize(persistence_rate=sum(nextrank == (denseRank + 1), na.rm = TRUE)/n())%>%
        as.numeric()

      out <- out %>%
        mutate(overall = total)
    }
    return(out)
  }
}

Sample Function Call, Output, and sessionInfo()

library(dplyr)
persistence(df = dataFrame,
 id = id,
 rank = rank,
 period = period,
 group,
 overall = TRUE,
 calculate = TRUE)
# A tibble: 4 x 6
  group  rank period persistence_rate count   overall
  <dbl> <dbl> <chr>             <dbl> <int>     <dbl>
1     1     1 A                   1.0     2 0.7142857
2     2     1 A                   0.5     2 0.7142857
3     1     2 B                   0.5     2 0.7142857
4     2     2 B                   1.0     1 0.7142857
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C 
[5] LC_TIME=English_United States.1252 
attached base packages:
[1] stats graphics grDevices utils datasets methods base 
other attached packages:
[1] bindrcpp_0.2.2 dplyr_0.7.6 
loaded via a namespace (and not attached):
 [1] tidyselect_0.2.3 compiler_3.4.2 magrittr_1.5 assertthat_0.2.0 R6_2.2.2 
 [6] tools_3.4.2 glue_1.2.0 tibble_1.3.4 yaml_2.1.14 Rcpp_0.12.17 
[11] pkgconfig_2.0.1 rlang_0.2.1 purrr_0.2.4 bindr_0.1.1
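As a sanity check on the sample output above, the overall value can be recovered from the per-group rows as a count-weighted mean of the per-group rates. The data frame below is just the printed output retyped by hand; the names out and overall_check are illustrative and not part of the function.

```r
# Per-group rows copied by hand from the printed sample output
out <- data.frame(
  group            = c(1, 2, 1, 2),
  rank             = c(1, 1, 2, 2),
  persistence_rate = c(1.0, 0.5, 0.5, 1.0),
  count            = c(2L, 2L, 2L, 1L)
)

# Each group's rate is weighted by the size of its cohort
overall_check <- weighted.mean(out$persistence_rate, out$count)
overall_check  # 5/7, matching the "overall" column
```

This suggests the overall rate might be derived from the summarized table rather than by re-running the pipeline on the full data, provided count is kept in the summary.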

Final Note

The data I use interactively to test this function has about 15,000 rows, so when I mentioned performance above using system.time(), it was with much more data than the sample data I have provided. The sample data works just fine.

Question 2

What exactly do you want to improve in this function?

Question 3

I want to improve this function by using code that follows best practices. So, what are the best practices for incorporating optional arguments into a dplyr group_by() chain? And what are the best practices for incorporating a conditional statement, as directed by an argument? Does my code follow those practices, does it get close? Is there a reason I shouldn't use the method I did?

Answer

A few comments in no specific order (mostly from top of your code to bottom):

What you refer to as a conditional group_by can be done with group_by_at: build the set of grouping variables beforehand with an if statement, then use a single pipe chain that calls group_by_at. See also group_by_if.
Your code isn't commented at all (though the arguments are documented).
tidyverse functions that take ... usually give their other arguments names starting with a dot, so names passed through ... are less likely to conflict with them.
You call enquo() and quos() without a namespace prefix but call rlang::eval_tidy() with one, so it's not 100% consistent (unless it would conflict with another eval_tidy function?).
On the valid_rank_type <- ... line you should use || rather than |, unless you're comparing vectors or you want the right-hand side to be evaluated (and to fail if it returns an error) even when the left-hand side is already TRUE.
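Use if_else rather than case_when if you only have two cases: case_when(nextrank == denseRank + 1 ~ 1, TRUE ~ 0) becomes if_else(nextrank == denseRank + 1, 1, 0, missing = 0). The missing = 0 part matters here: nextrank is NA for an ID's last period, and without it if_else would return NA where case_when returns 0.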
I don't see the point of calculate <- calculate or overall <- overall; these self-assignments do nothing.
Instead of using if(calculate == FALSE) and if(calculate == TRUE) you can use if(!calculate) and if(calculate) (as you did with if(period_missing)).
It's generally good practice (though not a hard rule) not to call return() in the middle of a function when it can be avoided. In your case you could remove both calls and add out as the last line before exiting the function.
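sum(nextrank == (denseRank+1), na.rm = TRUE)/n() can be written with mean(), but not simply as mean(nextrank == (denseRank+1), na.rm = TRUE): na.rm = TRUE drops the NA rows from the denominator, whereas n() counts them, so the two differ whenever an ID has no later period. mean(coalesce(nextrank == (denseRank+1), FALSE)) gives the same result as your version.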
It's not very "tidyesque" to finish your pipe chain with as.numeric here. It does the job, but the function that would make sense to me here is dplyr::pull.
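Several of the points above (grouping once with group_by_at, replacing sum()/n() with an NA-safe mean(), and using pull()) can be combined in a rough sketch like the one below. It is a simplified variant, not a drop-in replacement: as an assumption made for brevity, column names are passed as strings rather than bare names, and persistence_sketch and its argument names are hypothetical.

```r
library(dplyr)

# Sample data from the question
dataFrame <- data.frame(
  id     = as.character(c(1, 2, 3, 4, 1, 2, 3, 1, 2)),
  period = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  rank   = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  group  = c(1, 2, 1, 2, 1, 2, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical simplified variant: columns are given as character strings
persistence_sketch <- function(df, id_col, rank_col,
                               period_col = NULL, group_cols = character()) {
  # Build the grouping columns once; c() silently drops a NULL period_col
  grouping <- c(group_cols, rank_col, period_col)

  ranked <- df %>%
    ungroup() %>%
    mutate(denseRank = dense_rank(.data[[rank_col]])) %>%
    group_by_at(id_col) %>%
    arrange(denseRank) %>%
    # NA (no later period for this ID) is treated as "did not persist"
    mutate(persisted = coalesce(lead(denseRank) == denseRank + 1, FALSE)) %>%
    ungroup() %>%
    # The last period has no successor, so it is excluded from the rates
    filter(denseRank != max(denseRank))

  # pull() extracts the single summary value as a plain numeric
  overall_rate <- ranked %>%
    summarize(rate = mean(persisted)) %>%
    pull(rate)

  ranked %>%
    group_by_at(grouping) %>%
    summarize(persistence_rate = mean(persisted), count = n()) %>%
    ungroup() %>%
    mutate(overall = overall_rate)
}

res_with_period    <- persistence_sketch(dataFrame, "id", "rank", "period", "group")
res_without_period <- persistence_sketch(dataFrame, "id", "rank")
```

On the sample data this gives the same rates, counts, and overall value as the original function, though the rows come back sorted by the grouping columns rather than by period. The coalesce() call is what keeps IDs with no later period in the denominator.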

moodymudskipper · Accepted Answer · 2018-09-14 08:13:21Z
