The Function
"Persistence" is sometimes also referred to as "retention". It is defined as the share of units (IDs) in a given term/period that are also found in the subsequent term/period. So, if I have 10 customers in period 1, and 3 of those customers return in period 2, my persistence rate is 30%.
I have written a function that will either:
- Calculate the persistence rate for each period's cohort of IDs if calculate = TRUE.
- Create an indicator variable on the original dataframe that identifies whether the ID persisted (1) or not (0), if calculate = FALSE.
Furthermore, if overall = TRUE when calculate = TRUE, it will include the persistence rate over all of the terms.
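As a rough illustration of the definition, the 30% example can be reproduced in a few lines of base R. The ID vectors below are made up for this sketch and are not part of the function.

```r
# Hypothetical customer IDs: 10 customers in period 1
period_1 <- 1:10
# Period 2 holds 3 returning customers (2, 5, 9) plus one new customer (11)
period_2 <- c(2, 5, 9, 11)

# Share of period-1 IDs that also appear in period 2
persistence_rate <- mean(period_1 %in% period_2)

# Per-ID indicator (1 = persisted, 0 = did not), the idea behind calculate = FALSE
persistence_indicator <- as.integer(period_1 %in% period_2)
```

New IDs in period 2 do not affect the rate, because only the period-1 cohort is in the denominator.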
The Arguments
Here is a brief description about each of the arguments:
- df (REQUIRED): This is the dataframe argument, and a dataframe should be passed to this argument.
- id (REQUIRED): This is the unique identification of the observational unit of interest. (Customer ID, Product ID, Student ID, etc.)
- rank (REQUIRED): This is the numeric or ordered factor argument that defines the sequence of periods.
- period (OPTIONAL): This is the "label" or more interpretative version of rank. Essentially just makes output pretty, if desired. (e.g., "October" is the period, 10 is the ranking number for October)
- ... (OPTIONAL): Variables to group_by in case a comparison of persistence rates across groups is desired.
- overall (REQUIRED w/ DEFAULT): Logical variable to decide whether or not to include an "overall" persistence rate calculation.
- calculate (REQUIRED w/ DEFAULT): Logical variable to decide whether to summarize the data into persistence rates, or to create an indicator variable denoting persistence.
Perceived Improvement Areas
Of course, any and all suggestions for ways to improve this function are greatly appreciated. I do, however, have some areas that I think could be improved; I'm just not sure how.
- Grouping the Optional period Argument: In the section that describes what to do if calculate == TRUE, I had to create an if statement to group the variables differently depending on whether the period argument was supplied. Before, there was only one group_by argument, and if I explicitly called all of the arguments, the function would work great. But when I only called the first 3 required arguments, I would get an error. The current version works fine, but is there a better way to conditionally group optional variables?
- Conditional overall Argument: In order to calculate the overall persistence, it seems like I have to repeat a lot of code, which could be computationally expensive, and is a little less easy to read than one continuous dplyr chain would be. Is there a more code-efficient way to calculate the overall rate?
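One possible way to avoid the second pass for the overall rate, sketched here under the assumption that the per-group summary keeps both persistence_rate and count, is to take the count-weighted mean of the group rates. The numbers are the four rows the function prints for the sample data; summary_rows is a made-up name used only for this sketch.

```r
# Per-group rows as printed for the sample data (last period already dropped)
summary_rows <- data.frame(
  group            = c(1, 2, 1, 2),
  rank             = c(1, 1, 2, 2),
  persistence_rate = c(1.0, 0.5, 0.5, 1.0),
  count            = c(2L, 2L, 2L, 1L)
)

# Total persisted / total eligible rows equals the count-weighted mean of the rates
overall <- weighted.mean(summary_rows$persistence_rate, summary_rows$count)

# Attach it as a constant column, mirroring the function's "overall" column
summary_rows$overall <- overall
```

Every eligible row falls into exactly one group, so the weighted mean is (2 + 1 + 1 + 1) / 7, the same 0.7142857 shown in the overall column of the sample output.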
What I've Already Tried
I tried to make things a little more efficient by creating the indicator variable first, whether or not calculate == TRUE. Then I just summarised the persistence_indicator by group. But when I used system.time() to compare performance before and after, my current function was more efficient in almost every combination of arguments. In retrospect, this makes sense: why create that variable if I don't need it when calculate == TRUE?
I also tried posting an earlier version of my function here on Code Review, just to be completely transparent. It didn't get much attention, which is probably fine since the function has changed so much. But I am still interested in general best practices for improving code, especially as it relates to conditionals.
Sample Data
dataFrame <- data.frame(id = as.character(c(1, 2, 3, 4, 1, 2, 3, 1, 2)),
                        period = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
                        rank = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                        group = c(1, 2, 1, 2, 1, 2, 1, 1, 2),
                        stringsAsFactors = FALSE)
The Function Code
persistence <- function(df, id, rank, period, ..., overall = TRUE, calculate = TRUE){
  stopifnot(!missing(df), !missing(id), !missing(rank))
  period_missing <- missing(period)
  enq_id <- enquo(id)
  enq_rank <- enquo(rank)
  enq_period <- enquo(period)
  enq_group_var <- quos(...)
  valid_rank_type <- is.numeric(rlang::eval_tidy(enq_rank, df)) | is.ordered(rlang::eval_tidy(enq_rank, df))
  if(!valid_rank_type){
    stop("Argument \"rank\" must be numeric or ordered factor")
  }
  if(is.logical(calculate)){
    calculate <- calculate
  } else {
    stop("Argument \"calculate\" must be logical (TRUE/FALSE)")
  }
  if(is.logical(overall)){
    overall <- overall
  } else{
    stop("Argument \"overall\" must be logical (TRUE/FALSE)")
  }
  df <- df %>%
    ungroup() %>%
    mutate(denseRank = dense_rank(UQ(enq_rank)))%>%
    group_by(UQ(enq_id))%>%
    arrange(denseRank)%>%
    mutate(nextrank = lead(denseRank))
  if(calculate == FALSE){
    out <- df %>%
      mutate(persistence_indicator = case_when(nextrank == denseRank + 1 ~ 1,
                                               TRUE ~ 0))%>%
      ungroup()%>%
      select(-nextrank, -denseRank)
    return(out)
  } else if (calculate == TRUE) {
    if(period_missing){
      out <- df %>%
        group_by(UQS(enq_group_var), UQ(enq_rank), denseRank)
    } else if(!period_missing){
      out <- df %>%
        group_by(UQS(enq_group_var), UQ(enq_rank), UQ(enq_period), denseRank)
    }
    out <- out %>%
      summarize(persistence_rate = sum(nextrank == (denseRank+1), na.rm = TRUE)/n(),
                count = n()) %>%
      ungroup()%>%
      filter(denseRank != max(denseRank))%>%
      arrange(denseRank) %>%
      select(-denseRank)
    if(overall == TRUE){
      total <- df %>%
        ungroup()%>%
        filter(denseRank != max(denseRank))%>%
        summarize(persistence_rate=sum(nextrank == (denseRank + 1), na.rm = TRUE)/n())%>%
        as.numeric()
      out <- out %>%
        mutate(overall = total)
    }
    return(out)
  }
}
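To make the core step easier to follow in isolation, here is a stripped-down sketch of the dense_rank() plus lead() logic on a tiny made-up data frame. The visits data and the persisted column are hypothetical and only for illustration; the ranks deliberately contain gaps to show why dense ranking is applied first.

```r
library(dplyr)

# Hypothetical visits: ID "a" appears in every period, ID "b" skips the middle one
visits <- data.frame(id   = c("a", "a", "a", "b", "b"),
                     rank = c(10, 20, 40, 10, 40),
                     stringsAsFactors = FALSE)

flagged <- visits %>%
  mutate(denseRank = dense_rank(rank)) %>%   # 10, 20, 40 become 1, 2, 3
  arrange(id, denseRank) %>%
  group_by(id) %>%
  mutate(nextrank  = lead(denseRank),        # NA on each ID's last row
         persisted = !is.na(nextrank) & nextrank == denseRank + 1) %>%
  ungroup()
```

ID "a" persists from the first and second periods, while ID "b" does not persist from the first period because its next appearance is two periods later.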
Sample Function Call, Output, and sessionInfo()
library(dplyr)
persistence(df = dataFrame,
            id = id,
            rank = rank,
            period = period,
            group,
            overall = TRUE,
            calculate = TRUE)
# A tibble: 4 x 6
  group  rank period persistence_rate count   overall
  <dbl> <dbl> <chr>             <dbl> <int>     <dbl>
1     1     1 A                   1.0     2 0.7142857
2     2     1 A                   0.5     2 0.7142857
3     1     2 B                   0.5     2 0.7142857
4     2     2 B                   1.0     1 0.7142857
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 dplyr_0.7.6
loaded via a namespace (and not attached):
[1] tidyselect_0.2.3 compiler_3.4.2 magrittr_1.5 assertthat_0.2.0 R6_2.2.2
[6] tools_3.4.2 glue_1.2.0 tibble_1.3.4 yaml_2.1.14 Rcpp_0.12.17
[11] pkgconfig_2.0.1 rlang_0.2.1 purrr_0.2.4 bindr_0.1.1
Final Note
The data I use interactively to test this function has about 15,000 rows, so when I mentioned performance above using system.time(), it was with much more data than the sample data I have provided. The sample data works just fine.
1 Answer
A few comments in no specific order (mostly from top of your code to bottom):
- What you refer to as conditional group_by can be done with group_by_at. Define your grouping variables beforehand with an if call, then have a single pipe chain using group_by_at. See also group_by_if.
- Your code isn't commented at all (though the arguments are documented).
- tidyverse functions usually use argument names that start with a dot when they contain a ..., so there are fewer chances of argument conflicts.
- You call quo and not rlang::quo but you call rlang::eval_tidy, so it's not 100% consistent (unless it would conflict with another eval_tidy function?).
- On the valid_rank_type <- ... line you should use || unless you're comparing vectors or you want it to fail if the rhs returns an error and the lhs is TRUE.
- Use if_else rather than case_when if you only have 2 cases: case_when(nextrank == denseRank + 1 ~ 1, TRUE ~ 0) becomes if_else(nextrank == denseRank + 1, 1, 0, missing = 0). The missing = 0 is needed because nextrank is NA on each ID's last row, which case_when sends to the TRUE ~ 0 branch but a plain if_else would return as NA.
- I don't see the point of calculate <- calculate or overall <- overall.
- Instead of using if(calculate == FALSE) and if(calculate == TRUE) you can use if(!calculate) and if(calculate) (as you did with if(period_missing)).
- It's generally good practice (though not a hard rule) not to use return calls in the middle of the code when they can be avoided. In your case you could remove them and add out as the last line before exiting the function.
- sum(nextrank == (denseRank+1), na.rm = TRUE)/n() is close to mean(nextrank == (denseRank+1), na.rm = TRUE), but the two are not identical here: mean with na.rm = TRUE also drops the NA rows from the denominator. mean(!is.na(nextrank) & nextrank == denseRank + 1) gives the same result as the original expression.
- It's not very "tidyesque" to finish your pipe chain with as.numeric here; though it does the job, the function that would make sense to me here is dplyr::pull.
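The NA handling behind the if_else and mean suggestions is easy to get wrong, so here is a small sketch of how the variants behave when nextrank is missing. The vectors are made up for illustration (one ID seen in periods 1, 2 and 4); only case_when, if_else and lead come from dplyr.

```r
library(dplyr)

# Hypothetical dense ranks for one ID seen in periods 1, 2 and 4
denseRank <- c(1, 2, 4)
nextrank  <- lead(denseRank)               # 2, 4, NA

persisted <- nextrank == denseRank + 1     # TRUE, FALSE, NA

# case_when() treats the NA condition as "no match" and falls through to TRUE ~ 0
ind_case_when <- case_when(persisted ~ 1, TRUE ~ 0)

# if_else() keeps the NA unless a value for missing is supplied
ind_if_else         <- if_else(persisted, 1, 0)
ind_if_else_missing <- if_else(persisted, 1, 0, missing = 0)

# sum()/n keeps the NA row in the denominator; mean(na.rm = TRUE) drops it
rate_sum  <- sum(persisted, na.rm = TRUE) / length(persisted)
rate_mean <- mean(persisted, na.rm = TRUE)

# Treating NA as "did not persist" matches the sum()/n version
rate_safe <- mean(!is.na(persisted) & persisted)
```

Here rate_sum and rate_safe count the last row as a non-persisting observation, while rate_mean ignores it, so which one to use depends on whether an ID's final row should count against the rate.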
dplyr group_by() chain? And what are the best practices for incorporating a conditional statement, as directed by an argument? Does my code follow those practices, does it get close? Is there a reason I shouldn't use the method I did?
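One way to keep a single chain while still grouping conditionally is to build the grouping columns first and hand them to group_by_at. The sketch below is a simplification, not the original function: count_by is a hypothetical helper, and it takes column names as strings rather than unquoted arguments, so the quosure handling is left out.

```r
library(dplyr)

# Sample data from the question
dataFrame <- data.frame(id = as.character(c(1, 2, 3, 4, 1, 2, 3, 1, 2)),
                        period = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
                        rank = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                        group = c(1, 2, 1, 2, 1, 2, 1, 1, 2),
                        stringsAsFactors = FALSE)

# Hypothetical helper: period_col is optional and defaults to NULL
count_by <- function(df, rank_col, period_col = NULL) {
  # c() silently drops NULL, so no if/else branch is needed in the pipe
  group_cols <- c(rank_col, period_col)
  df %>%
    group_by_at(group_cols) %>%
    summarise(count = n()) %>%
    ungroup()
}

by_rank <- count_by(dataFrame, "rank")
by_both <- count_by(dataFrame, "rank", "period")
```

The conditional logic lives in one place (the construction of group_cols), and the pipe itself is the same whether or not the optional column is supplied.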