I have a typing log in which rows are grouped by consecutive package_name, and within those groups rows are grouped if they are logged within 180,000 ms of each other. This works, but is there a cleaner way to achieve it? An edge case would be when more than 1000 consecutive episodes are logged for the same package_name (though I think that is unlikely).
library(dplyr)

metrics <- data.frame(
  timestamp = c(seq(100000, 100200, by = 100),
                seq(200000, 200200, by = 100),
                seq(400300, 400304, by = 1),
                seq(600400, 600600, by = 100),
                seq(800700, 800900, by = 100)),
  package_name = c(rep("package1", 3), rep("package2", 3), rep("package2", 5),
                   rep("package2", 3), rep("package1", 3))
)

metrics <- metrics %>%
  # flag rows where the app changes, treating the leading NA as "no change"
  mutate(typing_episode = ifelse(package_name != lag(package_name), 1, 0),
         typing_episode = ifelse(is.na(typing_episode), 0, typing_episode),
         typing_episode = cumsum(typing_episode) + 1) %>%
  group_by(typing_episode) %>%
  # within an episode, start a new group after a gap of more than 3 minutes
  mutate(time_diff2 = timestamp - lag(timestamp),
         time_diff2 = ifelse(is.na(time_diff2), 0, time_diff2),
         second_group = ifelse(time_diff2 > (1000 * 60 * 3), 1, 0),
         second_group = cumsum(second_group) + 1,
         typing_episode2 = typing_episode * 1000 + second_group)
print(metrics)
1 Answer
In your code, you check for changes in a vector, or for differences between consecutive elements of a vector, by using lag and then cleaning up the NA value it introduces. When looking for changes, I find it cleaner to handle the first element separately, which lets you do the operation in a single line of code. For differences in timestamp, diff makes everything a lot cleaner:
metrics <- metrics %>%
  # start a new episode at row 1 and at every change of package_name
  mutate(typing_episode = cumsum(c(1, head(package_name, -1) != tail(package_name, -1)))) %>%
  group_by(typing_episode) %>%
  # within an episode, start a new group after a gap of more than 3 minutes
  mutate(second_group = cumsum(c(1, diff(timestamp) > 1000 * 60 * 3)),
         typing_episode2 = typing_episode * 1000 + second_group)
print(metrics)
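To see how the change-detection idiom works, here is a minimal sketch on a toy vector x (x is purely for illustration, not part of the original data). Comparing head(x, -1) with tail(x, -1) flags each position where a value differs from its predecessor, and cumsum turns those flags into run identifiers:

x <- c("a", "a", "b", "b", "a")          # toy vector for illustration
head(x, -1) != tail(x, -1)               # FALSE  TRUE FALSE  TRUE: TRUE marks each change
cumsum(c(1, head(x, -1) != tail(x, -1))) # 1 1 2 2 3: one id per run of equal values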
As you note, typing_episode2 might have repeats if second_group can exceed 1000. A reasonable alternative is something like typing_episode2 = paste0(typing_episode, "_", second_group). Then you won't need to worry about non-unique typing_episode2 values.
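A minimal sketch of that variant, reusing the pipeline above (only the last line changes; the resulting key is a character string such as "1_1" rather than a number):

metrics <- metrics %>%
  mutate(typing_episode = cumsum(c(1, head(package_name, -1) != tail(package_name, -1)))) %>%
  group_by(typing_episode) %>%
  mutate(second_group = cumsum(c(1, diff(timestamp) > 1000 * 60 * 3)),
         typing_episode2 = paste0(typing_episode, "_", second_group))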