I have a typing log in which rows are grouped by consecutive package_name, and within those groups rows are grouped if they are logged within 180,000 ms of each other. This works, but is there a cleaner way to achieve it? An edge case would be when more than 1000 consecutive episodes are logged for the same package_name (though I think that is unlikely).
library(dplyr)

metrics <- data.frame(
  timestamp = c(seq(100000, 100200, by = 100),
                seq(200000, 200200, by = 100),
                seq(400300, 400304, by = 1),
                seq(600400, 600600, by = 100),
                seq(800700, 800900, by = 100)),
  package_name = c(rep("package1", 3), rep("package2", 3), rep("package2", 5),
                   rep("package2", 3), rep("package1", 3))
)

metrics <- metrics %>%
  # flag rows where the app changes, treating the leading NA as "no change"
  mutate(typing_episode = ifelse(package_name != lag(package_name), 1, 0),
         typing_episode = ifelse(is.na(typing_episode), 0, typing_episode),
         typing_episode = cumsum(typing_episode) + 1) %>%
  group_by(typing_episode) %>%
  # within an episode, start a new group after a gap of more than 3 minutes
  mutate(time_diff2 = timestamp - lag(timestamp),
         time_diff2 = ifelse(is.na(time_diff2), 0, time_diff2),
         second_group = ifelse(time_diff2 > (1000 * 60 * 3), 1, 0),
         second_group = cumsum(second_group) + 1,
         typing_episode2 = typing_episode * 1000 + second_group)
print(metrics)
1 Answer
In your code, you check for changes in a vector, or for differences between consecutive elements of a vector, by using lag and then cleaning up the NA value it introduces. When looking for changes, I find it cleaner to handle the first element separately, which lets you do the operation in a single line of code. For differences in timestamp, diff makes everything a lot cleaner:
metrics <- metrics %>%
  # start a new episode at row 1 and at every change of package_name
  mutate(typing_episode = cumsum(c(1, head(package_name, -1) != tail(package_name, -1)))) %>%
  group_by(typing_episode) %>%
  # within an episode, start a new group after a gap of more than 3 minutes
  mutate(second_group = cumsum(c(1, diff(timestamp) > 1000 * 60 * 3)),
         typing_episode2 = typing_episode * 1000 + second_group)
print(metrics)
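To see how the change-detection idiom works, here is a minimal sketch on a toy vector x (x is purely for illustration, not part of the original data). Comparing head(x, -1) with tail(x, -1) flags each position where a value differs from its predecessor, and cumsum turns those flags into run identifiers:

x <- c("a", "a", "b", "b", "a")          # toy vector for illustration
head(x, -1) != tail(x, -1)               # FALSE  TRUE FALSE  TRUE: TRUE marks each change
cumsum(c(1, head(x, -1) != tail(x, -1))) # 1 1 2 2 3: one id per run of equal values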
As you note, typing_episode2 might have repeats if second_group can exceed 1000. A reasonable alternative is something like typing_episode2 = paste0(typing_episode, "_", second_group). Then you won't need to worry about non-unique typing_episode2 values.
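A minimal sketch of that variant, reusing the pipeline above (only the last line changes; the resulting key is a character string such as "1_1" rather than a number):

metrics <- metrics %>%
  mutate(typing_episode = cumsum(c(1, head(package_name, -1) != tail(package_name, -1)))) %>%
  group_by(typing_episode) %>%
  mutate(second_group = cumsum(c(1, diff(timestamp) > 1000 * 60 * 3)),
         typing_episode2 = paste0(typing_episode, "_", second_group))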