
I have a typing log in which rows are grouped by consecutive package_name, and within those groups rows are further grouped if they are logged within 180,000 ms of each other. This works, but is there a cleaner way to achieve it? One edge case would be more than 1000 consecutive episodes logged for the same package_name (though I think that is unlikely).

library(dplyr)

metrics <- data.frame(
  timestamp = c(seq(100000, 100200, by = 100),
                seq(200000, 200200, by = 100),
                seq(400300, 400304, by = 1),
                seq(600400, 600600, by = 100),
                seq(800700, 800900, by = 100)),
  package_name = c(rep("package1", 3), rep("package2", 3), rep("package2", 5),
                   rep("package2", 3), rep("package1", 3))
)

metrics <- metrics %>%
  mutate(typing_episode = ifelse(package_name != lag(package_name), 1, 0),
         typing_episode = ifelse(is.na(typing_episode), 0, typing_episode),
         typing_episode = cumsum(typing_episode) + 1) %>%
  group_by(typing_episode) %>%
  mutate(time_diff2 = timestamp - lag(timestamp),
         time_diff2 = ifelse(is.na(time_diff2), 0, time_diff2),
         second_group = ifelse(time_diff2 > (1000 * 60 * 3), 1, 0),
         second_group = cumsum(second_group) + 1,
         typing_episode2 = typing_episode * 1000 + second_group)
print(metrics)
asked Jul 2, 2019 at 17:41

1 Answer


In your code, you check for changes in a vector, or for differences between consecutive elements, by using lag and then cleaning up the NA it introduces. When looking for changes, I find it cleaner to handle the first element separately, which lets you do the whole operation in a single line of code. For differences in timestamp, diff makes everything a lot cleaner:

metrics <- metrics %>%
  mutate(typing_episode = cumsum(c(1, head(package_name, -1) != tail(package_name, -1)))) %>%
  group_by(typing_episode) %>%
  mutate(second_group = cumsum(c(1, diff(timestamp) > 1000 * 60 * 3)),
         typing_episode2 = typing_episode * 1000 + second_group)
print(metrics)
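To see why the head/tail comparison works, here is a tiny standalone illustration (the vector x is made up for this example): dropping the last element and the first element, respectively, lines each value up against its successor, so the inequality flags every position where the value changes.

```r
x <- c("a", "a", "b", "b", "b", "a")

head(x, -1) != tail(x, -1)
# FALSE  TRUE FALSE FALSE  TRUE  -- TRUE wherever x changes to a new value

# Prepending 1 marks the first element as the start of group 1,
# and cumsum turns the change flags into group IDs:
cumsum(c(1, head(x, -1) != tail(x, -1)))
# 1 1 2 2 2 3
```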

As you note, typing_episode2 can repeat if second_group exceeds 1000. A reasonable alternative is something like typing_episode2 = paste0(typing_episode, "_", second_group); then you won't need to worry about non-unique typing_episode2 values at all.
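A minimal sketch of that variant, swapping the arithmetic combination for a string ID in the same pipeline as above:

```r
library(dplyr)

metrics <- metrics %>%
  mutate(typing_episode = cumsum(c(1, head(package_name, -1) != tail(package_name, -1)))) %>%
  group_by(typing_episode) %>%
  mutate(second_group = cumsum(c(1, diff(timestamp) > 1000 * 60 * 3)),
         # a character ID like "12_3" cannot collide, however large second_group grows
         typing_episode2 = paste0(typing_episode, "_", second_group)) %>%
  ungroup()
```

The trade-off is that typing_episode2 becomes a character column rather than numeric, which is usually fine for a grouping key but worth keeping in mind if downstream code sorts or does arithmetic on it.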

answered Jul 2, 2019 at 18:12