I have a data.frame of people broken into households where I have a set of mostly unique keys (for household, and then person within household), but sometimes it looks like 2 (or possibly more) households were given the same key.
small_example <- tibble::tribble(
~hh_id, ~per_id, ~ref,
1, 1, "a",
1, 2, "b",
1, 3, "c",
2, 1, "d",
2, 1, "e",
2, 2, "f",
2, 2, "g",
2, 3, "h",
2, 4, "i"
)
In this example, the first household is ok, but I want to split the second into 2 households randomly, but to preserve as much of the original structure as possible. In particular, I want persons "d" and "f" to stay together and "e"/"g" to stay together and then persons "h" and "i" to be added to one of these 2 households.
Here's my first attempt at this, but my code is pretty slow. I can tell I'm over-using the tidyverse, but am not sure what a better alternative is.
library(dplyr)
library(purrr)
assign_extra_id <- function(per_grp_id) {
all_pers_df <- data_frame(orig_id = per_grp_id) %>%
group_by(orig_id) %>%
mutate(order_num = row_number(), total_num = n())
max_dup_pers <- max(all_pers_df$total_num)
if (max_dup_pers == 1) return(1)
multi_pos_order <- accumulate(
seq(max_dup_pers, 2, -1),
~sample(.x, .y - 1),
.init = sample(seq_len(max_dup_pers), max_dup_pers)
) %>%
set_names(seq(max_dup_pers, 1, -1)) %>%
map_df(~data_frame(order_num = seq_along(.), new_id = .), .id = "total_num") %>%
mutate(total_num = as.numeric(total_num))
out <- left_join(all_pers_df, multi_pos_order, by = c("total_num", "order_num"))
out$new_id
}
small_example %>%
group_by(hh_id) %>%
mutate(extra_id = assign_extra_id(per_id))
#> # A tibble: 9 x 4
#> # Groups: hh_id [2]
#> hh_id per_id ref extra_id
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 2 b 1
#> 3 1 3 c 1
#> 4 2 1 d 2
#> 5 2 1 e 1
#> 6 2 2 f 2
#> 7 2 2 g 1
#> 8 2 3 h 1
#> 9 2 4 i 1
And here's a timing:
bigger_example <- map_df(seq_len(100), ~mutate(small_example, hh_id = hh_id + (2 * .)))
microbenchmark::microbenchmark(
my_attempt = bigger_example %>%
group_by(hh_id) %>%
mutate(extra_id = assign_extra_id(per_id)),
times = 10
)
Unit: seconds
expr min lq mean median uq max neval
my_attempt 2.297449 2.305819 2.327998 2.312012 2.354128 2.381427 10
I want to randomly sample from the 4 sets of households I consider valid for what is currently hh_id = 2
:
Set 1: hh1 = d, f, h, i; hh2 = e, g
Set 2: hh1 = d, f; hh2 = e, g, h, i
Set 3: hh1 = e, g; hh2 = d, f, h, i
Set 4: hh1 = e, g, h, i; hh2 = d, f
The logic for this when there are 3 households with the same hh_id
gets even more complicated, because if there's only 1 person with a given person id, I want to sample from the households that were available when there were 2 (and so on). This is why there's the kind of hairy purrr::accumulate
call.
1 Answer 1
Yes, I think it is better to use efficient base functions for this kind of task. I'm assuming your ids are sorted so that duplicates will always be next to each other, is that right? In that case, a first idea would be to use the duplicated
function. Using your example, see that:
x <- c(1, 1, 2, 2, 3, 4)
duplicated(x)
# [1] FALSE TRUE FALSE TRUE FALSE FALSE
1 + duplicated(x)
# [1] 1 2 1 2 1 1
However, this will not work in your more general case where there may be more than 2 duplicated households:
x <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5)
1 + duplicated(x)
# [1] 1 2 2 1 2 2 1 2 1 1
For the general case, I would use run length encoding functions:
rle(x)
# Run Length Encoding
# lengths: int [1:5] 3 3 2 1 1
# values : num [1:5] 1 2 3 4 5
sequence(rle(x)$lengths)
# [1] 1 2 3 1 2 3 1 2 1 1
So I would suggest you replace your assign_extra_id
with the following:
assign_extra_id2 <- function(x) sequence(rle(x)$lengths)
Update
Taking into account the added details (in the comments and updated question), maybe this function will do?
assign_extra_id3 <- function(x) {
resample <- function(x, ...) x[sample.int(length(x), ...)]
z <- rle(rle(x)$lengths)
s <- sample(z$values[1])
i <- Map(function(l,v) rep(resample(s[1:v]),l), z$lengths, z$values)
unlist(i, use.names = FALSE)
}
It's (still) a nice speedup:
# Unit: milliseconds
# expr min lq mean median uq max neval
# op_attempt 5013.47742 5051.44837 6198.35973 5535.9724 6951.75439 9843.3606 10
# fd_attempt 54.48751 59.92706 73.31207 62.1226 82.82826 130.2046 10
-
\$\begingroup\$ Thanks for taking a look, this is interesting - I hadn't seen rle before. However, this isn't what I want - this is equivalent to `bigger_example %>% group_by(hh_id, per_id) %>% mutate(extra_id = row_number()). However, I want a random sample of the 4 households I consider valid. I'll try to update my question to make it clearer. \$\endgroup\$GregF– GregF2017年12月10日 21:37:18 +00:00Commented Dec 10, 2017 at 21:37
-
\$\begingroup\$ @GregF, can you please try
assign_extra_id3 <- function(x) unlist(Map(function(z) sample(length(z)), split(x, x)), use.names = FALSE)
\$\endgroup\$flodel– flodel2017年12月10日 22:41:07 +00:00Commented Dec 10, 2017 at 22:41 -
\$\begingroup\$ no, that's still not quite right. It doesn't keep person d/f and e/g together.
set.seed(123); test <- small_example %>% group_by(hh_id) %>% mutate(extra_id = assign_extra_id3(per_id)); test$extra_id[test$ref %in% c("d", "f")]
gives 2, 1 but I want them to always get the same household. \$\endgroup\$GregF– GregF2017年12月11日 15:02:08 +00:00Commented Dec 11, 2017 at 15:02 -
\$\begingroup\$ @GregF. Ok... One more try... See if my updated answer does the job? \$\endgroup\$flodel– flodel2017年12月12日 00:54:23 +00:00Commented Dec 12, 2017 at 0:54