Split groups into unique sets

Question 1

I have a data.frame of people broken into households where I have a set of mostly unique keys (for household, and then person within household), but sometimes it looks like 2 (or possibly more) households were given the same key.

small_example <- tibble::tribble(
 ~hh_id, ~per_id, ~ref,
 1, 1, "a",
 1, 2, "b",
 1, 3, "c",
 2, 1, "d",
 2, 1, "e",
 2, 2, "f",
 2, 2, "g",
 2, 3, "h",
 2, 4, "i"
)

In this example, the first household is ok, but I want to split the second into 2 households randomly, but to preserve as much of the original structure as possible. In particular, I want persons "d" and "f" to stay together and "e"/"g" to stay together and then persons "h" and "i" to be added to one of these 2 households.

Here's my first attempt at this, but my code is pretty slow. I can tell I'm over-using the tidyverse, but am not sure what a better alternative is.

library(dplyr)
library(purrr)
assign_extra_id <- function(per_grp_id) {
 all_pers_df <- data_frame(orig_id = per_grp_id) %>%
 group_by(orig_id) %>%
 mutate(order_num = row_number(), total_num = n())
 max_dup_pers <- max(all_pers_df$total_num)
 if (max_dup_pers == 1) return(1)
 multi_pos_order <- accumulate(
 seq(max_dup_pers, 2, -1),
 ~sample(.x, .y - 1),
 .init = sample(seq_len(max_dup_pers), max_dup_pers)
 ) %>%
 set_names(seq(max_dup_pers, 1, -1)) %>%
 map_df(~data_frame(order_num = seq_along(.), new_id = .), .id = "total_num") %>%
 mutate(total_num = as.numeric(total_num))
 out <- left_join(all_pers_df, multi_pos_order, by = c("total_num", "order_num"))
 out$new_id
}
small_example %>%
 group_by(hh_id) %>%
 mutate(extra_id = assign_extra_id(per_id))
#> # A tibble: 9 x 4
#> # Groups: hh_id [2]
#> hh_id per_id ref extra_id
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 2 b 1
#> 3 1 3 c 1
#> 4 2 1 d 2
#> 5 2 1 e 1
#> 6 2 2 f 2
#> 7 2 2 g 1
#> 8 2 3 h 1
#> 9 2 4 i 1

And here's a timing:

bigger_example <- map_df(seq_len(100), ~mutate(small_example, hh_id = hh_id + (2 * .)))
microbenchmark::microbenchmark(
 my_attempt = bigger_example %>%
 group_by(hh_id) %>%
 mutate(extra_id = assign_extra_id(per_id)),
 times = 10
)
Unit: seconds
 expr min lq mean median uq max neval
 my_attempt 2.297449 2.305819 2.327998 2.312012 2.354128 2.381427 10

I want to randomly sample from the 4 sets of households I consider valid for what is currently hh_id = 2:

Set 1: hh1 = d, f, h, i; hh2 = e, g
Set 2: hh1 = d, f; hh2 = e, g, h, i
Set 3: hh1 = e, g; hh2 = d, f, h, i
Set 4: hh1 = e, g, h, i; hh2 = d, f

The logic for this when there are 3 households with the same hh_id gets even more complicated, because if there's only 1 person with a given person id, I want to sample from the households that were available when there were 2 (and so on). This is why there's the kind of hairy purrr::accumulate call.

Question 2

Yes, I think it is better to use efficient base functions for this kind of task. I'm assuming your ids are sorted so that duplicates will always be next to each other, is that right? In that case, a first idea would be to use the duplicated function. Using your example, see that:

x <- c(1, 1, 2, 2, 3, 4)
duplicated(x)
# [1] FALSE TRUE FALSE TRUE FALSE FALSE
1 + duplicated(x)
# [1] 1 2 1 2 1 1

However, this will not work in your more general case where there may be more than 2 duplicated households:

x <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5)
1 + duplicated(x)
# [1] 1 2 2 1 2 2 1 2 1 1

For the general case, I would use run length encoding functions:

rle(x)
# Run Length Encoding
# lengths: int [1:5] 3 3 2 1 1
# values : num [1:5] 1 2 3 4 5
sequence(rle(x)$lengths)
# [1] 1 2 3 1 2 3 1 2 1 1

So I would suggest you replace your assign_extra_id with the following:

assign_extra_id2 <- function(x) sequence(rle(x)$lengths)

Update

Taking into account the added details (in the comments and updated question), maybe this function will do?

assign_extra_id3 <- function(x) {
 resample <- function(x, ...) x[sample.int(length(x), ...)]
 z <- rle(rle(x)$lengths)
 s <- sample(z$values[1])
 i <- Map(function(l,v) rep(resample(s[1:v]),l), z$lengths, z$values)
 unlist(i, use.names = FALSE)
}

It's (still) a nice speedup:

# Unit: milliseconds
# expr min lq mean median uq max neval
 # op_attempt 5013.47742 5051.44837 6198.35973 5535.9724 6951.75439 9843.3606 10
 # fd_attempt 54.48751 59.92706 73.31207 62.1226 82.82826 130.2046 10

Question 3

Thanks for taking a look, this is interesting - I hadn't seen rle before. However, this isn't what I want - this is equivalent to `bigger_example %>% group_by(hh_id, per_id) %>% mutate(extra_id = row_number()). However, I want a random sample of the 4 households I consider valid. I'll try to update my question to make it clearer.

Question 4

@GregF, can you please try assign_extra_id3 <- function(x) unlist(Map(function(z) sample(length(z)), split(x, x)), use.names = FALSE)

Question 5

no, that's still not quite right. It doesn't keep person d/f and e/g together.

set.seed(123); test <- small_example %>% group_by(hh_id) %>% mutate(extra_id = assign_extra_id3(per_id)); test$extra_id[test$ref %in% c("d", "f")]

gives 2, 1 but I want them to always get the same household.

Question 6

@GregF. Ok... One more try... See if my updated answer does the job?

flodel flodel 3,5551 gold badge16 silver badges15 bronze badges · Accepted Answer · 2017-12-08 21:10:38Z

Yes, I think it is better to use efficient base functions for this kind of task. I'm assuming your ids are sorted so that duplicates will always be next to each other, is that right? In that case, a first idea would be to use the duplicated function. Using your example, see that:

x <- c(1, 1, 2, 2, 3, 4)
duplicated(x)
# [1] FALSE TRUE FALSE TRUE FALSE FALSE
1 + duplicated(x)
# [1] 1 2 1 2 1 1

However, this will not work in your more general case where there may be more than 2 duplicated households:

x <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5)
1 + duplicated(x)
# [1] 1 2 2 1 2 2 1 2 1 1

For the general case, I would use run length encoding functions:

rle(x)
# Run Length Encoding
# lengths: int [1:5] 3 3 2 1 1
# values : num [1:5] 1 2 3 4 5
sequence(rle(x)$lengths)
# [1] 1 2 3 1 2 3 1 2 1 1

So I would suggest you replace your assign_extra_id with the following:

assign_extra_id2 <- function(x) sequence(rle(x)$lengths)

Update

Taking into account the added details (in the comments and updated question), maybe this function will do?

assign_extra_id3 <- function(x) {
 resample <- function(x, ...) x[sample.int(length(x), ...)]
 z <- rle(rle(x)$lengths)
 s <- sample(z$values[1])
 i <- Map(function(l,v) rep(resample(s[1:v]),l), z$lengths, z$values)
 unlist(i, use.names = FALSE)
}

It's (still) a nice speedup:

# Unit: milliseconds
# expr min lq mean median uq max neval
 # op_attempt 5013.47742 5051.44837 6198.35973 5535.9724 6951.75439 9843.3606 10
 # fd_attempt 54.48751 59.92706 73.31207 62.1226 82.82826 130.2046 10

Thanks for taking a look, this is interesting - I hadn't seen rle before. However, this isn't what I want - this is equivalent to `bigger_example %>% group_by(hh_id, per_id) %>% mutate(extra_id = row_number()). However, I want a random sample of the 4 households I consider valid. I'll try to update my question to make it clearer.
@GregF, can you please try assign_extra_id3 <- function(x) unlist(Map(function(z) sample(length(z)), split(x, x)), use.names = FALSE)
no, that's still not quite right. It doesn't keep person d/f and e/g together. set.seed(123); test <- small_example %>% group_by(hh_id) %>% mutate(extra_id = assign_extra_id3(per_id)); test$extra_id[test$ref %in% c("d", "f")] gives 2, 1 but I want them to always get the same household.
@GregF. Ok... One more try... See if my updated answer does the job?

Stack Exchange Network

Split groups into unique sets

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Split groups into unique sets

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions