1
\$\begingroup\$

I have a data.frame of people broken into households where I have a set of mostly unique keys (for household, and then person within household), but sometimes it looks like 2 (or possibly more) households were given the same key.

small_example <- tibble::tribble(
 ~hh_id, ~per_id, ~ref,
 1, 1, "a",
 1, 2, "b",
 1, 3, "c",
 2, 1, "d",
 2, 1, "e",
 2, 2, "f",
 2, 2, "g",
 2, 3, "h",
 2, 4, "i"
)

In this example, the first household is ok, but I want to split the second into 2 households randomly, but to preserve as much of the original structure as possible. In particular, I want persons "d" and "f" to stay together and "e"/"g" to stay together and then persons "h" and "i" to be added to one of these 2 households.

Here's my first attempt at this, but my code is pretty slow. I can tell I'm over-using the tidyverse, but am not sure what a better alternative is.

library(dplyr)
library(purrr)
assign_extra_id <- function(per_grp_id) {
 all_pers_df <- data_frame(orig_id = per_grp_id) %>%
 group_by(orig_id) %>%
 mutate(order_num = row_number(), total_num = n())
 max_dup_pers <- max(all_pers_df$total_num)
 if (max_dup_pers == 1) return(1)
 multi_pos_order <- accumulate(
 seq(max_dup_pers, 2, -1),
 ~sample(.x, .y - 1),
 .init = sample(seq_len(max_dup_pers), max_dup_pers)
 ) %>%
 set_names(seq(max_dup_pers, 1, -1)) %>%
 map_df(~data_frame(order_num = seq_along(.), new_id = .), .id = "total_num") %>%
 mutate(total_num = as.numeric(total_num))
 out <- left_join(all_pers_df, multi_pos_order, by = c("total_num", "order_num"))
 out$new_id
}
small_example %>%
 group_by(hh_id) %>%
 mutate(extra_id = assign_extra_id(per_id))
#> # A tibble: 9 x 4
#> # Groups: hh_id [2]
#> hh_id per_id ref extra_id
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 2 b 1
#> 3 1 3 c 1
#> 4 2 1 d 2
#> 5 2 1 e 1
#> 6 2 2 f 2
#> 7 2 2 g 1
#> 8 2 3 h 1
#> 9 2 4 i 1

And here's a timing:

bigger_example <- map_df(seq_len(100), ~mutate(small_example, hh_id = hh_id + (2 * .)))
microbenchmark::microbenchmark(
 my_attempt = bigger_example %>%
 group_by(hh_id) %>%
 mutate(extra_id = assign_extra_id(per_id)),
 times = 10
)
Unit: seconds
 expr min lq mean median uq max neval
 my_attempt 2.297449 2.305819 2.327998 2.312012 2.354128 2.381427 10

I want to randomly sample from the 4 sets of households I consider valid for what is currently hh_id = 2:

Set 1: hh1 = d, f, h, i; hh2 = e, g
Set 2: hh1 = d, f; hh2 = e, g, h, i
Set 3: hh1 = e, g; hh2 = d, f, h, i
Set 4: hh1 = e, g, h, i; hh2 = d, f

The logic for this when there are 3 households with the same hh_id gets even more complicated, because if there's only 1 person with a given person id, I want to sample from the households that were available when there were 2 (and so on). This is why there's the kind of hairy purrr::accumulate call.

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Dec 8, 2017 at 18:00
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

Yes, I think it is better to use efficient base functions for this kind of task. I'm assuming your ids are sorted so that duplicates will always be next to each other, is that right? In that case, a first idea would be to use the duplicated function. Using your example, see that:

x <- c(1, 1, 2, 2, 3, 4)
duplicated(x)
# [1] FALSE TRUE FALSE TRUE FALSE FALSE
1 + duplicated(x)
# [1] 1 2 1 2 1 1

However, this will not work in your more general case where there may be more than 2 duplicated households:

x <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5)
1 + duplicated(x)
# [1] 1 2 2 1 2 2 1 2 1 1

For the general case, I would use run length encoding functions:

rle(x)
# Run Length Encoding
# lengths: int [1:5] 3 3 2 1 1
# values : num [1:5] 1 2 3 4 5
sequence(rle(x)$lengths)
# [1] 1 2 3 1 2 3 1 2 1 1

So I would suggest you replace your assign_extra_id with the following:

assign_extra_id2 <- function(x) sequence(rle(x)$lengths)

Update

Taking into account the added details (in the comments and updated question), maybe this function will do?

assign_extra_id3 <- function(x) {
 resample <- function(x, ...) x[sample.int(length(x), ...)]
 z <- rle(rle(x)$lengths)
 s <- sample(z$values[1])
 i <- Map(function(l,v) rep(resample(s[1:v]),l), z$lengths, z$values)
 unlist(i, use.names = FALSE)
}

It's (still) a nice speedup:

# Unit: milliseconds
# expr min lq mean median uq max neval
 # op_attempt 5013.47742 5051.44837 6198.35973 5535.9724 6951.75439 9843.3606 10
 # fd_attempt 54.48751 59.92706 73.31207 62.1226 82.82826 130.2046 10
answered Dec 8, 2017 at 21:10
\$\endgroup\$
4
  • \$\begingroup\$ Thanks for taking a look, this is interesting - I hadn't seen rle before. However, this isn't what I want - this is equivalent to `bigger_example %>% group_by(hh_id, per_id) %>% mutate(extra_id = row_number()). However, I want a random sample of the 4 households I consider valid. I'll try to update my question to make it clearer. \$\endgroup\$ Commented Dec 10, 2017 at 21:37
  • \$\begingroup\$ @GregF, can you please try assign_extra_id3 <- function(x) unlist(Map(function(z) sample(length(z)), split(x, x)), use.names = FALSE) \$\endgroup\$ Commented Dec 10, 2017 at 22:41
  • \$\begingroup\$ no, that's still not quite right. It doesn't keep person d/f and e/g together. set.seed(123); test <- small_example %>% group_by(hh_id) %>% mutate(extra_id = assign_extra_id3(per_id)); test$extra_id[test$ref %in% c("d", "f")] gives 2, 1 but I want them to always get the same household. \$\endgroup\$ Commented Dec 11, 2017 at 15:02
  • \$\begingroup\$ @GregF. Ok... One more try... See if my updated answer does the job? \$\endgroup\$ Commented Dec 12, 2017 at 0:54

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.