I am trying to assign a 0 or 1 with some stochasticity based on another column in a data frame (outcome). If outcome == 1, the new column exposure should equal 1 about 90% of the time. Conversely, if outcome == 0, it should equal 1 about 20% of the time.
I am currently doing this with a for loop, but I am wondering if there are more efficient or elegant ways to accomplish this (e.g. via vectorization).
To be clear, though the data frame is labeled example_data, this is not an example: it's the test data I am generating to test a series of functions related to GEE models.
set.seed(05062020)
example_data <- data.frame(id = as.factor(rep(sprintf("Record %s",seq(1:50)), each = 2)),
                           outcome = as.factor(rep(sample(0:1, 50, prob = c(0.8,0.2), replace = TRUE), each = 2)))
for (i in 1:nrow(example_data)){
  example_data$exposure[i] <- ifelse(example_data$outcome[i] == 1,
                                     sample(0:1, 1, prob = c(0.1, 0.9)),
                                     sample(0:1, 1, prob = c(0.8, 0.2)))
}
2 Answers
example_data$exposure <- ifelse(example_data$outcome == 1,
                                sample(0:1, nrow(example_data), prob = c(0.1, 0.9), replace = TRUE),
                                sample(0:1, nrow(example_data), prob = c(0.8, 0.2), replace = TRUE))
ifelse is vectorized, so we can do this with one function call.
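One way to sanity-check the vectorized version is to run it on a much larger sample and look at the observed rates per outcome level. The sketch below is only an illustration: check_data, n, and the seed are hypothetical names and values that do not appear in the original post.

```r
# Hypothetical larger data set (not from the original post) so the rates are stable
set.seed(1)
n <- 1e5
check_data <- data.frame(
  outcome = as.factor(sample(0:1, n, prob = c(0.8, 0.2), replace = TRUE))
)

# Same vectorized ifelse as in the answer: draw both candidate vectors,
# then pick element-wise depending on outcome
check_data$exposure <- ifelse(check_data$outcome == 1,
                              sample(0:1, n, prob = c(0.1, 0.9), replace = TRUE),
                              sample(0:1, n, prob = c(0.8, 0.2), replace = TRUE))

# Proportion of exposure == 1 within each outcome level;
# these should land near 0.2 (outcome 0) and 0.9 (outcome 1)
rates <- tapply(check_data$exposure, check_data$outcome, mean)
print(rates)
```

Note that this approach draws two full-length vectors and discards the unused element from each pair, which is harmless for data of this size.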
Comment: Ack, I tried this but forgot to add replace = TRUE, and so it didn't work. Thank you! (jpsmith, Dec 14, 2021 at 1:24)
Sampling coin flips with data-dependent probabilities can often be done elegantly by thresholding a uniform random variable:
example_data$exposure <-
  as.numeric(runif(nrow(example_data)) <= 0.2 + 0.7 * (example_data$outcome == 1))
So basically the threshold is 0.2 when example_data$outcome == 0 and 0.9 when example_data$outcome == 1. I used 0.7*(example_data$outcome==1) instead of just 0.7*example_data$outcome because example_data$outcome is defined as a factor in your data frame, and the as.numeric converts TRUE/FALSE into 1/0.
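The factor point can be seen on a toy vector. This is a minimal sketch with hypothetical values (the variable names outcome, bad, codes, p, and draw are made up for illustration); the final line shows rbinom with a per-row probability as one possible equivalent way to draw the same kind of coin flips, not as the method used above.

```r
# Toy factor with hypothetical values, mimicking example_data$outcome
outcome <- factor(c(0, 1, 1, 0))

# Arithmetic on a factor is not meaningful: R warns and returns NA
bad <- suppressWarnings(0.7 * outcome)

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(outcome)      # 1 2 2 1

# Comparing to 1 yields a logical vector, which arithmetic coerces to 0/1
p <- 0.2 + 0.7 * (outcome == 1)   # per-row probability of exposure == 1

# Alternative sketch: pass the per-row probabilities straight to rbinom
draw <- rbinom(length(p), size = 1, prob = p)
```

Either way, the key step is turning the factor into a per-row probability before any random numbers are drawn.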