2
\$\begingroup\$

Dear Code Review community,

I try to develop a reproducible code to pool univariable regression estimates across multiple imputed data sets in R.

I therefore prepared the following code that is supposed to:

  1. perform univariable regression for each variable described as "predictor" across the multiple imput datasets,
  2. pool the estimates and confidence intervals and
  3. report them in a data frame.

I would like to compare CC-datasets estimates to MI estimates in the long run.

I therefore prepared the following code:

library(ggplot2)
library(mice)
library(sandwich)
library(miceadds)
data <- ggplot2::diamonds
set. Seed(123)
data <- as.data.frame(
 lapply(
 data, 
 function(cc) 
 cc[
 sample(c(TRUE, NA),
 prob = c(0.85, 0.15),
 size = length(cc), 
 replace = TRUE)
 ]
 )
)
data.imp <- mice::mice(data = data, m = 2, maxit = 5, seed = 1234, print = FALSE)
#> Warning: Number of logged events: 8
data.list <- miceadds::mids2datlist(data.imp)
response <- "price"
predictors <- c("clarity", "depth", "x")
df_total <- data.frame()
for (i in predictors) {
 formula <- as.formula(paste(response, " ~ ", i))
 unimodel <- with(data.list, lm(formula))
 betas <- lapply(unimodel, coef)
 vars <- lapply(unimodel, FUN = function(x) {sandwich::sandwichvcovCL(x)})
 result <- summary(pool_mi(betas, vars))
 df <- data.frame(result)
 df_total <- rbind(df_total, df)
}
#> Error in eval(predvars, data, env): Objekt 'price' nicht gefunden
df_total
# Created on 2023年07月28日 with reprex v2.0.2

Any thoughts on how I could improve my workflow there?

Toby Speight
87.1k14 gold badges104 silver badges322 bronze badges
asked Aug 2, 2023 at 9:53
\$\endgroup\$
1
  • \$\begingroup\$ I think this submission is ready for review. On the face of it the 'price' not found error would suggest the code's not producing the desired output, since "(2.) Code must not error". But here I believe that's part of imputing deliberately missing figures. There's no Answers yet, so there's still time to revise the Question. I recommend you clarify that "missing" behavior. Also, it would be very helpful to define one or more functions. For one thing you can put documentation on them, including input / output / behavior. \$\endgroup\$ Commented Aug 2, 2023 at 18:40

2 Answers 2

3
\$\begingroup\$

Avoid building a data frame in a loop. See Patrick Burns's The R Inferno, Circle 2 - Growing Objects. Instead, use lapply to build a list of data frames made easier with a defined method and then rbind once outside loop. Also, consider reformulate to build formula and remove extraneous wording in second lapply call:

...
response <- "price"
predictors <- c("clarity", "depth", "x")
run_comparison <- function(i) {
 unimodel <- with(data.list, lm(reformulate(i, response)))
 betas <- lapply(unimodel, coef)
 vars <- lapply(unimodel, sandwich::vcovCL)
 result <- summary(pool_mi(betas, vars))
 return(data.frame(predictor = i, result))
}
df_list <- lapply(predictors, run_comparison)
df_total <- do.call(rbind, unname(df_list))
answered Nov 25, 2023 at 22:11
\$\endgroup\$
2
\$\begingroup\$

There is one problem with the code. Where you turn the formula from text to a formula matters. I think it has to do with the environment that the formula inherits. If you build and parse the formula outside of the linear model function as you do in your code, the formula cannot see the data passed down to lm(). So, the following will not work:

form <- as.formula(paste0(response, "~", i))
with(data.list, lm(form))

However, if you parse the formula inside the lm() function, it inherits the right environment, so the following works.

form <- paste0(response, "~", i)
with(data.list, lm(as.formula(form)))

Also, when I try to run the code, I get an error that sandwichvcovCL doesn't exist, it should be sandwich::vcovCL().

The things above throw errors. Another thing you may want to do is to add a variable to your dataset identifying the variable you used in the model. You could also turn the data into a tibble in which case you could easily make the row names of the result a variable. I would suggest something like this:

 df <- tibble::as_tibble(data.frame(result), rownames="param") %>% 
 mutate(variable = i) %>% 
 select(variable, everything())

The full code would look like this:

library(ggplot2)
library(mice)
library(sandwich)
library(miceadds)
library(dplyr)
data <- ggplot2::diamonds
set.seed(123)
data <- as.data.frame(
 lapply(
 data, 
 function(cc) 
 cc[
 sample(c(TRUE, NA),
 prob = c(0.85, 0.15),
 size = length(cc), 
 replace = TRUE)
 ]
 )
)
data.imp <- mice::mice(data = data, m = 2, maxit = 5, seed = 1234, print = FALSE)
data.list <- miceadds::mids2datlist(data.imp)
response <- "price"
predictors <- c("clarity", "depth", "x")
df_total <- NULL
for (i in predictors) {
 myenv <- environment()
 form <- paste0(response, "~", i)
 unimodel <- with(data.list, lm(as.formula(form)))
 unimodel <- lapply(data.list, function(d)lm(form, data=d))
 betas <- lapply(unimodel, coef)
 vars <- lapply(unimodel, FUN = function(x) {sandwich::vcovCL(x)})
 result <- summary(pool_mi(betas, vars))
 df <- tibble::as_tibble(data.frame(result), rownames="param") %>% 
 mutate(variable = i) %>% 
 select(variable, everything())
 df_total <- rbind(df_total, df)
}
answered Aug 9, 2023 at 4:42
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.