Comparing univariable regression results between complete-case and multiple imput data sets

Question 1

Dear Code Review community,

I try to develop a reproducible code to pool univariable regression estimates across multiple imputed data sets in R.

I therefore prepared the following code that is supposed to:

perform univariable regression for each variable described as "predictor" across the multiple imput datasets,
pool the estimates and confidence intervals and
report them in a data frame.

I would like to compare CC-datasets estimates to MI estimates in the long run.

I therefore prepared the following code:

library(ggplot2)
library(mice)
library(sandwich)
library(miceadds)
data <- ggplot2::diamonds
set. Seed(123)
data <- as.data.frame(
 lapply(
 data, 
 function(cc) 
 cc[
 sample(c(TRUE, NA),
 prob = c(0.85, 0.15),
 size = length(cc), 
 replace = TRUE)
 ]
 )
)
data.imp <- mice::mice(data = data, m = 2, maxit = 5, seed = 1234, print = FALSE)
#> Warning: Number of logged events: 8
data.list <- miceadds::mids2datlist(data.imp)
response <- "price"
predictors <- c("clarity", "depth", "x")
df_total <- data.frame()
for (i in predictors) {
 formula <- as.formula(paste(response, " ~ ", i))
 unimodel <- with(data.list, lm(formula))
 betas <- lapply(unimodel, coef)
 vars <- lapply(unimodel, FUN = function(x) {sandwich::sandwichvcovCL(x)})
 result <- summary(pool_mi(betas, vars))
 df <- data.frame(result)
 df_total <- rbind(df_total, df)
}
#> Error in eval(predvars, data, env): Objekt 'price' nicht gefunden
df_total
# Created on 2023年07月28日 with reprex v2.0.2

Any thoughts on how I could improve my workflow there?

Question 2

I think this submission is ready for review. On the face of it the 'price' not found error would suggest the code's not producing the desired output, since "(2.) Code must not error". But here I believe that's part of imputing deliberately missing figures. There's no Answers yet, so there's still time to revise the Question. I recommend you clarify that "missing" behavior. Also, it would be very helpful to define one or more functions. For one thing you can put documentation on them, including input / output / behavior.

Question 3

Avoid building a data frame in a loop. See Patrick Burns's The R Inferno, Circle 2 - Growing Objects. Instead, use lapply to build a list of data frames made easier with a defined method and then rbind once outside loop. Also, consider reformulate to build formula and remove extraneous wording in second lapply call:

...
response <- "price"
predictors <- c("clarity", "depth", "x")
run_comparison <- function(i) {
 unimodel <- with(data.list, lm(reformulate(i, response)))
 betas <- lapply(unimodel, coef)
 vars <- lapply(unimodel, sandwich::vcovCL)
 result <- summary(pool_mi(betas, vars))
 return(data.frame(predictor = i, result))
}
df_list <- lapply(predictors, run_comparison)
df_total <- do.call(rbind, unname(df_list))

Question 4

There is one problem with the code. Where you turn the formula from text to a formula matters. I think it has to do with the environment that the formula inherits. If you build and parse the formula outside of the linear model function as you do in your code, the formula cannot see the data passed down to lm(). So, the following will not work:

form <- as.formula(paste0(response, "~", i))
with(data.list, lm(form))

However, if you parse the formula inside the lm() function, it inherits the right environment, so the following works.

form <- paste0(response, "~", i)
with(data.list, lm(as.formula(form)))

Also, when I try to run the code, I get an error that sandwichvcovCL doesn't exist, it should be sandwich::vcovCL().

The things above throw errors. Another thing you may want to do is to add a variable to your dataset identifying the variable you used in the model. You could also turn the data into a tibble in which case you could easily make the row names of the result a variable. I would suggest something like this:

 df <- tibble::as_tibble(data.frame(result), rownames="param") %>% 
 mutate(variable = i) %>% 
 select(variable, everything())

The full code would look like this:

library(ggplot2)
library(mice)
library(sandwich)
library(miceadds)
library(dplyr)
data <- ggplot2::diamonds
set.seed(123)
data <- as.data.frame(
 lapply(
 data, 
 function(cc) 
 cc[
 sample(c(TRUE, NA),
 prob = c(0.85, 0.15),
 size = length(cc), 
 replace = TRUE)
 ]
 )
)
data.imp <- mice::mice(data = data, m = 2, maxit = 5, seed = 1234, print = FALSE)
data.list <- miceadds::mids2datlist(data.imp)
response <- "price"
predictors <- c("clarity", "depth", "x")
df_total <- NULL
for (i in predictors) {
 myenv <- environment()
 form <- paste0(response, "~", i)
 unimodel <- with(data.list, lm(as.formula(form)))
 unimodel <- lapply(data.list, function(d)lm(form, data=d))
 betas <- lapply(unimodel, coef)
 vars <- lapply(unimodel, FUN = function(x) {sandwich::vcovCL(x)})
 result <- summary(pool_mi(betas, vars))
 df <- tibble::as_tibble(data.frame(result), rownames="param") %>% 
 mutate(variable = i) %>% 
 select(variable, everything())
 df_total <- rbind(df_total, df)
}

Parfait Parfait 9806 silver badges15 bronze badges · Answer 1 · 2023-11-25 22:11:49Z

Avoid building a data frame in a loop. See Patrick Burns's The R Inferno, Circle 2 - Growing Objects. Instead, use lapply to build a list of data frames made easier with a defined method and then rbind once outside loop. Also, consider reformulate to build formula and remove extraneous wording in second lapply call:

...
response <- "price"
predictors <- c("clarity", "depth", "x")
run_comparison <- function(i) {
 unimodel <- with(data.list, lm(reformulate(i, response)))
 betas <- lapply(unimodel, coef)
 vars <- lapply(unimodel, sandwich::vcovCL)
 result <- summary(pool_mi(betas, vars))
 return(data.frame(predictor = i, result))
}
df_list <- lapply(predictors, run_comparison)
df_total <- do.call(rbind, unname(df_list))

DaveArmstrong DaveArmstrong 1313 bronze badges · Answer 2 · 2023-08-09 04:42:40Z

There is one problem with the code. Where you turn the formula from text to a formula matters. I think it has to do with the environment that the formula inherits. If you build and parse the formula outside of the linear model function as you do in your code, the formula cannot see the data passed down to lm(). So, the following will not work:

form <- as.formula(paste0(response, "~", i))
with(data.list, lm(form))

However, if you parse the formula inside the lm() function, it inherits the right environment, so the following works.

form <- paste0(response, "~", i)
with(data.list, lm(as.formula(form)))

Also, when I try to run the code, I get an error that sandwichvcovCL doesn't exist, it should be sandwich::vcovCL().

The things above throw errors. Another thing you may want to do is to add a variable to your dataset identifying the variable you used in the model. You could also turn the data into a tibble in which case you could easily make the row names of the result a variable. I would suggest something like this:

 df <- tibble::as_tibble(data.frame(result), rownames="param") %>% 
 mutate(variable = i) %>% 
 select(variable, everything())

The full code would look like this:

library(ggplot2)
library(mice)
library(sandwich)
library(miceadds)
library(dplyr)
data <- ggplot2::diamonds
set.seed(123)
data <- as.data.frame(
 lapply(
 data, 
 function(cc) 
 cc[
 sample(c(TRUE, NA),
 prob = c(0.85, 0.15),
 size = length(cc), 
 replace = TRUE)
 ]
 )
)
data.imp <- mice::mice(data = data, m = 2, maxit = 5, seed = 1234, print = FALSE)
data.list <- miceadds::mids2datlist(data.imp)
response <- "price"
predictors <- c("clarity", "depth", "x")
df_total <- NULL
for (i in predictors) {
 myenv <- environment()
 form <- paste0(response, "~", i)
 unimodel <- with(data.list, lm(as.formula(form)))
 unimodel <- lapply(data.list, function(d)lm(form, data=d))
 betas <- lapply(unimodel, coef)
 vars <- lapply(unimodel, FUN = function(x) {sandwich::vcovCL(x)})
 result <- summary(pool_mi(betas, vars))
 df <- tibble::as_tibble(data.frame(result), rownames="param") %>% 
 mutate(variable = i) %>% 
 select(variable, everything())
 df_total <- rbind(df_total, df)
}

Stack Exchange Network

Comparing univariable regression results between complete-case and multiple imput data sets

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions