I am trying to use EM (Expectation-maximization) to fill in missing data in R, but am not sure how to model/code it for my specific case. I am generally trying to follow the example format used in this post (https://stats.stackexchange.com/questions/564630/how-does-expectation-maximization-compute-missing-data), with the following structure of the EM process:
library("MASS")
set.seed(123)
n <- 1000
theta <- 20
beta0 <- 3
beta1 <- 0.2
x <- runif(n=n, min=1, max=10)
mu <- exp(beta0 + beta1*x)
y <- rnegbin(n=n, mu=mu, theta=theta)
idxm <- which(y %in% sample(y, 100))
y0 <- y[-idxm]
x0 <- x[-idxm]
logy0 <- log(y0)
# initialize parameter values
betaInit <- unname(lm(logy0 ~ x0)$coefficients)
mu0 <- exp(betaInit[1] + betaInit[2]*x0)
thetaInit <- mean(mu0^2/(mean((mu0 - y0)^2) - mu0))
params1 <- c(thetaInit, betaInit)
params0 <- numeric(3)
# initialize unobserved values
y[idxm] <- as.integer(exp(params1[2] + params1[3]*x[idxm]))
# negative log-likelihood
fNLL <- function(params) {
-sum(dnbinom(y, params[1], mu = exp(params[2] + params[3]*x), log = TRUE))
}
# iteratively find MLE on imputed data and then re-impute data
while (max(abs(params1 - params0)) > 1e-9) {
params1 <- optim(params0 <- params1, fNLL, lower = 0, method = "L-BFGS-B")$par
y[idxm] <- as.integer(exp(params1[2] + params1[3]*x[idxm]))
}
However, in my case, my data is a bit different. I am trying to fill in missingness in 2 variables (x1 and x2) that will be predictors (betas) in my regression. I do not know the parameters of the complete data for x1 and x2, but I do have complete data for x3, x4, and x5, all additional predictive variables in the regression model I will ultimately run with the complete data. My ultimate regression is very simple and looks like the following:
model = lm(data, y ~ x1 + x2 + x3 + x4 + x5)
How can I iteratively update both x1 and x2 using EM to fill in their missingness, similar to the above example? Do I need to use information from x3 x4 and x5 (like multiple imputation?) or can I use a simple EM algorithm with the above and update x1 and x2 simultaneously? In my case, I do not have complete data (as this example does)
UPDATE
I have worked on the code and am a bit closer, but still am not sure how to program this. My code currently looks like the following:
library(tidyverse)
# Simulate data with missing values for two variables
n <- 100
prop_missing <- 0.2
observed_data <- data.frame(
x1 = c(rnorm(n * (1 - prop_missing)), rep(NA, n * prop_missing)),
x2 = c(rnorm(n * (1 - prop_missing)), rep(NA, n * prop_missing))
)
# Initial parameters
initial_mean_x1 <- mean(observed_data$x1,na.rm = T)
initial_sd_x1 <- sd(observed_data$x1,na.rm = T)
initial_mean_x2 <- mean(observed_data$x2,na.rm = T)
initial_sd_x2 <- sd(observed_data$x2,na.rm = T)
#Missing value row IDs
missing_id_x1 = observed_data %>% dplyr::select(x1) %>% rowid_to_column() %>% filter(is.na(x1)) %>% dplyr::select(rowid) %>% pull()
missing_id_x2 = observed_data %>% dplyr::select(x2) %>% rowid_to_column() %>% filter(is.na(x2)) %>% dplyr::select(rowid) %>% pull()
# E-step: Impute missing values based on the current parameters
imputed_data <- observed_data
imputed_data$x1[missing_id_x1] <- rnorm(length(missing_id_x1),mean = initial_mean_x1, sd = initial_sd_x1)
imputed_data$x2[missing_id_x2] <- rnorm(length(missing_id_x2),mean = initial_mean_x2, sd = initial_sd_x2)
# M-step: Update parameters by maximizing the log likelihood using optim
log_likelihood <- function(params, data, variable) {
mean_val <- params[1]
sd_val <- params[2]
log_likelihood <- sum(log(dnorm(data[!is.na(data)], mean_val, sd_val)))
return(-log_likelihood) # Minimize the negative log likelihood
}
# Update for x1
result_x1 <- optim(c(initial_mean_x1, initial_sd_x1), log_likelihood, data = observed_data$x1, variable = "x1")
# Extract the updated parameters for x1
updated_mean_x1 <- result_x1$par[1]
updated_sd_x1 <- result_x1$par[2]
# Update for x2
result_x2 <- optim(c(initial_mean_x2, initial_sd_x2), log_likelihood, data = observed_data$x2, variable = "x2")
# Extract the updated parameters for x2
updated_mean_x2 <- result_x2$par[1]
updated_sd_x2 <- result_x2$par[2]
# Output the updated parameters
paste0("Updated Parameters for x1")
paste0("Mean:", updated_mean_x1)
paste0("SD:", updated_sd_x1)
paste0("Updated Parameters for x2")
paste0("Mean:", updated_mean_x2)
paste0("SD:", updated_sd_x2)
# Output the imputed data for one iteration
paste0("Imputed Data:")
paste0(imputed_data)
As I understand, I do not need to use the other variables in the EM algorithm to impute the missing data, so I only include variables x1 and x2 in this sample data.
I extract the parameters for x1 and x2, and sample data to fill in the missingness for each. This I understand to be the E-step.
The M-step is where I am stuck. I know that I am supposed to maximize the logged likelihood to update the parameters and move toward more accurate sample draws, until I reach a tolerance threshold. But this is the tolerance for what, exactly? Is it the change in the logged likelihood, or the parameters?
If so, how do I iterate this? Do I update the same values again with the new optimized parameters and keep going until the tolerance is reached?
Update 2
I have found a code structure for updating a single variable that seems to acheive what I want to. The below is what I need to do to update values:
em.norm <- function(Y,mit,sit)
{Yobs <- y[!is.na(Y)] #Extract the observed values
Ymis <- Y[is.na(Y)] #Extract the missing values
n <- length(c(Yobs,Ymis)) #Length of total observations
r <- length(Yobs) #length of available observations
#Initial values
mut <- mit
sit <- sit
#Defin log-likelihood function
ll = function(y,mu,sigma2,n)
{
-0.5*n*log(2*pi*sigma2) - 0.5*sum((y-mu)^2)/sigma2
}
#comput log-likelihood for initial values
#and ignoring missing data mechanism
lltm1 <- ll(Yobs,mut,sit,n)
repeat{
#E-step
EY <- sum(Yobs) + (n-r)*mut
EY2 <- sum(Yobs^2) + (n-r)*(mut^2 + sit)
#M-step
mut1 <- EY/n
sit1 <- EY2/n - mut1^2
#update parameter values
mut <- mut1
sit <- sit1
#Compute log-likelihood using current estimates
#and ignoring missing data
llt <- ll(Yobs,mut,sit,n)
#Print current parameter values
cat(mut,sit,llt,"\n")
#Stop if converged
if (abs(lltm1 - llt) < 0.001) break
lltm1 <- llt
}
#fill in missing values with new mu #
return(c(mut,sit))
}
#function em norm ends here
x <- rnorm(20,5)
x[16:20] <- NA
xobs <- x[!is.na(x)]
xmis <- x[is.na(x)]
r <- length(xobs)
mit <- mean(xobs)
sit <- var(xobs)*(r-1)/r
em.norm(x,mit,sit)
However this does not seem to show the resultant completed dataset. How can I access the actual filled data, and how can I modify this to update to variables at once, or would I need to run this function separately for another variable and rejoin the datasets together?
-
Understanding the EM algorithm may be a better fit for stats.stackexchange than Stack Overflow. Any reason you want to code this yourself instead of using one of the many packages that for missing data imputation?Gregor Thomas– Gregor Thomas2023年12月13日 18:27:20 +00:00Commented Dec 13, 2023 at 18:27
-
Just read your linked question so I'll adjust to say: If you're having trouble understanding the theory of the EM Algorithm (which is what it sounds like to me), I think you'll get much better help asking a new question at stats.stackexchange than here.Gregor Thomas– Gregor Thomas2023年12月13日 18:29:33 +00:00Commented Dec 13, 2023 at 18:29
-
Glancing through the CRAN Task View on Missing Data I linked above, the RMixtCompIO package says it "includes models for real, categorical, counting, functional and ranking data".Gregor Thomas– Gregor Thomas2023年12月13日 18:32:25 +00:00Commented Dec 13, 2023 at 18:32
-
Thanks for your questions! The reasons I am trying to program this manually is to compare against the packages that perfrom imputation manually, since there are different characteristics. I tried this on stats.stackexchange, but unfortunately it was removed because it was deemed too code-heavy.flâneur– flâneur2023年12月13日 18:34:02 +00:00Commented Dec 13, 2023 at 18:34