1

I am trying to apply the expectation-maximization algorithm to estimate missing count data but all the packages in R, such as missMethods, assume a multivariate Gaussian distribution. How would I apply the expectation-maximization algorithm to estimate missing count data assuming a Poisson distribution?

Say we have data that look like this:

x <- c(100, 96, 79, 109, 111, NA, 93, 95, 119, 90, 121, 96, NA, 
 NA, 85, 95, 110, 97, 87, 104, 101, 87, 87, NA, 89, NA, 
 113, NA, 95, NA, 119, 115, NA, 105, NA, 80, 90, 108, 90, 
 99, 111, 93, 99, NA, 87, 89, 87, 126, 101, 106)

Applying impute_EM using missMethods (missMethods::impute_EM(x, stochastic = FALSE)) gives an answer but the data are not continuous but discrete.

I understand that questions like these require a minimum, reproducible example, but I honestly do not know where to start. Even suggested reading to point me in the right direction would be helpful.

IRTFM
264k22 gold badges381 silver badges503 bronze badges
asked Feb 9, 2022 at 16:31
2
  • Wouldn't you expect that the result of an imputation for count data would be "discrete"? After all you would in many cases be submitting that output to functions which will at the very least be giving you warnings if you have non-integer inputs and at worst simply erroring out. Commented Feb 9, 2022 at 17:01
  • I apologize but I am a little confused with your comment. Imputation based on the mean or some other statistic is not doing the same thing as expectation maximization. I wanted to use a more rigorous method to estimate missing values before submitting it to another analysis. Commented Feb 10, 2022 at 22:08

1 Answer 1

1

Defining x0:

x0 <- x[!is.na(x)]

The Jeffreys/reference prior for a Poisson distribution with mean lambda is 1/sqrt(lambda). From the observed values, this results in lambda having a gamma reference posterior with a shape parameter sum(x0) + 0.5 and a rate parameter 1/length(x0). You could take n samples of lambda with:

lambda <- rgamma(n, sum(x0) + 0.5, length(x0))

Then sample n missing values (xm) with

xm <- rpois(n, lambda)

Alternatively, since a Gamma-Poisson compound distribution can be formulated as a negative binomial (after integrating out lambda):

xm <- rnbinom(n, sum(x0) + 0.5, length(x0)/(length(x0) + 1L))

As a function:

MI_poisson <- function(x, n) {
 x0 <- x[!is.na(x)]
 rbind(matrix(x0, ncol = n, nrow = length(x0)),
 matrix(rnbinom(n*(length(x) - length(x0)), sum(x0) + 0.5, length(x0)/(length(x0) + 1L)), ncol = n))
}

This will return a matrix with n columns where each column contains the original vector x with all NA values imputed. Each column could be used separately in further analysis, then the results can be aggregated.

answered Feb 9, 2022 at 17:28
Sign up to request clarification or add additional context in comments.

10 Comments

You do realize this is not EM imputation yet the question clearly talks of EM imputation.
@Onyambu wouldn't EM result in the mean of the non-NA values since it's a single sample space and the ML is the mean?
The issue is that this method is not capturing the inherent stochasticity present in the missing data. This is why I wanted to use EM. Is it that difficult to adapt the EM algorithm to an exponential distribution that is not normal?
Are you looking to get a single value to replace all the NA values (single imputation), or do you want each NA to be replaced independently with a random value (multiple imputation)?
My understanding is that EM is used for single imputation. See the modified answer for a multiple imputation solution.
|

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.