I am trying to apply the expectation-maximization algorithm to estimate missing count data but all the packages in R, such as missMethods, assume a multivariate Gaussian distribution. How would I apply the expectation-maximization algorithm to estimate missing count data assuming a Poisson distribution?
Say we have data that look like this:
x <- c(100, 96, 79, 109, 111, NA, 93, 95, 119, 90, 121, 96, NA,
NA, 85, 95, 110, 97, 87, 104, 101, 87, 87, NA, 89, NA,
113, NA, 95, NA, 119, 115, NA, 105, NA, 80, 90, 108, 90,
99, 111, 93, 99, NA, 87, 89, 87, 126, 101, 106)
Applying impute_EM using missMethods (missMethods::impute_EM(x, stochastic = FALSE)) gives an answer but the data are not continuous but discrete.
I understand that questions like these require a minimum, reproducible example, but I honestly do not know where to start. Even suggested reading to point me in the right direction would be helpful.
-
Wouldn't you expect that the result of an imputation for count data would be "discrete"? After all you would in many cases be submitting that output to functions which will at the very least be giving you warnings if you have non-integer inputs and at worst simply erroring out.IRTFM– IRTFM2022年02月09日 17:01:13 +00:00Commented Feb 9, 2022 at 17:01
-
I apologize but I am a little confused with your comment. Imputation based on the mean or some other statistic is not doing the same thing as expectation maximization. I wanted to use a more rigorous method to estimate missing values before submitting it to another analysis.Constantin– Constantin2022年02月10日 22:08:41 +00:00Commented Feb 10, 2022 at 22:08
1 Answer 1
Defining x0:
x0 <- x[!is.na(x)]
The Jeffreys/reference prior for a Poisson distribution with mean lambda is 1/sqrt(lambda). From the observed values, this results in lambda having a gamma reference posterior with a shape parameter sum(x0) + 0.5 and a rate parameter 1/length(x0). You could take n samples of lambda with:
lambda <- rgamma(n, sum(x0) + 0.5, length(x0))
Then sample n missing values (xm) with
xm <- rpois(n, lambda)
Alternatively, since a Gamma-Poisson compound distribution can be formulated as a negative binomial (after integrating out lambda):
xm <- rnbinom(n, sum(x0) + 0.5, length(x0)/(length(x0) + 1L))
As a function:
MI_poisson <- function(x, n) {
x0 <- x[!is.na(x)]
rbind(matrix(x0, ncol = n, nrow = length(x0)),
matrix(rnbinom(n*(length(x) - length(x0)), sum(x0) + 0.5, length(x0)/(length(x0) + 1L)), ncol = n))
}
This will return a matrix with n columns where each column contains the original vector x with all NA values imputed. Each column could be used separately in further analysis, then the results can be aggregated.
10 Comments
NA values (single imputation), or do you want each NA to be replaced independently with a random value (multiple imputation)?