Suppose we would like to maximize a likelihood function $p(\mathbf x, \mathbf z \mid \theta)$, where $\mathbf x$ is observed, $\mathbf z$ is a latent variable, and $\theta$ is the collection of model parameters. We would like to use expectation maximization (EM) for this.
If I understand it correctly, we optimize the marginal likelihood $p(\mathbf x|\theta)$ as $\mathbf z$ is unobserved. However, this is counterintuitive to me.
If $\mathbf z$ is unobserved, I think of it as another model parameter. Therefore, for maximum likelihood estimation, we should find $\mathbf z, \theta$ such that $p(\mathbf x|\mathbf z, \theta)$ is maximized.
So, my question is why is it standard to optimize $p(\mathbf x|\theta)$ instead of $p(\mathbf x|\mathbf z, \theta)$?
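Written out, the contrast is between maximizing the marginal likelihood (summing $\mathbf z$ out, or integrating if it is continuous) and jointly maximizing over $\mathbf z$ as if it were a parameter:
$$\max_{\theta}\; p(\mathbf x \mid \theta) \;=\; \max_{\theta}\; \sum_{\mathbf z} p(\mathbf x, \mathbf z \mid \theta) \qquad\text{versus}\qquad \max_{\theta,\,\mathbf z}\; p(\mathbf x \mid \mathbf z, \theta).$$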
I have searched through several explanations of EM, but could not find an answer to this question.
1 Answer
If you don't know $z$, you cannot condition on it via $p(x \mid z, \theta)$. What EM does instead is "hallucinate" $z$ for a lower-bound function, using the parameters obtained in the previous step.
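In symbols, the lower bound referred to here is the standard EM decomposition: for any distribution $q(z)$,
$$\log p(x \mid \theta) \;=\; \underbrace{\mathbb E_{q(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\mathcal L(q,\,\theta)} \;+\; \operatorname{KL}\!\big(q(z)\,\|\,p(z \mid x, \theta)\big).$$
The E-step "hallucinates" $z$ by setting $q(z) = p(z \mid x, \theta^{\text{old}})$, which makes the KL term zero so the bound is tight at the current parameters; the M-step then maximizes $\mathcal L(q, \theta)$ over $\theta$.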
> So, my question is why is it standard to optimize $p(x \mid \theta)$ instead of $p(x \mid z, \theta)$?
Because of the missing-data problem: $z$ is not observed, so it is missing from our training data.
Ultimately we are still optimizing $p(x \mid \theta)$, but that objective can have multiple local maxima and no closed-form solution. By introducing $q$, we can turn it into a sequence of subproblems that can each be optimized at every step, and the procedure is guaranteed to converge to a local optimum (which may or may not be the global optimum).
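As a concrete sketch (not from the original answer), assuming a two-component 1-D Gaussian mixture where $z$ is the unobserved component label, the alternation looks like this: the E-step fills in $q(z)$ from the previous parameters, and the M-step re-estimates $\theta$ in closed form.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Illustrative EM sketch for a two-component 1-D Gaussian mixture.

    x : (n,) array of observations; the component label z is latent.
    Returns the estimated parameters theta = (pi, mu, sigma).
    """
    rng = np.random.default_rng(seed)
    # Initialize theta = (mixing weights, means, standard deviations).
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    sigma = np.array([x.std(), x.std()])

    for _ in range(n_iter):
        # E-step: "hallucinate" z via q(z) = p(z | x, theta_old),
        # i.e. each component's responsibility for each data point.
        dens = np.stack([
            pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                  / (sigma[k] * np.sqrt(2 * np.pi))
            for k in range(2)
        ], axis=1)                                  # shape (n, 2)
        q = dens / dens.sum(axis=1, keepdims=True)

        # M-step: maximize the lower bound over theta in closed form.
        nk = q.sum(axis=0)
        pi = nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    return pi, mu, sigma

# Usage: data drawn from two Gaussians; EM estimates theta without ever observing z.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
    print(em_gmm_1d(x))
```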
References:
1. Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897–899.