Attention conservation notice: Re-worked teaching material, plus link-dumps.\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]
We would like to predict a scalar random variable $Y$ from a $d$-dimensional vector of other random variables $\vec{X}$. We decide that we are only interested in linear predictions, of the form \( b_0 + \vec{b} \cdot \vec{X} \). We want the optimal linear predictor, in the mean-squared sense, i.e., we want to minimize the expected squared error $\Expect{(Y-(b_0 + \vec{b} \cdot \vec{X}))^2}$.
Let's do this first for the case $d=1,ドル where the book-keeping is simple. We'll use the following basic bits of algebra about random variables: expectations add; adding a constant to a random variable shifts its expectation but leaves its variance alone; $\Var{bX} = b^2 \Var{X}$ and $\Cov{bX, Y} = b \Cov{X, Y}$; and, for any random variable $Z,ドル $\Expect{Z^2} = \Var{Z} + \left(\Expect{Z}\right)^2$. Applying these to the prediction error, \begin{eqnarray} \Expect{(Y - (b_0 + b_1 X))^2} & = & \Var{Y - b_1 X} + \left(\Expect{Y - b_1 X} - b_0\right)^2\\ & = & \Var{Y} + b_1^2 \Var{X} - 2 b_1 \Cov{X, Y} + \left(\Expect{Y} - b_0 - b_1 \Expect{X}\right)^2 \end{eqnarray} Whatever we do with $b_1,ドル we can make the last, never-negative term exactly zero by setting $b_0 = \Expect{Y} - b_1 \Expect{X},ドル so we should [1]. Minimizing what remains over $b_1$ then gives the optimal coefficients \begin{eqnarray} \beta_1 & = & \frac{\Cov{X, Y}}{\Var{X}}\\ \beta_0 & = & \Expect{Y} - \beta_1 \Expect{X} \end{eqnarray}
The exact same approach works if $d > 1$. We just have to replace scalar variances and covariances with the appropriate matrices: \begin{eqnarray} \Expect{(Y-(b_0 + \vec{b} \cdot \vec{X}))^2} & = & \Var{Y} + \vec{b} \cdot \Var{\vec{X}} \vec{b} - 2 \vec{b} \cdot \Cov{\vec{X}, Y} + \left(\Expect{Y} - b_0 - \vec{b} \cdot \Expect{\vec{X}}\right)^2\\ \vec{\beta} & = & \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\ \beta_0 & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}} \end{eqnarray} and the optimal linear prediction is \[ \beta_0 + \vec{\beta} \cdot \vec{X} = \Expect{Y} + \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} \cdot (\vec{X} - \Expect{\vec{X}}) \]
Plugging back in to the expected squared error, we get \begin{eqnarray} \Expect{(Y-\beta_0 - \vec{\beta} \cdot \vec{X})^2} & = & \Var{Y - \vec{\beta} \cdot \vec{X}}\\ & = & \Var{Y} + \vec{\beta} \cdot \Var{\vec{X}} \vec{\beta} - 2\vec{\beta} \cdot\Cov{\vec{X}, Y}\\ & = & \Var{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\right) \cdot \Var{\vec{X}} \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\ & & - 2 \left(\Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\right) \cdot \Cov{\vec{X}, Y}\\ & = & \Var{Y} - \Cov{\vec{X},Y} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} \end{eqnarray} Since $ \Var{\vec{X}}$ is a variance matrix, it's positive semi-definite, i.e., all its eigenvalues are $\geq 0$. Therefore so is its inverse, therefore $\Cov{\vec{X},Y} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} \geq 0,ドル and the expected squared error of the optimal linear predictor is $\leq \Var{Y},ドル with equality only if the vector of covariances $\Cov{\vec{X}, Y}$ is $\vec{0},ドル or lies in the null space of $\Var{\vec{X}}^{-1}$ (assuming it has one). In general, the optimal linear predictor will reduce the expected squared error below $\Var{Y}$. (Incidentally, if we started from the last line, few would find it obvious that it is $\geq 0,ドル but since it is the expectation of a square, and a variance, it must be.)
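For concreteness, here is a quick numerical check of these formulas --- a sketch in Python, with a made-up mean vector and covariance matrix for $(\vec{X}, Y)$; the particular numbers, and the choice of a Gaussian for the Monte Carlo comparison, are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up moments for (X1, X2, Y); the last row/column belongs to Y
mean = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 1.0],
                  [0.5, 1.0, 0.8],
                  [1.0, 0.8, 3.0]])

var_X, cov_XY, var_Y = Sigma[:2, :2], Sigma[:2, 2], Sigma[2, 2]

beta = np.linalg.solve(var_X, cov_XY)        # Var[X]^{-1} Cov[X, Y]
beta0 = mean[2] - beta @ mean[:2]            # E[Y] - beta . E[X]
mse_formula = var_Y - cov_XY @ beta          # Var[Y] - Cov[X,Y] . Var[X]^{-1} Cov[X,Y]

# Monte Carlo check of the expected squared error of the optimal linear predictor
draws = rng.multivariate_normal(mean, Sigma, size=200_000)
X, Y = draws[:, :2], draws[:, 2]
mse_mc = np.mean((Y - (beta0 + X @ beta)) ** 2)

print(mse_formula, mse_mc)                   # agree up to Monte Carlo error
```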
Define $\epsilon \equiv Y - (\beta_0 + \vec{\beta} \cdot \vec{X})$ as the random error of this prediction. We have just worked out $\Expect{\epsilon^2}$. We can also establish that $\Expect{\epsilon} = 0,ドル and that $\Cov{X_j, \epsilon} = 0$ for every coordinate $X_j$ of $\vec{X}$: the error of the optimal linear predictor has mean zero and is uncorrelated with each of the predictor variables. Both facts follow from the formulas for $\beta_0$ and $\vec{\beta},ドル not from any additional assumptions.
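To spell out the second of these (the first is immediate from the formula for $\beta_0$ above): \begin{eqnarray} \Cov{\vec{X}, \epsilon} & = & \Cov{\vec{X}, Y - \beta_0 - \vec{\beta} \cdot \vec{X}}\\ & = & \Cov{\vec{X}, Y} - \Var{\vec{X}} \vec{\beta}\\ & = & \Cov{\vec{X}, Y} - \Var{\vec{X}} \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} = \vec{0} \end{eqnarray} using the fact that covariances with the constant $\beta_0$ vanish.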
At no point in any of this have we had to assume: that the true relationship between $Y$ and $\vec{X}$ is linear; that anything has a Gaussian distribution; that the error $\epsilon$ is independent of $\vec{X}$ (we derived that it is uncorrelated with $\vec{X},ドル which is weaker); that $\epsilon$ has constant variance; or indeed anything about the joint distribution of $(\vec{X}, Y)$ beyond the existence of the first and second moments appearing in the formulas.
Over years and disciplines, this circle of ideas has been re-discovered, or re-purposed, many times. Some instances: linear regression estimated by ordinary least squares; the interpolation and extrapolation of stationary time series; the prediction of spatial and spatio-temporal processes; the extraction of a latent signal from noisy measurements of it; and the factor models of psychometrics.
In the applications to stochastic processes, it is not actually necessary, as far as this math goes, that the process be stationary. This however brings me to a point, which alert/suspicious readers may well have anticipated...
So far, all of this has been at the level of pure probability theory, where expectations, variances, covariances, and other properties of the true distribution are available to us --- no doubt written in characters of gold on marble tablets by a fiery Hand. In the real world of statistics, as opposed to the mythology of probability, we do not have the true distribution, or even any of its moments; what we have are estimates.
If we have observed many pairs of $\vec{X}$ and $Y,ドル and we assume their distributions are unchanging --- at least up to the second moments! --- then we can just find the sample means, sample variances and covariances, etc., and plug in. Remarkably (or obviously?) enough, this is the same estimate as just doing a linear regression of $Y$ on $\vec{X}$ and estimating by ordinary least squares [2]. It will be consistent when sample covariances converge on true covariances, which requires something rather weaker than independent observations.
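To see the coincidence numerically rather than take it on faith, here is an illustrative sketch; the data-generating process below is invented, and nothing hinges on its details.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 1000, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))          # correlated predictors
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + rng.normal(scale=2.0, size=n)

# Plug-in: replace true moments by sample moments in the formulas above
joint = np.cov(np.column_stack([X, y]), rowvar=False)           # (d+1) x (d+1)
beta = np.linalg.solve(joint[:d, :d], joint[:d, d])             # sample Var[X]^{-1} sample Cov[X, Y]
beta0 = y.mean() - beta @ X.mean(axis=0)

# Ordinary least squares on the same data
design = np.column_stack([np.ones(n), X])
ols = np.linalg.lstsq(design, y, rcond=None)[0]

print(np.allclose(ols, np.concatenate([[beta0], beta])))        # True
```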
If we are looking at a stationary stochastic process, we can of course try to estimate the covariances. Thus in the case of interpolating or extrapolating a time series, if we think the series is (weakly) stationary, we are committing ourselves to $\Cov{Z(t), Z(s)} = \gamma(|t-s|)$ for some autocovariance function $\gamma$. (Then of course $\Var{Z(t)} = \gamma(0)$.) Writing $\overline{z}$ for the sample mean, every pair of observations $z(t), z(s)$ thus gives us a point $(|t-s|, (z(t)-\overline{z})(z(s)-\overline{z}))$ which we can use to estimate $\gamma$. If all else fails, we could use nonparametric regression to do this. If we assume that $\gamma$ has a certain functional form, that will of course improve the efficiency of our estimate of $\gamma$. We can do the same sort of thing for spatial and spatio-temporal processes, where additional sorts of symmetry (isotropy, "separating" the spatio-temporal covariance function $\Gamma(h, \mathbf{b})$ into $\Gamma_{time}(h) \Gamma_{space}(\mathbf{b}),ドル etc.) provide further efficiency-enhancing constraints. Of course, if any of those assumptions are off, they are helping us converge efficiently to a systematically-wrong answer, but even that might be better, for prediction, than glacial convergence to the truth. (My interest in factor models for high-dimensional time series, for example, is not because I think those models are right.)
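As a sketch of what this looks like for a regularly-sampled series (the AR(1) process below is invented for illustration): every lag-$h$ pair of centered observations contributes one product, and averaging the products at each lag is the crudest possible regression of product on $|t-s|$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented stationary series: AR(1) around a mean of 5 with coefficient 0.8
n, phi = 5000, 0.8
z = np.empty(n)
z[0] = 5.0
for t in range(1, n):
    z[t] = 5.0 + phi * (z[t - 1] - 5.0) + rng.normal()

zbar = z.mean()

def gamma_hat(h):
    """Average the lag-h products of centered observations."""
    return np.mean((z[:n - h] - zbar) * (z[h:] - zbar))

est = np.array([gamma_hat(h) for h in range(6)])
truth = phi ** np.arange(6) / (1.0 - phi ** 2)   # autocovariance of this AR(1), for comparison
print(np.round(est, 2))
print(np.round(truth, 2))
```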
Stationarity thus helps prediction, because it helps us learn the covariances. If we were dealing with non-stationary processes where we knew something about how they are non-stationary, we could use that information instead. This is, of course, a tall order!
Everything I have said so far about estimating the covariances presumed that we can, in fact, simultaneously observe both $\vec{X}$ and $Y$. When the role of $Y$ is taken by a latent variable, however, this is harder to swallow. Take the case where we observe a time series $Z(t) = S(t) + N(t),ドル with everything stationary, and want to estimate or predict $S(t^*)$ from $(Z(t_1), Z(t_2), \ldots Z(t_d))$. We will need the variances and covariances $\Cov{Z(t_i), Z(t_j)}$; stationarity in principle lets us estimate those, as $\gamma(t_i - t_j)$. But we will also need $\Cov{Z(t_i), S(t^*)},ドル and even if we assume stationarity, how are we to get that, without any joint observations of $Z(t)$ and $S(t+h)$? If we assume that the noise process $N$ is uncorrelated with the state/signal process $S,ドル and that $N$ has no autocorrelation over time (it is "white noise"), then, for $h\neq 0$ \begin{eqnarray} \gamma(h) & = & \Cov{Z(t), Z(t+h)} \\ & = & \Cov{S(t)+N(t), S(t+h) + N(t+h)}\\ & = & \Cov{S(t), S(t+h)} + \Cov{N(t), S(t+h)} + \Cov{S(t), N(t+h)} + \Cov{N(t), N(t+h)}\\ & = & \Cov{S(t), S(t+h)} \end{eqnarray} I bring this up not because we need the autocovariance function of the signal, but because, similarly, \begin{eqnarray} \Cov{Z(t), S(t+h)} & = & \Cov{S(t) + N(t), S(t+h)}\\ & = & \Cov{S(t), S(t+h)} = \gamma(h) \end{eqnarray} That is, if we assume that we are dealing with white noise, we can learn all the necessary covariances. But if we made a different assumption about the noise (perhaps that it has exponentially-decaying autocovariance), we will get different estimates. Worse, there really are not many reliable ways of checking assumptions about the noise, precisely because $S(t)$ is latent. Or, rather, assumption-checking is hard if all we have is the time series $Z(t)$. We might, for some physical measurement processes, try to actually study how measurement works, what distribution of observations we get when we impose known signals (= sample-paths of $S$), etc., but that work, while scientifically important, is outside the scope of pure statistical methodology...
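Here is a sketch of the whole procedure, with invented numbers: the latent signal below is an AR(1), the noise is white, and $S(t^*)$ is predicted linearly from the neighboring $Z$'s, excluding $Z(t^*)$ itself so that every covariance we need is a $\gamma(h)$ with $h \neq 0$ (and using $\overline{z}$ for $\Expect{S},ドル which is legitimate only if the noise has mean zero).

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented example: latent AR(1) signal S, white noise N, observed Z = S + N
n, phi, sigma_noise = 20000, 0.9, 1.0
s = np.empty(n)
s[0] = 0.0
for t in range(1, n):
    s[t] = phi * s[t - 1] + rng.normal()
z = s + rng.normal(scale=sigma_noise, size=n)

# Plug-in estimates of the mean and autocovariance function of Z
zbar = z.mean()
def gamma_hat(h):
    return np.mean((z[:n - h] - zbar) * (z[h:] - zbar))

# Predict S(t*) from the 2k nearest Z's, excluding Z(t*), so all lags are non-zero
k = 4
offsets = np.array([o for o in range(-k, k + 1) if o != 0])
K = np.array([[gamma_hat(abs(oi - oj)) for oj in offsets] for oi in offsets])  # Var of the Z's used
c = np.array([gamma_hat(abs(o)) for o in offsets])   # Cov[Z(t*+o), S(t*)], valid because o != 0
beta = np.linalg.solve(K, c)

t_star = np.arange(k, n - k)
z_neighbors = np.column_stack([z[t_star + o] for o in offsets])
s_hat = zbar + (z_neighbors - zbar) @ beta           # E[S] = E[Z] when the noise is mean-zero

print(np.mean((s[t_star] - s_hat) ** 2))             # error of the linear signal estimate
print(np.mean((s[t_star] - z[t_star]) ** 2))         # error of just reading off Z(t*), about sigma_noise**2
```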
Similarly with the psychometric situation, except worse. (We sadly lack standardized subjects with calibrated levels of narcissism, implicit racism, etc., who can be run through a new psychological test to estimate its loadings.) We can infer loadings from covariances (up to some here-unimportant symmetries), but doing so rests on some assumptions which are hard-to-impossible to check with just the observable test scores.
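To make the "infer loadings from covariances" step concrete, here is a toy sketch with a single common factor and invented loadings; the off-diagonal covariances are products of loadings, so ratios of covariances recover each loading up to sign --- but only because we have assumed the one-factor structure and uncorrelated specific variations in the first place.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented one-factor model: X_j = lambda_j * F + U_j, with F and all U_j uncorrelated
lam = np.array([0.9, 0.7, 0.5, 0.8])            # made-up loadings
psi = np.array([0.4, 0.6, 0.8, 0.5])            # made-up specific variances
n = 100_000
F = rng.normal(size=n)
X = np.outer(F, lam) + rng.normal(size=(n, len(lam))) * np.sqrt(psi)

C = np.cov(X, rowvar=False)                      # sample covariances of the observed scores

# Off-diagonal covariances are lambda_i * lambda_j, so e.g. C_12 C_13 / C_23 = lambda_1^2
lam_hat = np.sqrt(np.array([
    C[0, 1] * C[0, 2] / C[1, 2],
    C[0, 1] * C[1, 2] / C[0, 2],
    C[0, 2] * C[2, 3] / C[0, 3],
    C[0, 3] * C[1, 3] / C[0, 1],
]))
print(np.round(lam_hat, 2))                      # close to lam, i.e., recovered up to sign
```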
Now, any sort of statistical inference necessarily rests on assumptions. (They are what let us connect the data we have to other things we do not directly see, rather than just saying "Yes, that's the data alright".) But there are times when the assumptions can fade into the background, and other times when the math (so to speak) rubs them in our faces. I find it interesting that inference-to-latents is one of the latter situations.
[1]: If you insist on taking the derivative of the square with respect to \( b_0 \), you will find that it vanishes at the value of \( b_0 \) which zeroes out the square. (Because $\frac{dx^2}{dx} = 2x,ドル which is zero at the origin.) ^
[2]: Here is a quick way to convince yourself of this, if perhaps not a very intuitive one. The argument which led to expressing the optimal linear predictor entirely in terms of the first and second moments of the joint distribution of $\vec{X}$ and $Y$ was completely indifferent to what that distribution might be. Therefore, if we have a bunch of observation pairs $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots (\vec{x}_n, y_n),ドル we can apply the argument to the resulting empirical distribution. Expectations under the empirical distribution are sample means, and variances and covariances under the empirical distribution are sample variances and covariances. Since (i) those "plug-in" estimates minimize the in-sample mean squared error (by the argument above), and (ii) OLS estimates also minimize the in-sample MSE (that is how we derive OLS!), and (iii) the minimizer of the MSE is unique (exercise!), we can conclude that the plug-in estimates and OLS must coincide. If you are skeptical, you will find it enlightening to write the OLS formula as $\hat{\beta} = (\frac{1}{n}\mathbf{x}^T\mathbf{x})^{-1}(\frac{1}{n}\mathbf{x}^T\mathbf{y}),ドル with the traditional column of 1s in $\mathbf{x},ドル and explicitly work everything out for one or two regressors (i.e., for \( d=1 \) and \(d = 2\)), including inverting the matrix. ^
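For the record, the $d=1$ working that footnote [2] invites goes like this (nothing here beyond algebra on the sample means $\overline{x},ドル $\overline{y},ドル $\overline{xy},ドル $\overline{x^2}$): \[ \frac{1}{n}\mathbf{x}^T\mathbf{x} = \left[\begin{array}{cc} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{array}\right], \qquad \frac{1}{n}\mathbf{x}^T\mathbf{y} = \left[\begin{array}{c} \overline{y} \\ \overline{xy} \end{array}\right] \] so \[ \hat{\beta} = \frac{1}{\overline{x^2} - \overline{x}^2} \left[\begin{array}{cc} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{array}\right] \left[\begin{array}{c} \overline{y} \\ \overline{xy} \end{array}\right] = \left[\begin{array}{c} \overline{y} - \overline{x} \, \frac{\overline{xy} - \overline{x}\,\overline{y}}{\overline{x^2} - \overline{x}^2} \\ \frac{\overline{xy} - \overline{x}\,\overline{y}}{\overline{x^2} - \overline{x}^2} \end{array}\right] \] after a little simplification of the first component. The second component is the sample covariance of $x$ and $y$ over the sample variance of $x,ドル and the first is $\overline{y}$ minus that ratio times $\overline{x}$ --- exactly the plug-in versions of $\beta_1$ and $\beta_0$ from the $d=1$ case above.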
I have no conclusions here.