
Mallows's Cp

From Wikipedia, the free encyclopedia
Statistic used in model selection

In statistics, Mallows's $C_p$,[1][2] named for Colin Lingwood Mallows, is used to assess the fit of a regression model that has been estimated using ordinary least squares. It is applied in the context of model selection, where a number of predictor variables are available for predicting some outcome, and the goal is to find the best model involving a subset of these predictors. A small value of $C_p$ means that the model is relatively precise.

Mallows's $C_p$ is 'essentially equivalent'[3] to the Akaike information criterion in the case of linear regression. This equivalence is only asymptotic; Akaike[4] notes that $C_p$ requires some subjective judgment in the choice of the variance estimate associated with each response in the linear model (typically denoted $\hat{\sigma}^2$).

Definition and properties

Mallows's $C_p$ addresses the issue of overfitting, in which model selection statistics such as the residual sum of squares always get smaller as more variables are added to a model. Thus, if we aim to select the model giving the smallest residual sum of squares, the model including all variables would always be selected. Instead, the $C_p$ statistic calculated on a sample of data estimates the sum of squared prediction errors (SSPE) as its population target

$$E\left[\sum_{i}\left(\hat{Y}_{i} - E(Y_{i}\mid X_{i})\right)^{2}\right] / \sigma^{2},$$

where $\hat{Y}_i$ is the fitted value from the regression model for the ith case, $E(Y_i \mid X_i)$ is the expected value for the ith case, and $\sigma^2$ is the error variance (assumed constant across the cases). The mean squared prediction error (MSPE) will not automatically get smaller as more variables are added. The optimum model under this criterion is a compromise influenced by the sample size, the effect sizes of the different predictors, and the degree of collinearity between them.

If p regressors are selected from a set of k regressors, with k > p, the $C_p$ statistic for that particular set of regressors is defined as:

$$C_{p} = \frac{SSE_{p}}{S^{2}} - N + 2(p+1),$$

where

  • $SSE_{p} = \sum_{i=1}^{N} (Y_{i} - \hat{Y}_{pi})^{2}$ is the error sum of squares for the model with p regressors,
  • $\hat{Y}_{pi}$ is the predicted value of the ith observation of Y from the p regressors,
  • $S^2$ is an estimate of the residual variance after regression on the complete set of k regressors, which can be computed as $\frac{1}{N-k} \sum_{i=1}^{N} (Y_{i} - \hat{Y}_{i})^{2}$,[1]
  • and N is the sample size.
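The definition above can be sketched numerically. The following is a minimal Python illustration on simulated data (all variable names and the simulation setup are illustrative, not from the source): it fits each candidate subset by ordinary least squares, estimates $S^2$ from the full model with all k regressors, and evaluates $C_p$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations, k = 3 candidate regressors,
# of which only the first two actually affect Y.
N, k = 50, 3
X = rng.normal(size=(N, k))
Y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=N)

def sse(cols):
    """Error sum of squares from an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(N), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    r = Y - A @ beta
    return float(r @ r)

# S^2: residual-variance estimate SSE_k / (N - k) from the full model,
# following the estimator given in the bullet list above.
S2 = sse(range(k)) / (N - k)

def mallows_cp(cols):
    """C_p = SSE_p / S^2 - N + 2(p + 1) for the subset `cols`."""
    p = len(cols)
    return sse(cols) / S2 - N + 2 * (p + 1)

for cols in ([0], [0, 1], [0, 1, 2]):
    print(cols, mallows_cp(cols))
```

Note that for the full model the statistic is exactly $SSE_k/S^2 - N + 2(k+1) = k + 2$ by construction, since $S^2$ is computed from that same fit; subsets are judged against that baseline, and a subset with no appreciable bias should come out near p.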

Alternative definition

Given a linear model such as:

$$Y = \beta_{0} + \beta_{1}X_{1} + \cdots + \beta_{p}X_{p} + \varepsilon$$

where:

  • $\beta_0, \ldots, \beta_p$ are coefficients for predictor variables $X_1, \ldots, X_p$,
  • $\varepsilon$ represents the error term.

An alternative version of $C_p$ can also be defined as:[5]

$$C_{p} = \frac{1}{N}\left(\operatorname{RSS} + 2p\hat{\sigma}^{2}\right)$$

where

  • RSS is the residual sum of squares on a training set of data,
  • p is the number of predictors,
  • and $\hat{\sigma}^2$ is an estimate of the variance associated with each response in the linear model (estimated on a model containing all predictors).

Note that this version of $C_p$ does not give values equivalent to the earlier version, but the model with the smallest $C_p$ from this definition will also be the model with the smallest $C_p$ from the earlier definition: when the same full-model variance estimate $\hat{\sigma}^2 = S^2$ is used in both formulas, this version equals $\frac{\hat{\sigma}^2}{N}(C_p + N - 2)$, an increasing affine function of the earlier one, so the two orderings of subsets coincide.
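That the two definitions pick out the same minimizing model can be checked numerically. The sketch below (simulated data; names and setup are illustrative) computes both versions over several subsets using the same full-model variance estimate and confirms that they select the same subset.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 40, 3
X = rng.normal(size=(N, k))
Y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=N)

def rss(cols):
    """Residual sum of squares from an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(N), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    r = Y - A @ beta
    return float(r @ r)

# Shared variance estimate from the model containing all k predictors.
sigma2 = rss(range(k)) / (N - k)

subsets = [[0], [1], [0, 1], [0, 2], [0, 1, 2]]
cp_v1 = [rss(s) / sigma2 - N + 2 * (len(s) + 1) for s in subsets]  # first definition
cp_v2 = [(rss(s) + 2 * len(s) * sigma2) / N for s in subsets]      # alternative definition

# cp_v2 = (sigma2 / N) * (cp_v1 + N - 2): an increasing affine map of cp_v1,
# so both definitions rank the candidate subsets identically.
best_v1 = subsets[int(np.argmin(cp_v1))]
best_v2 = subsets[int(np.argmin(cp_v2))]
print(best_v1, best_v2)
```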

Limitations

The C p {\displaystyle C_{p}} {\displaystyle C_{p}} criterion suffers from two main limitations[6]

  1. the C p {\displaystyle C_{p}} {\displaystyle C_{p}} approximation is only valid for large sample size;
  2. the ' C p {\displaystyle C_{p}} {\displaystyle C_{p}} cannot handle complex collections of models as in the variable selection (or feature selection) problem.[6]

Practical use

The C p {\displaystyle C_{p}} {\displaystyle C_{p}} statistic is often used as a stopping rule for various forms of stepwise regression. Mallows proposed the statistic as a criterion for selecting among many alternative subset regressions. Under a model not suffering from appreciable lack of fit (bias), C p {\displaystyle C_{p}} {\displaystyle C_{p}} has expectation nearly equal to p; otherwise the expectation is roughly P plus a positive bias term. Nevertheless, even though it has expectation greater than or equal to p, there is nothing to prevent Cp < p or even C p < 0 {\displaystyle C_{p}<0} {\displaystyle C_{p}<0} in extreme cases. It is suggested that one should choose a subset that has C p {\displaystyle C_{p}} {\displaystyle C_{p}} approaching p,[7] from above, for a list of subsets ordered by increasing p. In practice, the positive bias can be adjusted for by selecting a model from the ordered list of subsets, such that C p < 2 p {\displaystyle C_{p}<2p} {\displaystyle C_{p}<2p}.

Since the sample-based $C_p$ statistic is an estimate of the MSPE, using $C_p$ for model selection does not completely guard against overfitting. For instance, it is possible that the selected model will be one in which the sample $C_p$ was a particularly severe underestimate of the MSPE.

Model selection statistics such as $C_p$ are generally not used blindly; rather, information about the field of application, the intended use of the model, and any known biases in the data are taken into account in the process of model selection.

References

  1. ^ a b Mallows, C. L. (1973). "Some Comments on CP". Technometrics. 15 (4): 661–675. doi:10.2307/1267380. JSTOR 1267380.
  2. ^ Gilmour, Steven G. (1996). "The interpretation of Mallows's Cp-statistic". Journal of the Royal Statistical Society, Series D. 45 (1): 49–56. JSTOR 2348411.
  3. ^ Hirotugu Akaike (1973). "Information Theory and an Extension of the Maximum Likelihood Principle". Proceedings of the Second International Symposium on Information Theory: 267–281. Wikidata Q134962967.
  4. ^ Hirotugu Akaike (December 1974). "A New Look at the Statistical Model Identification". IEEE Transactions on Automatic Control. 19 (6): 716–723. doi:10.1109/TAC.1974.1100705. ISSN 0018-9286. Zbl 0314.62039. Wikidata Q26778401.
  5. ^ Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer Science+Business Media. doi:10.1007/978-1-4614-7138-7. ISBN 978-1-4614-7137-0. LCCN 2013936251. OCLC 1004563473. OL 26184759M. Zbl 1281.62147. Wikidata Q21473973.
  6. ^ a b Giraud, C. (2015), Introduction to high-dimensional statistics, Chapman & Hall/CRC, ISBN 9781482237948
  7. ^ Daniel, C.; Wood, F. (1980). Fitting Equations to Data (Rev. ed.). New York: Wiley & Sons, Inc.
