Post-Model-Selection Inference
Last update: 21 Aug 2025 11:53
First version: 14 August 2019
Model selection, in statistics, means
using your data to pick the correct statistical model, or at least a good one.
Often we're interested in doing statistical inference with the selected model:
we might want confidence sets
for its parameters (or for functions of them), we might want to attach measures of uncertainty
to its predictions, etc. The difficulty is that we usually calculate the
properties of our inferential procedures on the assumption of a fixed
model, as though the right model
were communicated to us by the angels. When instead it's something we've
selected using the data, there are going to be problems.
The easiest way to see this may be to reflect that our data are random
(that's why we're doing statistics), so which model we get from our model
selection is also random (at least a little), and this will create
correlations between the selected model and the outputs of statistical
tests. If we're doing regression and we've used model selection to
pick which variables are included as
regressors, of course the selected variables are going to look
significant on the data we used to pick them! (Thus the classic Freedman,
1983, which never fails to make a mind-blowing assignment for undergrads.) The
whole rest of this subject is essentially refining this basic observation.
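
To see the basic observation in action, here is a minimal simulation sketch (Python, using numpy and statsmodels; the sample size, number of regressors, and the 0.25 screening threshold are my own illustrative choices, not necessarily Freedman's exact setup): everything is pure noise, yet the refitted, post-screening regression tends to look impressively significant.

```python
# Toy version of the Freedman (1983) screening problem: regress pure noise on
# many pure-noise regressors, keep the "significant" ones, and refit on the
# same data. Sizes and the 0.25 threshold are illustrative choices.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, p = 100, 50
X = rng.standard_normal((n, p))   # regressors: pure noise
y = rng.standard_normal(n)        # response: pure noise, unrelated to X

# Stage 1: fit the full regression and screen regressors by p-value.
full = sm.OLS(y, sm.add_constant(X)).fit()
keep = np.where(full.pvalues[1:] < 0.25)[0]   # skip the intercept's p-value

# Stage 2: refit with only the screened regressors, on the SAME data.
refit = sm.OLS(y, sm.add_constant(X[:, keep])).fit()
print(f"kept {len(keep)} of {p} pure-noise regressors")
print(f"nominally significant at 5% after refitting: {(refit.pvalues[1:] < 0.05).sum()}")
print(f"refit R^2 = {refit.rsquared:.2f}, overall F-test p-value = {refit.f_pvalue:.3g}")
```
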
One direction of refinement is to try to develop new inferential procedures,
more or less approximate, which can compensate for the fact that our model was
picked in a data-dependent way. This is most of what gets called
"post-selection inference" or "post-model-selection inference" or "selective
inference". There is a lot of intricate theory here, often relying on clever
mathematical understanding of specific selection procedures and how they
interact with specific assumptions about the data-generating process.
The other direction is to attack the problem at its root: using the
same data for selection and inference creates correlations between them, so
use different data for selection and inference. This gets called
"data splitting" or "sample splitting". It's easy to do for IID data ---
divide your data set, at random, into two parts, do your selection on one part,
and then do the inference on the other, with no cross-contamination. (This is
close to, but not quite, cross-validation.)
Because the two parts are independent, the selected model is independent of the
data used for inference, so the usual procedures work with their usual
properties. Problem solved.
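
A matching sketch of the split-sample version of the same toy example (again Python; the 50/50 split, sizes, and screening threshold are my illustrative choices): on the held-out half, the selected pure-noise regressors no longer look systematically significant.

```python
# Toy version of the sample-splitting fix, in the same pure-noise setting as
# the sketch above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)        # still no real signal anywhere

# Randomly split the rows into a selection half and an inference half.
idx = rng.permutation(n)
select_rows, infer_rows = idx[:n // 2], idx[n // 2:]

# Selection half: choose variables however you like (here, p-value screening).
screen = sm.OLS(y[select_rows], sm.add_constant(X[select_rows])).fit()
keep = np.where(screen.pvalues[1:] < 0.25)[0]

# Inference half: ordinary OLS on data that played no role in the selection,
# so (under the usual linear-model assumptions) the usual p-values and
# confidence intervals behave as advertised for the selected model.
final = sm.OLS(y[infer_rows], sm.add_constant(X[infer_rows][:, keep])).fit()
print(f"kept {len(keep)} regressors on the selection half")
print(f"nominally significant at 5% on the inference half: "
      f"{(final.pvalues[1:] < 0.05).sum()}")
```

The 50/50 split here is just a convention; what matters is that the inference half never touches the selection step.
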
Sample splitting is a simple, radical, almost a-theoretical way to solve the
problem of post-selection inference, and as such it appeals to my temperament.
(This is why two of my students wrote their dissertations, in part, on how to
extend it to dependent data, where, alas, theory and subtlety re-enter.) With
all sincere respect to those working heroically on what I called the other
direction, I honestly don't know why the sample-splitting approach isn't
the default we all use.
Recommended (incorporating by reference the recommendations listed under model selection):
- Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao, "Valid post-selection inference", Annals of Statistics 41 (2013): 802--837
- Julian J. Faraway
- William Fithian, Dennis Sun, Jonathan Taylor, "Optimal Inference After Model Selection", arxiv:1410.2597
- David A. Freedman, "A Note on Screening Regression Equations", The American Statistician 37 (1983): 152--155
- Jason D. Lee, Dennis L. Sun, Yuekai Sun, Jonathan E. Taylor, "Exact post-selection inference, with application to the lasso", arxiv:1311.6238
- Hannes Leeb
- "Conditional Predictive Inference Post Model
Selection", Annals of Statistics 37 (2009):
2838--2876, arxiv:0908.3615
- "Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process", Bernoulli 14 (2008): 661--690,
arxiv:0802.3364
- Hannes Leeb and Benedikt M. Pötscher
- Alessandro Rinaldo, Larry Wasserman, Max G'Sell, Jing Lei, "Bootstrapping and Sample Splitting For High-Dimensional, Assumption-Free Inference", arxiv:1611.05401 [Disclaimer: All colleagues and friends]
Pride compels me to recommend:
- Robert Lunde, Bootstrapping and Sample Splitting Under Weak Dependence [Ph.D. thesis, CMU Statistics, 2018]
- Lawrence Wang, Network Comparisons using Sample Splitting [Ph.D. thesis, CMU Statistics, 2016]
To read:
- Alexandre Belloni, Victor Chernozhukov, Ivan Fernández-Val, Christian Hansen, "Program Evaluation and Causal Inference with High-Dimensional Data", Econometrica 85 (2017): 233--298, arxiv:1311.2645
- Yoav Benjamini, Marina Bogomolov, "Adjusting for selection bias in testing multiple families of hypotheses", arxiv:1106.3670
- Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research 11 (2010): 2079--2107
- Victor Chernozhukov, Christian Hansen, Martin Spindler, "Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach", Annual Review of Economics 7 (2015): 649--688, arxiv:1501.03430
- Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, "Preserving Statistical Validity in Adaptive Data Analysis", arxiv:1411.2664
- Karl Ewald, Ulrike Schneider, "Uniformly Valid Confidence Sets Based on the Lasso", Electronic Journal of Statistics 12 (2018): 1358--1387, arxiv:1507.05315
- Paul Kabaila and Khageswor Giri, "Upper bounds on the minimum coverage probability of confidence intervals in regression after variable selection", arxiv:0711.0993
- Benedikt M. Pötscher [I've heard Prof. Pötscher talk about this work multiple times now at conferences, but I still need to really master it]
- "The distribution of model averaging
estimators and an impossibility result regarding its estimation", arxiv:math/0702781
- "Confidence sets based on sparse estimators
are necessarily large", arxiv:0711.1036
- Yoshikazu Terada, Hidetoshi Shimodaira, "Selective inference after variable selection via multiscale bootstrap", arxiv:1905.10573 [I presume they have an answer to "why not just use sample splitting?"]
- Xiaoying Tian, Jonathan Taylor, "Asymptotics of selective inference", arxiv:1501.03588
- Tijana Zrnic, Michael I. Jordan, "Post-Selection Inference via Algorithmic Stability", arxiv:2011.09462