The Minimum Description Length Principle (MDL)
Last update: 21 Apr 2025 21:17
First version:
MDL is an information-theoretic
approach to machine learning,
or statistical model selection, which
basically says you should pick the model which gives you the most compact
description of the data, including the description of the model itself. More
precisely, given a probabilistic model, Shannon's coding theorems tell you the
minimal number of bits needed to encode your data, i.e., the maximum extent to
which it can be compressed. Really, however, to complete the description, you
need to specify the model as well, from among some set of alternatives, and
this will also require a certain number of bits. Hence you really want to
minimize the combined length of the description of the model, plus the
description of the data under that model. This works out to being a kind of
penalized maximum likelihood — the data-given-model bit is the negative
log likelihood, and the model-description term is the penalty.
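For concreteness, here is a minimal sketch of the two-part recipe, my own toy example rather than anything canonical: choosing the degree of a polynomial regression by minimizing the total code length, with the crude (but standard) (k/2) log2 n bits charged for the k fitted parameters. Any such choice of parameter coding is exactly the sort of thing I grumble about below.

    # Two-part MDL sketch: total description length = L(model) + L(data | model).
    # L(data | model) is the negative Gaussian log-likelihood in bits (the Shannon
    # code length under the fitted model); L(model) is taken to be (k/2) log2 n
    # bits for k fitted parameters, one crude coding choice among many.
    import numpy as np

    def description_length(x, y, degree):
        n = len(y)
        k = degree + 1                      # number of fitted coefficients
        coeffs = np.polyfit(x, y, degree)   # maximum-likelihood (least-squares) fit
        resid = y - np.polyval(coeffs, x)
        sigma2 = np.mean(resid ** 2)        # MLE of the noise variance
        # L(data | model): negative log-likelihood, converted from nats to bits
        nll_bits = 0.5 * n * (np.log2(2 * np.pi * sigma2) + 1 / np.log(2))
        # L(model): (k/2) log2 n bits for the parameters
        model_bits = 0.5 * k * np.log2(n)
        return nll_bits + model_bits

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 200)
    y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.3, size=x.size)
    best = min(range(6), key=lambda d: description_length(x, y, d))
    print("degree chosen by two-part MDL:", best)   # typically recovers degree 2

(Unsurprisingly, with this particular parameter coding the selection rule amounts to the same thing as BIC.)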
It's a very appealing idea, and a lot of work has (rightly) been done under
this heading, though I have to say I'm not altogether convinced, both because
of the issues involved in choosing a coding scheme for models, and because it's
not clear that, in practice, it actually does that much better than
other penalization schemes — sometimes, even, than straightforward
likelihood maximization.
I should also say that what I've described above is the old-fashioned
"two-part" MDL, and there are now "one-part" schemes, where (so to speak) the
model coding scheme is supposed to be fixed by the data as well, in unambiguous
and nearly-optimal ways. These are remarkably ingenious constructions,
intimately linked to low-regret
learning, as well as minimax statistics more generally, but I will not
attempt to describe them here; see rather Grünwald.
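(Just to fix ideas, though: the basic one-part object is Rissanen's normalized maximum likelihood (NML) code, whose length is the "stochastic complexity" that several of the references below turn on. Here is a toy sketch of my own for the Bernoulli family, computing the normalizer by brute force; the log of that normalizer is the "parametric complexity", and the regret of the NML code equals it for every sequence, hence the minimax connection.)

    # One-part (NML) code for the Bernoulli model class: code length of a binary
    # string = its maximized log-loss plus the log of a normalizer summing the
    # maximized likelihoods of all strings of the same length.
    from math import comb, log2

    def ml_prob(k, n):
        """Maximized likelihood of one particular binary string with k ones out of n."""
        if k in (0, n):
            return 1.0
        p = k / n
        return p ** k * (1 - p) ** (n - k)

    def parametric_complexity(n):
        """log2 of the NML normalizer, summed over all 2^n strings of length n."""
        return log2(sum(comb(n, k) * ml_prob(k, n) for k in range(n + 1)))

    def stochastic_complexity(x):
        """NML code length, in bits, of the binary sequence x."""
        n, k = len(x), sum(x)
        return -log2(ml_prob(k, n)) + parametric_complexity(n)

    print(stochastic_complexity([1, 0, 1, 1, 0, 1, 1, 1]))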
See also:
Algorithmic Information Theory;
Occam's Razor;
Universal Prediction Algorithms
Recommended, details and applications:
- Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data
Mining and Knowledge Discovery 3 (1999) [Online]
- Peter Grünwald and John Langford, "Suboptimal behaviour of Bayes and MDL in
classification under misspecification", math.ST/0406221 = Machine Learning
66 (2007): 119--149
- Peter T. Hraber, Bette T. Korber, Steven Wolinsky, Henry Erlich and
Elizabeth Trachtenberg, "HLA and HIV Infection Progression: Application of the
Minimum Description Length Principle to Statistical Genetics", SFI
Working Paper 03-04-23
- Shane Legg, "Is There an Elegant Universal Theory of
Prediction?", cs.AI/0606070 [A
nice set of diagonalization arguments against the hope of a universal
prediction scheme which has the nice features of Solomonoff-style induction,
but is actually computable.]
- Beong Soo So, "Maximized log-likelihood updating and model
selection", Statistics and Probability Letters 64
(2003): 293--303 [Shows how to relate some of Rissanen's ideas on predictive
MDL to more conventionally-statistical notions, e.g., connecting Rissanen's
"stochastic complexity" to something that looks like, but isn't quite, a Fisher
information.]
- Paul M. B. Vitanyi and Ming Li, "Minimum Description Length
Induction, Bayesianism, and Kolmogorov Complexity", IEEE Transactions on
Information Theory 46 (2000): 446--464 = cs.LG/9901014
Not recommended:
- Dana Ballard, An Introduction to Natural Computation
[Review: Not Natural Enough]
To read (with thanks to David Dowe for suggestions):
- Pieter Adriaans and Paul Vitanyi, "The Power and Perils of MDL",
cs.LG/0612095
- Joshua W. Comley and David L. Dowe, "Minimum Message Length and Generalized Bayesian Nets with Asymmetric Languages", in Grünwald et al.
- The Computer Journal, 42:4 (1999)
[Special issue on Kolmogorov complexity and inference]
- Steven de Rooij and Peter Grünwald, "An Empirical Study of MDL
Model Selection with Infinite Parametric Complexity", cs.LG/0501028
- Dowe, Korb and Oliver (eds.), Information, Statistics and
Induction in Science
- M. Drmota and W. Szpankowski, "Precise minimax redundancy and
regret", IEEE Transactions on Information Theory 50 (2004): 2686--2707
- Dean P. Foster and Robert A. Stine, "Local Asymptotic Coding and the Minimum
Description Length", IEEE Transactions on Information Theory 45 (1999):
1289--1293 [PDF preprint]
- Ciprian Doru Giurcăneanu and Jorma Rissanen, "Estimation of AR and ARMA
models by stochastic complexity", math.ST/0702765
- Peter D. Grünwald, In Jae Myung and Mark A. Pitt (eds.),
Advances in Minimum Description Length: Theory and Applications
- Marcus Hutter
- "General Loss Bounds for Universal Sequence Prediction,"
cs.AI/0101019
- "On Generalized Computable Universal Priors and their
Convergence", cs.LG/0503026
- "Optimal Sequential Decisions based on Algorithmic
Probability," cs/0306091
- "Towards a Universal Theory of Artificial
Intelligence based on Algorithmic Probability and Sequential Decision Theory,"
cs.AI/0012011
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive
Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory
50 (2004): 2708--2726
- Daniel J. Navarro, "A Note on the Applied Use of MDL Approximations", Neural
Computation 16 (2004): 1763--1768
- S. L. Needham and D. L. Dowe, "Message Length as an Effective
Ockham's Razor in Decision Tree Induction", pp. 253--260 in AI+STATS
2001 [Available from Prof. Dowe in several formats]
- Jan Poland and Marcus Hutter, "Asymptotics of Discrete MDL for Online
Prediction", IEEE Transactions on Information Theory 51 (2005): 3780--3795,
arxiv:cs.IT/0506022
- Guoqi Qian and Hans R. Künsch, "Some Notes on Rissanen's
Stochastic Complexity", Tech. Report 79, Seminar fur Statistik, ETH-Zurich
(1996) [Online but I've mislaid the URL at the moment]
- Jorma Rissanen, Lectures on Statistical Modeling
Theory [PDF]
- Teemu Roos, Petri Myllymäki and Jorma Rissanen, "MDL Denoising
Revisited", cs.IT/0609138
- Michael Small, "Optimal time delay embedding for nonlinear time
series modeling", nlin.CD/0312011
- Michael Small and C. K. Tse, "Optimal embedding parameters: A
modeling paradigm", physics/0308114
- Nikolai Vereshchagin and Paul Vitanyi, "Kolmogorov's Structure
Functions with an Application to the Foundations of Model Selection,"
cs.CC/0204037
- C. S. Wallace, Statistical and Inductive Inference by Minimum
Message Length
- C. S. Wallace and David L. Dowe, "Minimum Message Length and Kolmogorov Complexity", The Computer Journal 42 (1999): 270--283
- Bin Yu, Minimum Description Length Principle
[A page collecting references to Prof. Yu's work on MDL, largely viewing it
as a rather conventional form of model
selection]