Notebooks

The Minimum Description Length Principle (MDL)

Last update: 21 Apr 2025 21:17
First version:

MDL is an information-theoretic approach to machine learning, or statistical model selection, which basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself. More precisely, given a probabilistic model, Shannon's coding theorems tell you the minimal number of bits needed to encode your data, i.e., the maximum extent to which it can be compressed. Really, however, to complete the description, you need to specify the model as well, from among some set of alternatives, and this will also require a certain number of bits. Hence you really want to minimize the combined length of the description of the model, plus the description of the data under that model. This works out to being a kind of penalized maximum likelihood — the data-given-model bit is the negative log likelihood, and the model-description term is the penalty.
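To make the penalized-likelihood reading concrete, here is a minimal sketch (my own illustration, not a canonical implementation) of two-part model selection: it picks a polynomial degree by minimizing a crude two-part code length, counting roughly (1/2) log2 n bits per fitted parameter plus the negative log-likelihood of the data in bits. Under that particular, asymptotically motivated coding of the parameters, the criterion comes out to essentially BIC measured in bits.

    # A minimal sketch of two-part MDL model selection (illustrative only).
    # Pick a polynomial degree by minimizing (bits to describe the model)
    # + (bits to describe the data given the model). Coding each fitted
    # parameter to precision ~1/sqrt(n) costs about (1/2) log2(n) bits,
    # so this particular two-part code length is essentially BIC in bits.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x = np.linspace(-1, 1, n)
    y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.5, size=n)  # true degree is 2

    def two_part_description_length(x, y, degree):
        """Bits to encode a degree-`degree` polynomial model, plus the data under it."""
        n = len(y)
        k = degree + 2                       # polynomial coefficients + noise variance
        coeffs = np.polyfit(x, y, degree)    # maximum-likelihood fit (least squares)
        resid = y - np.polyval(coeffs, x)
        sigma2 = np.mean(resid**2)           # ML estimate of the noise variance
        # Data given model: Gaussian negative log-likelihood at the MLE, in bits.
        nll_nats = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        data_bits = nll_nats / np.log(2)
        # Model: each parameter coded to precision ~1/sqrt(n), ~(1/2) log2(n) bits apiece.
        model_bits = 0.5 * k * np.log2(n)
        return model_bits + data_bits

    degrees = range(6)
    lengths = [two_part_description_length(x, y, d) for d in degrees]
    best = min(degrees, key=lambda d: lengths[d])
    for d, L in zip(degrees, lengths):
        marker = " <- shortest description" if d == best else ""
        print(f"degree {d}: {L:8.1f} bits{marker}")

On data like this, the shortest total description is typically at the true degree: lower degrees pay heavily in data bits (large residuals), higher degrees save almost nothing in data bits while paying for extra parameters.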

It's a very appealing idea, and a lot of work has (rightly) been done under this heading, though I have to say I'm not altogether convinced, both because of the issues involved in choosing a coding scheme for models, and because it's not clear that, in practice, it actually does that much better than other penalization schemes, or sometimes even than straightforward likelihood maximization.

I should also say that what I've described above is the old-fashioned "two-part" MDL, and there are now "one-part" schemes, where (so to speak) the model coding scheme is supposed to be fixed by the data as well, in unambiguous and nearly-optimal ways. These are remarkably ingenious constructions, intimately linked to low-regret learning, as well as minimax statistics more generally, but I will not attempt to describe them here; see rather Grünwald.
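For orientation only (this is the standard normalized-maximum-likelihood construction from the literature, e.g. Grünwald's book, not something defined above): given a parametric family {p(. | theta)}, the one-part code for data x^n is built from the distribution

    \bar{p}_{\mathrm{NML}}(x^n)
      = \frac{p\left(x^n \mid \hat{\theta}(x^n)\right)}
             {\sum_{y^n} p\left(y^n \mid \hat{\theta}(y^n)\right)},
    \qquad
    -\log \bar{p}_{\mathrm{NML}}(x^n)
      = -\log p\left(x^n \mid \hat{\theta}(x^n)\right)
        + \log \sum_{y^n} p\left(y^n \mid \hat{\theta}(y^n)\right),

where \hat{\theta}(x^n) is the maximum-likelihood estimate. The second term, the log of the normalizer (the "parametric complexity"), depends only on the model class and the sample size, and takes over the role of the model-description penalty in the two-part version; this code achieves the minimax regret over the class, which is the sense in which it is nearly optimal.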

See also: Algorithmic Information Theory; Occam's Razor; Universal Prediction Algorithms

    Recommended, details and applications:
  • Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online]
  • Peter Grünwald and John Langford, "Suboptimal behaviour of Bayes and MDL in classification under misspecification", math.ST/0406221 = Machine Learning 66 (2007): 119--149
  • Peter T. Hraber, Bette T. Korber, Steven Wolinsky, Henry Erlich and Elizabeth Trachtenberg, "HLA and HIV Infection Progression: Application of the Minimum Description Length Principle to Statistical Genetics", SFI Working Paper 03-04-23
  • Shane Legg, "Is There an Elegant Universal Theory of Prediction?", cs.AI/0606070 [A nice set of diagonalization arguments against the hope of a universal prediction scheme which has the nice features of Solomonoff-style induction, but is actually computable.]
  • Beong Soo So, "Maximized log-likelihood updating and model selection", Statistics and Probability Letters 64 (2003): 293--303 [Shows how to relate some of Rissanen's ideas on predictive MDL to more conventionally-statistical notions, e.g., connecting Rissanen's "stochastic complexity" to something that looks like, but isn't quite, a Fisher information.]
  • Paul M. B. Vitanyi and Ming Li, "Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity", IEEE Transactions on Information Theory 46 (2000): 446--464 = cs.LG/9901014
    Not recommended:
  • Dana Ballard, An Introduction to Natural Computation [Review: Not Natural Enough]

