Machine Learning, Statistical Inference and Induction
Last update: 21 Apr 2025 21:17
First version:
There's a place where AI, statistics and epistemology-methodology converge, or want to anyhow.
"Machine learning" is the AI label: how do we make a machine that can find and
learn the regularities in a data set? (If the data set is really, really big,
and we care mostly about making practically valuable predictions, this becomes
data mining, or "knowledge discovery in
databases," KDD.) The statisticians ask very similar questions about
model-fitting and hypothesis-testing. The epistemologists are mired in the
problem of induction, and "inference to the best explanation" (a phrase, I am
told by Kenny Easwaran, coined by Gilbert Harman; link below). The fields
overlap in the most crazy-quilt and arbitrary way: I've heard university
librarians arguing over whether specific books should go to the engineering or
the philosophy library, for instance.
The connection to neuroscience and cognitive science is plain: how on Earth do
human beings, and other critters, actually learn? Given that there are many
different strategies, which ones do organisms use, and why, and are they good
ones? (It's entirely possible that we've gotten locked in to inefficient
learning strategies; then the question becomes whether or not they can be
improved.) Studying learning by organisms lets us test theories of
learning-in-the-abstract, and vice versa: if we had, say, a good proof that a
certain learning scheme simply would not work, we'd know that animals
don't use it.
One fairly strong result seems to be that tabulae rasae don't work:
you've got to give the machine/baby/scientist some hints, or restrict
the field of possible hypotheses initially, or you'll never get anywhere. This
was at least implicit in Hume, and I believe the other
classical empiricists as well, but they don't seem to have been restrictive
enough to account for the way we actually do learn. Natural selection is the obvious candidate for
having restricted our hypothesis-set, and for having designed our learning
mechanisms.
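The blank-slate point can be made concrete with a toy sketch in Python. This is an illustration under stated assumptions, not something taken from the works listed below: the three-bit inputs, the `target` rule, the training set `TRAIN` and the "literal" hypothesis class are all hypothetical choices. A learner that entertains every Boolean function ends up exactly split on each unseen input, while a learner restricted to a small class can generalize.

```python
from itertools import product

N_BITS = 3
INPUTS = list(product([0, 1], repeat=N_BITS))  # all 8 possible inputs


def target(x):
    # Hypothetical "true" regularity: the label is just the first bit.
    return x[0]


# The learner sees labels for only half of the inputs.
TRAIN = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
UNSEEN = [x for x in INPUTS if x not in TRAIN]


def all_functions():
    """Blank slate: every Boolean function on N_BITS inputs, as a lookup table."""
    for outputs in product([0, 1], repeat=len(INPUTS)):
        yield dict(zip(INPUTS, outputs))


def literal_functions():
    """Restricted class: the label copies, or negates, a single input bit."""
    for i in range(N_BITS):
        for flip in (0, 1):
            yield {x: x[i] ^ flip for x in INPUTS}


def consistent(hypotheses):
    """Keep the hypotheses that agree with the target on every training input."""
    return [h for h in hypotheses if all(h[x] == target(x) for x in TRAIN)]


def vote_for_one(hypotheses, x):
    """Fraction of the surviving hypotheses that predict label 1 at input x."""
    return sum(h[x] for h in hypotheses) / len(hypotheses)


blank_slate = consistent(all_functions())
restricted = consistent(literal_functions())

blank_votes = {x: vote_for_one(blank_slate, x) for x in UNSEEN}
restricted_votes = {x: vote_for_one(restricted, x) for x in UNSEEN}

print(len(blank_slate), blank_votes)
print(len(restricted), restricted_votes)
```

In this toy setting the unrestricted learner's surviving hypotheses disagree evenly about every input it has not seen, so the training data tell it nothing about them; the restricted learner is left with a single hypothesis. Whether that hypothesis is right depends, of course, on whether the restriction suits the world, which is where something like natural selection would have to come in.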
My positivist temperament can hardly help
being pleased by this "attempt to introduce the experimental method of
reasoning into moral subjects," which, as data mining,
has massive industrial applications. My real
interest in this isn't, for once, philosophical. Instead, I want to be able to
quantify, or at the very least
characterize, self-organization, which
means I need a good way of automatically finding patterns or regularities in
data-sets. For someone who's got
the computational mechanics gospel,
this means "inferring statistical complexity," and that means the automated
construction of abstract-machine or formal-language models of data-sets.
(Alternately: Figuring out how natural things compute.) And doing that well
means addressing all the issues people in these areas address, so I figure I
ought to just steal from them.
Recommended, big picture:
- Leo Breiman, "Statistical Modeling: The Two Cultures",
Statistical
Science 16 (2001): 199--231 [Very much including
the discussion by others and the reply by Breiman. Thanks to
Chris Wiggins for alerting me to
this.]
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning,
and Games [Mini-review]
- Ulf Grenander, Elements of Pattern Theory
- David Hand, Heikki Mannila and Padhraic Smyth, Principles
of Data Mining
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The
Elements of Statistical Learning: Data Mining, Inference, and Prediction
[Website, with full text free in PDF]
- John H. Holland, Keith J. Holyoak,
Richard E. Nisbett, and Paul R. Thagard, Induction: Processes of
Inference, Learning and Discovery
[Review: The Best-Laid Schemes o' Mice an'
Men]
- Michael J. Kearns and Umesh V. Vazirani, An Introduction to
Computational Learning Theory
[Review: How to Build a Better
Guesser]
- Deborah G. Mayo, Error and the Growth of Experimental
Knowledge [How to use standard statistical tests to learn from
experiment, without Bayesian priors or other a priori folderol. Review: We Have Ways of Making You Talk, or, Long Live
Peircism-Popperism-Neyman-Pearson Thought!]
- Deborah G. Mayo and D. R. Cox, "Frequentist statistics as a theory
of inductive
inference", math.ST/0610846
- John Norton, "A Material Theory of Induction", Philosophy of Science 70 (2003): 647--670 [PDF reprint]
- Jorma Rissanen, Stochastic Complexity in Statistical
Inquiry [Review: Less Is
More, or, Ecce data!]
- Sara J. Shettleworth, Cognition, Evolution and
Behavior
- Peter Spirtes, Clark Glymour and Richard Scheines,
Causation, Prediction, and Search
- Chris Thornton, Truth from Trash: How Learning Makes
Sense [Well, half a recommendation. Review: Two Cheers for Trash]
- V. N. (=Vladimir Naumovich) Vapnik, The Nature of
Statistical Learning Theory [Review:
A Useful Biased Estimator]
- H. Peyton Young, Individual Strategy and Social
Structure [Pretty dumb agents nonetheless able to learn in a basic
sense, and what they can accomplish in the way of societies. Review: A Myopic (and Sometimes
Blind) Eye on the Main Chance, or, the Origins of Custom]
Recommended, close-ups:
- Shun-ichi Amari, "Information Geometry on Hierarchical
Decomposition of Stochastic Interactions," IEEE Transactions on
Information Theory 47 (2001): 1701--1711 [A way of finding
"parts" in complex distributions; uses many differential geometry tricks to
do statistics. PDF
reprint]
- Massimiliano Badino, "An Application of Information Theory to the
Problem of the Scientific
Experiment", Synthese 140 (2004): 355--389 [MS Word preprint.
See comments under Information Theory.]
- David Balduzzi
- H. B. Barlow, "Unsupervised Learning",
Neural Computation 1 (1989): 295--311
- Jonathan Baxter, "A Model of Inductive Bias Learning,"
Journal of Artificial Intelligence Research 12
(2000): 149--198 [How to learn what class of hypotheses you should be trying to
use, i.e., your inductive bias. Assumes independence, again.]
- Mikhail Belkin, Partha Niyogi, Vikas Sindhwani, "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples", Journal of Machine Learning Research 7 (2006): 2399--2434
- William Bialek, Ilya Nemenman, and Naftali Tishby,
"Predictability, Complexity and Learning," physics/0007070
- Ken Binmore, "Making Decisions in Large Worlds" ["This
paper argues that we need to look beyond Bayesian decision theory for an answer
to the general problem of making rational decisions under
uncertainty." PDF
manuscript; thanks to Nicolas Della Penna for the pointer]
- Margaret Boden, The Creative Mind: Myths and
Mechanisms [How and when to change the kind of representation you're
using, a topic shamefully neglected in the literature.
Precis]
- Josh Bongard and Hod Lipson, "Automated reverse engineering of
nonlinear dynamical
systems", Proceedings
of the National Academy of Sciences (USA) 104 (2007):
9943--9948 [Thanks to Chris Weed for pointing me to this. Interesting, but
basically unaware of the literature
on state-space reconstruction in
nonlinear dynamics.]
- R. B. Braithwaite, Scientific Explanation
- Arthur W. Burks, "Peirce's Theory of Abduction", Philosophy of Science 13 (1946): 301--306 [JSTOR; ungated copy]
- Venkat Chandrasekaran and Michael I. Jordan, "Computational and Statistical Tradeoffs via Convex Relaxation", Proceedings of the National Academy of Sciences (USA) 110
(2013): E1181--E1190, arxiv:1211.1073
- Pedro
Domingos
- "The Role of Occam's Razor in Knowledge Discovery," Data
Mining and Knowledge Discovery, 3 (1999) [Online]
- "A Few Useful Things to Know about Machine Learning"
[PDF preprint]
- Marco Dorigo and Marco Colombetti, Robot Shaping: An
Experiment in Behavior Engineering [Review: Crawling Towards the Light]
- John W. Fisher III, Alexander T. Ihler and Paul A. Viola,
"Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in
NIPS 12 (1999) [PDF
reprint. I'd call this more of a semi-parametric approach than a fully
non-parametric one; they assume a parametric form for the dependence structure,
but are agnostic about the distributions of innovations, and so try to maximize
non-parametrically estimated mutual informations.]
- Francois Fleuret and Donald Geman, "Stationary Features and Cat
Detection", Journal of
Machine Learning Research 9 (2008): 2549--2578
- Peter Godfrey-Smith, "Inductions, Samples, and Kinds"
[PDF preprint]
- David J. Hand, "Classifier Technology and the Illusion of Progress",
Statistical
Science 21 (2006):
1--15, math.ST/0606441
[Or: don't believe everything you read in ICML! With commentary, available
from the arxiv.org link]
- Hinton and Sejnowski (eds.), Unsupervised Learning
[A sort of "Neural Computation's Greatest Hits" compilation]
- Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan, "Information-theoretic generalization bounds for black-box learning algorithms", forthcoming in NeurIPS 2021, arxiv:2110.01584
- Tommi S. Jaakkola and David Haussler, "Exploiting generative models
in discriminative classifiers", NIPS 11 (1998)
[PDF]
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing
Attribute Interactions", cs.AI/0308002
- Kevin T. Kelly
[Kelly's work on Occam's Razor is, so far as I know, the only justification for it which doesn't either massively beg the question, change the subject, or make
massive assumptions about the nature of the world, Divine Providence, etc.]
- Shane Legg, "Is There an Elegant Universal Theory of
Prediction?", cs.AI/0606070 [A
nice set of diagonalization arguments against the hope of a universal
prediction scheme which has the nice features of Solomonoff-style induction,
but is actually computable.]
- Jerzy Neyman, First Course in Probability and
Statistics [Fine explanation of his ideas about "rules of inductive
behavior" --- which probably isn't very good methodology, but has the makings
of excellent robotics]
- Leonid Peshkin, "Structure induction by lossless graph compression",
cs.DS/0703132 [Adapting
data-compression ideas to discover hierarchical structures in graphs, e.g., the
4 bases from a tinker-toy model of DNA.]
- Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning",
NIPS 2008
- Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer and Andrew Y. Ng, "Self-taught learning: Transfer learning from unlabeled data",
ICML 2007
[PDF.
This is a clever idea for semi-supervised learning. Given a big supply of
unlabeled examples, and a small number of labeled examples, use the unlabeled
ones to learn a high-level/abstract representation or set of features. Then
use those features in straightforward classifier learning on the
labeled data. (They have a specific idea for learning the higher-level
representation,
by basis
selection, but that's a separable issue.)]
- Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners
of the World, Unite!]
- Gerhard Schurz, "Universal vs. Local Prediction Strategies: A
Game-Theoretical Approach to the Problem of
Induction", phil-sci/3720
[Slides only?!?]
- Spyros Skouras, "Decisionmetrics: Towards a Decision-Based
Approach to Econometrics" [Suppose what you really want to do with your model
is to make decisions, e.g., to buy and sell and make money doing so. Then
fitting the model to minimize a standard error measure, e.g., mean square
error, often gives worse performance than fitting the model to minimize
expected losses. This applies much more broadly than Spyros's financial
examples may suggest.]
- Aris Spanos, "The Curve-Fitting Problem, Akaike-type Model
Selection, and the Error Statistical Approach"
[PDF
preprint]
- Sara van de Geer, Applications of Empirical Process
Theory [A.k.a. Empirical Process Theory in
M-Estimation]
- Greg Ver Steeg, Aram Galstyan, "Discovering Structure in High-Dimensional Data Through Correlation Explanation", arxiv:1406.1222
- Vladimir Vovk, Alex Gammerman and Glenn Shafer, Algorithmic
Learning in a Random World [Mini-review]
- Blaz Zupan, Marko Bohanec, Janez Demsar and Ivan Bratko, "Learning
by discovering concept hierarchies", Artificial
Intelligence 109 (1999): 211--242 [Thanks to Aleks
Jakulin for letting me know about this. PDF preprint]
Not exactly recommended:
- Dana Ballard, An Introduction to Natural Computation
[Review: Not Natural Enough]
- Jacob Feldman, "How surprising is a simple pattern? Quantifying
'Eureka!'," Cognition
93 (2004): 199--224 [Claims to (a) have a psychologically
valid measure of subjective complexity, and (b) derive a null
distribution for it. But the evidence that his particular
complexity measure captures what people do in concept-learning problems
is deferred to other papers.]
- Gilbert Harman and Sanjeev Kulkarni, Reliable Reasoning:
Induction and Statistical Learning Theory [Published by MIT Press; 2006
draft free
online via Prof. Kulkarni (about 100 pages). The technical material
on learning theory is mostly alright, so far
as it goes, but the philosophy is irritatingly lackluster.
Definitely not worth paying what the publisher charges for it. — There is
now a good review by Kevin
Kelly and Conor Mayo-Wilson.]
To read:
- Steven Abney,
"Bootstrapping", ACL 2002,
pp. 360--367 [In the sense of "a problem setting in which one is given a
small set of labeled data and a large set of unlabeled data, and the task is to
induce a classifier", not the famous statistical
procedure]
- Tatsuya Akutsu, Satoru Miyano and Satoru Kuhara, "A simple greedy
algorithm for finding functional relations: efficient implementation and
average case analysis," Theoretical
Computer Science 292 (2002): 481--495
- Atocha Aliseda, Abductive Reasoning: Logical Investigations
into Discovery and Explanation
- Andris Ambainis, "Probabilistic inductive inference: a survey",
cs.LG/9902026 [Taking
"inductive inference" exclusively in the sense of learning recursive
functions]
- Rosa I. Arriaga and Santosh Vempala, "An algorithmic theory of learning: Robust concepts and random projection", Machine Learning 63 (2006): 161--182
- Nihat Ay
- "Locality of global stochastic interaction in directed
acyclic networks," preprint, MPI-MIS
54/2001
- "An information geometric approach to a theory of
pragmatic structuring," MPI-MIS 52/2000
- Vijay Balasubramanian, "Statistical Inference, Occam's Razor,
and Statistical Mechanics on the Space of Probability Distributions",
Neural Computation 9 (1997): 349--368, arxiv:cond-mat/9601030
- Pierre Baldi et al., Modeling the Internet and the Web:
Probabilistic Methods and Algorithms
- William Bechtel and Robert C. Richardson, Discovering
Complexity: Decomposition and Localization as Strategies in Scientific
Research
- Sergey V. Beiden, Marcus A. Maloof and Robert F. Wagner, "A
General Model for Finite-Sample Effects in Training and Testing of Competing
Classifiers", IEEE Transactions on Pattern Analysis and Machine
Intelligence 25 (2003): 1561--1569
- Ron Bekkerman, Mikhail Bilenko and John Langford (eds.), Scaling
up Machine Learning: Parallel and Distributed Approaches
- D. Paul Benjamin (ed.), Change of Representation and
Inductive Bias
- James Blachowicz, Of Two Minds: The Nature of Inquiry
[From the back cover: "The logic of correction developed here
directly opposes the claim made by evolutionary
epistemologists such as Popper and Campbell that
there is no such thing as a 'logical method for having new ideas.' ... This
comprehensive and revolutionary theory challenges traditional epistemology's
conception of justification and provides substantial new interpretations of the
nature of ampliative inference, representation and meaning, Platonic and
Hegelian dialectic, Kantian analysis, the heuristic function of models and
metaphors, and the role of inquiry in the constitution of human
consciousness." All this in only four hundred pages! But the stuff on a
logic of correction is very important --- if correct.]
- Gilles Blanchard and Donald Geman, "Hierarchical testing designs
for pattern recognition", math.ST/0507421 = Annals of
Statistics 33 (2005): 1155--1202
- Avrim Blum and Tom Mitchell, "Combining Labeled and Unlabeled
Data with Co-Training", COLT 98, pp. 92--100
- Avrim Blum, Adam Kalai and Hal Wasserman, "Noise-Tolerant
Learning, the Parity Problem, and the Statistical Query Model,"
cs.LG/0010022
- Leo Breiman, "Prediction Games and Arcing Algorithms," Neural
Computation 11 (1999): 1493--1517
- Robert Alan Brown, Machines that Learn: Based on the
Principle of Empirical Control
- Christopher J. C. Burges, "Dimension Reduction: A Guided Tour",
Foundations and Trends in Machine Learning 2:4 (2010) [Preprint version]
- Meir Buzaglo, The Logic of Concept Expansion
- Adam Cannon, J. Mark Ettinger, Don Hush, and Clint Scovel,
"Machine Learning with Data Dependent Hypothesis Classes," Journal of Machine Learning Research
2 (2002): 335--358
- Philip Ellery Catton, "The Justification(s) of Induction(s)," online
- Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets
Using Efficient Estimation of High-Dimensional Mutual Information", IEEE
Transactions on Neural Networks 16 (2005): 213--224
- Andy Clark and Chris Thornton, "Trading Spaces: Computation,
Representation and the Limits of Uninformed Learning," Behavioral and
Brain Sciences 20 (1997): 57--90
[Draft]
- Bertrand Clarke, "Desiderata for a Predictive Theory of Statistics",
Bayesian Analysis 5 (2010): 1--36
- David Corfield, "Varieties of Justification in Machine Learning",
Minds and Machines 20
(2010): 291--301
- Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters 108 (2012): 120503, arxiv:1005.0005
- Mark Culp, George Michailidis and Kjell Johnson, "On multi-view
learning with additive models", Annals of Applied
Statistics 3 (2009): 292--318
= arxiv:0906.1117
- H. Daume III, D. Marcu, "Domain Adaptation for Statistical Classifiers", arxiv:1109.6341
- Peter Dayan, "Recurrent Sampling Models for the Helmholtz
Machine," Neural
Computation 11 (1999): 653--677
- Carlos R. de la Mora B., Carlos Gershenson and Angelica
Garcia-Vega, "The role of behavior modifiers in representation development",
cs.AI/0403006
- Luc Devroye et al., A Probabilistic Theory of Pattern
Recognition
- Thomas G. Dietterich,
"Machine Learning for Sequential Data"
[PDF.
Thanks to Gustavo Lacerda for a pointer.]
- Nicola Di Mauro, Teresa M.A. Basile, Stefano Ferilli, Floriana Esposito, "Feature Construction for Relational Sequence Learning", arxiv:1006.5188
- Pedro Domingos [All from his web-site]
- A General Method for Scaling Up Machine Learning Algorithms
and its Application to Clustering
- Mining High-Speed Data Streams
- Mining Time-Changing Data Streams
- Dowe, Korb and Oliver (eds.), Information, Statistics and
Induction in Science
- Deniz Erdogmus, Kenneth E. Hild, II, Yadunandana N. Rao and
José C. Príncipe, "Minimax Mutual Information Approach for
Independent Component Analysis", Neural
Computation 16 (2004): 1235--1252
- Oleg V. Favorov and Dan Ryder, "SINBAD: A neocortical mechanism for
discovering environmental variables and regularities hidden in sensory
input", Biological
Cybernetics 90 (2004): 191--202
- Aidan Feeney and Evan Heit (eds.), Inductive Reasoning:
Experimental, Developmental, and Computational Approaches
- David Finton, "When Do Differences Matter? On-Line Feature
Extraction Through Cognitive Economy", cs.LG/0404032
= Cognitive
Systems Research 6 (2005): 263--281
- Gary William Flake, "The Calculus of Jacobian Adaptation"
- Francois Fleuret and Eric Brunet, "DEA: An Architecture for Goal
Planning and Classification," Neural Computation
12 (2000): 1987--2008
- Flocchini et al. (eds.), Structure, Information and
Communication Complexity
- Malcolm R. Forster, "How do Simple Rules 'Fit to Reality' in a
Complex World?", Minds and Machines 9 (1999):
543--564 [A take on the Gigerenzer et al. idea of fast and frugal heuristics,
especially their ecological adaptation to the environment. "The main purpose
of this article is to apply these ideas to learning rules --- methods for
constructing, selecting or evaluating competing hypotheses in science, and to
the methodology of machine learning... The bad news is that ecological
validity is particularly difficult to implement and difficult to understand.
The good news is that it builds an important bridge from normative psychology
and machine learning to recent work in the philosophy of science, which
considers predictive accuracy to be a primary goal of science."]
- Paul Franceschi, "A Solution to Goodman's Paradox,"
Dialogue 40 (2001) [online]
- Vinod Goel and Raymond J. Dolan, "Differential involvement of left
prefrontal cortex in inductive and deductive reasoning", Cognition
93 (2004): B109--B121
- Ulf Grenander, Abstract Inference
- Ulf Grenander and Michael Miller, Pattern Theory: From Representation to Inference
- Laszlo Gyorfi et al., A Distribution-Free Theory of
Nonparametric Regression
- Stephen José Hanson et al., eds., Computational
Learning Theory and Natural Learning Systems
- I: Constraints and Prospects
- II: Interactions between Theory and Experiment
- Petr Hajek and Martin Holena, "Formal logics of discovery and
hypothesis formation by machine," Theoretical
Computer Science 292 (2002): 345--357
- Peter Hall and Qiwei Yao, "Approximating conditional distribution
functions using dimension reduction", math.ST/0507432 = Annals of
Statistics 33 (2005): 1404--1421
- Gilbert H. Harman, "The Inference to the Best Explanation",
The Philosophical Review 74 (1965):
88--95 [JSTOR; thanks to Kenny
Easwaran for the pointer]
- Patrick Heas and Mihai Datcu, "Supervised learning on graphs of
spatio-temporal similarity in satellite image
sequences", arxiv:0709.3013
- David F. Hendry and Jurgen A. Doornik, Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics
- Jaakko Hintikka
- Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka and Hannu
Toivonen, "TANE: An Efficient Algorithm for Discovering Functional and
Approximate Dependencies," The Computer Journal
42 (1999): 100--111
- Eyke Hüllermeier, Willem Waegeman, "Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods", arxiv:1910.09457
- Christian Igel and Marc Toussaint, "On Classes of Functions for
which No Free Lunch Results Hold," cs.NE/0108011
- Lancelot F. James, David J. Marchette and Carey Priebe, "Consistent
estimation of mixture
complexity", Annals of
Statistics 29 (2001): 1281--1296
- John R. Josephson and Susan G. Josephson (eds.), Abductive
Inference: Computation, Philosophy, Technology
- Yuri Kalnishkan, Vladimir Vovk and Michael V. Vyugin, "How many
strings are easy to predict?", Information and
Computation 201 (2005): 55--71 ["It is well known
in the theory of Kolmogorov complexity that most strings cannot be compressed;
more precisely, only exponentially few (O(2^(n-m))) binary strings of length n
can be compressed by m bits. This paper extends the 'incompressibility'
property of Kolmogorov complexity to the 'unpredictability' property of
predictive complexity. The 'unpredictability' property states that predictive
complexity (defined as the loss suffered by a universal prediction algorithm
working infinitely long) of most strings is close to a trivial upper bound (the
loss suffered by a trivial minimax constant prediction strategy). We show that
only exponentially few strings can be successfully predicted and find the base
of the exponent."]
- Michael Kearns and Dana Ron, "Algorithmic Stability and
Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural
Computation 11 (1999): 1427--1453
- Kevin T. Kelly
- The Logic of Reliable Inquiry
[Includes cartoons by the author]
- Eric D. Kolaczyk and Robert D. Nowak, "Multiscale likelihood
analysis and complexity penalized estimation", math.ST/0406424 = Annals
of Statistics
32 (2004): 500--527
- Ingo Kreuz and Dieter Roller, "Relevant Knowledge First:
Reinforcement Learning and Forgetting in Knowledge Based Configuration," cs.AI/0109034
- Henry E. Kyburg Jr. and Choh Man Teng, "Evaluating Defaults," cs.AI/0207083
- Steffen Lange and Gunter Grieser, "Variants of iterative
learning," Theoretical
Computer Science 292 (2002): 359--376
- Nicolas Le Roux and Yoshua Bengio, "Deep Belief Networks Are Compact Universal Approximators", Neural
Computation 22 (2010): 2192--2207
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive
Density Estimation, Data Compression, and Model Selection", IEEE Transactions on
Information Theory 50 (2004): 2708--2726
- Stephen Luttrell, "Using Self-Organising Mappings to Learn the
Structure of Data Manifolds", cs.NE/0406017
- David J. C. MacKay, Information Theory, Inference and
Learning Algorithms [Online
version]
- Adrian Mackenzie, Machine Learners:
Archaeology of a Data Practice
- Sridhar Mahadevan, Representation Discovery Using Harmonic
Analysis
- Gideon S. Mann and Andrew McCallum, "Generalized
expectation criteria for semi-supervised learning with weakly
labeled data", Journal of Machine Learning Research
11 (2010): 955--984
- Heikki Mannila and Kari-Jouko Räihä, "On the complexity
of inferring functional dependencies," Discrete Applied
Mathematics 40 (1992): 237--243
- Martin and Osherson, Elements of Scientific Inquiry
[A good introduction to the theory of formal learning, especially of recursive
functions in the absence of noise. Not even hand-waving that this is a
sensible idealization of what scientists do.]
- Conor Mayo-Wilson, Combining Causal Theories and Dividing Scientific Labor [Ph.D. thesis, CMU Philosophy Dept., 2012; thanks to Dr. Mayo-Wilson for a copy]
- Geoffrey J. McLachlan, Discriminant Analysis and Statistical
Pattern Recognition
- Abraham Meidan and Boris Levin, "Choosing from Competing Theories
in Computerised Learning", Minds and Machines 12
(2002): 119--129
- I. J. Myung, Vijay Balasubramanian and M. A. Pitt, "Counting
probability distributions: Differential geometry and model selection",
Proceedings of the National Academy of Sciences (USA)
97 (2000): 11170--11175
- National Research Council, Massive Data Sets
[Online]
- O. Nelles, Nonlinear System Identification
- Ilya Nemenman, "Fluctuation-Dissipation Theorem and Models of
Learning", Neural
Computation 17 (2005): 2006--2033 ["We analyze how
various abstract Bayesian learners perform on different data and argue that it
is difficult to determine which learning-theoretic computation is performed by
a particular organism using just its performance in learning a stationary
target (learning curve). Based on the fluctuation-dissipation relation in
statistical physics, we then discuss a different experimental setup that might
be able to solve the problem."]
- Kamal Nigam and Rayid Ghani, "Analyzing the Effectiveness
and Applicability of Co-training", CIKM 2000, pp. 86--93
- Liam Paninski, "Asymptotic Theory of Information-Theoretic
Experimental Design", Neural
Computation 17 (2005): 1480--1507
- Hanchuan Peng, Fuhui Long and Chris Ding, "Feature Selection Based
on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and
Min-Redundancy", IEEE
Transactions on Pattern Analysis and Machine
Intelligence 27 (2005): 1226--1238 [This sounds
like an idea I had in 2002, and was too dumb/lazy to follow up on.]
- Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau and Leslie Pack
Kaelbling, "Learning to Cooperate via Policy Search,"
cs.LG/0105032
- Leonid Peshkin and Christian R. Shelton, "Learning from Scarce
Experience,"
cs.AI/0204043
- Karl
Pfleger
- On-Line Learning of Undirected Sparse n-grams
- Learning Predictive Compositional Hierarchies
[PS.gz]
- Fenna H. Poletiek, Hypothesis Testing Behaviour
[Review by
Denny Borsboom]
- Joel B. Predd, Sanjeev R. Kulkarni and H. Vincent Poor
- "Consistency in Models for Distributed Learning under
Communication Constraints", cs.IT/0503071
- "Distributed Learning in Wireless Sensor Networks", cs.IT/0503072
- Detlef Prescher, "A Tutorial on the Expectation-Maximization
Algorithm Including Maximum-Likelihood Estimation and EM Training of
Probabilistic Context-Free Grammars", cs.CL/0412015
- Vasin Punyakanok and Dan Roth, "The Use of Classifiers in
Sequential Inference,"
cs.LG/0111003
- Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer and Neil D. Lawrence (eds.), Dataset Shift in Machine Learning
- Maxim Raginsky, "A complexity-regularized quantization approach to
nonlinear dimensionality reduction", cs.IT/0501091
- Magnus Rattray, "Stochastic trapping in a solvable model of
on-line independent component analysis,"
cond-mat/0105057
- Suman Ravuri, Mélanie Rey, Shakir Mohamed, Marc Deisenroth, "Understanding Deep Generative Models with Generalized Empirical Likelihoods", arxiv:2306.09780
- Salah Rifai, Yoshua Bengio, Yann Dauphin, Pascal Vincent, "A Generative Process for Sampling Contractive Auto-Encoders", ICML 2012, arxiv:1206.6434
- Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito, "On Learning with
Integral
Operators", Journal
of Machine Learning Research 11 (2010): 905--934
- Dan Roth, "Learning in Natural Language: Theory and Algorithmic
Approaches" [online]
- Hichem Sahbi and Donald Geman, "A Hierarchy of Support Vector
Machines for Pattern Detection", Journal of
Machine Learning Research 7 (2006): 2087--2123
- Erik Sandewall, Features and Fluents: The Representation of
Knowledge about Dynamical systems
- Gerhard Schurz
- "Meta-Induction and the Prediction Game: A New View On Hume's Problem" [PDF preprint]
- "Patterns of Abduction" [PDF preprint]
- Alcino J. Silva, Anthony Landreth, and John Bickle, Engineering the Next Revolution in Neuroscience: The New Science of Experiment Planning
- Aris Spanos
- "Statistical Induction, Severe Testing, and Model
Validation" [Preprint]
- "Revisiting data mining: `hunting' with or without a
license", Journal of Economic Methodology 7
(2000): 231--264 [PDF
reprint]
- Peter Sollich and Anason Halees, "Learning curves for Gaussian
process regression: Approximations and bounds,"
cond-mat/0105015
- Ray Solomonoff's
Papers
- Sonnenburg et al., "The SHOGUN Machine Learning Toolbox",
Journal of Machine Learning Research 11 (2010): 1799--1802
- Eduardo D. Sontag, "Adaptation Implies Internal Model,"
math.OC/0203228
- Susanne Still, "Information theoretic approach to interactive learning", arxiv:0709.1948
- Ron Sun and C. L. Giles (eds.), Sequence Learning: Paradigms,
Algorithms, and Applications
- Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.),
Optimization for Machine Learning
- Sebastian Thrun and Lorien Pratt (eds.), Learning to
Learn
- Robert Tibshirani and Larry Wasserman, "Correlation-sharing for
detection of differential gene
expression", math.ST/0608061
["Our proposal averages the univariate scores of each feature with the scores
in correlation neighborhoods. ... The general idea of correlation-sharing can
be applied to other prediction problems involving a large number of correlated
features."]
- Nicholas B. Turk-Browne, Brian J. Scholl, Marvin M. Chun, and Marcia K. Johnson, "Neural Evidence of Statistical Learning: Efficient Detection
of Visual Regularities Without Awareness", Journal of Cognitive Neuroscience 21 (2009): 1934--1945
- Richard Turner, Maneesh Sahani, "A Maximum-Likelihood
Interpretation for Slow Feature Analysis", Neural
Computation
19 (2007): 1022--1038
- Peter D. Turney, "How to shift bias: Lessons from the Baldwin
effect," Evolutionary Computation 4
(1996): 271--295 [online]
- Satoshi Watanabe, Knowing and Guessing: A Quantitative Study
of Inference and Information
- Ying Yang, Xindong Wu and Xingquan Zhu, "Mining in Anticipation for
Concept Change: Proactive-Reactive Prediction in Data
Streams", Data
Mining and Knowledge Discovery 13 (2006): 261--289
- H. Zha, X. He, C. Ding, M. Gu and H. Simon, "Bipartite Graph
Partitioning and Data Clustering,"
cs.IR/0108018
Color me skeptical:
- Menachem Stern, Arvind Murugan, "Learning without neurons in physical systems", arxiv:2206.05831
To write:
- CRS, Causal Architecture and Model Discovery: Theory,
Algorithms and Examples
- CRS, "Three Kinds of Complexity in Prediction: Induction,
Estimation and Calculation"