David Dowe's data links
[See also Ray Solomonoff (1926-2009) 85th memorial conference
(Wed 30 Nov - Fri 2 Dec 2011), 1st Call for Papers.]
Machine learning, statistics and "data mining" data
U. Calif. Irvine (UCI) ICS KDD Archive, Machine Learning Repository
and other machine learning repositories and sites.
NIST (U.S.)'s Info. Tech. Lab.'s Statistical Reference Datasets (StRD)
and Dataset archives.
CMU Dept. of Statistics's StatLib links, Datasets Archive
and "other places" and statistical archives.
Baylor University Libraries Computer Science Data Repositories.
Machine Learning Resources - Data Repositories and competitions,
maintained by David Aha.
Online Machine Learning Resources: ML Benchmarks and other Data Sources.
Bayesian Network data sets (or Bayes Net data sets) - see also
Bayesian Networks using MML.
Dept of Computer Science, University of Toronto's
Data for Evaluating Learning in Valid Experiments (DELVE)'s
Datasets Summary Table, including the Titanic dataset.
KDNuggets's Datasets for "Data Mining" and "Data Mining" Competitions.
"The Data Mine"'s
Data Sources.
Rob Hyndman's Time Series Data Library
and CEC2000's Time series prediction competitions.
UCR Time Series Data Mining Archive, linked to by Eamonn Keogh.
Data on the Web - Faculty of Business and Economics,
University of Sydney, Australia.
AskDrMath (The Math Forum - Math Library)'s Data Sets, Prob/Stat
and Statistics: Data Sets.
Brookhaven Protein Database (old site) Gopher;
SWISS-PROT Protein Sequence Database
and CSSE Contig Restriction Site Mapping
and links (human genome project, etc.).
Kathleen Cuningham Foundation Consortium for research into
Familial Breast cancer (http://www.kconfab.org)'s
policies and procedures for accessing kConFab data.
European Pulsar Network Data Archive (and mirror site)
(and disclaimer)'s index (and Russell Edwards's comments):
Data Archive.
Statistical Society of Canada's Case Studies in Data Analysis for 2000.
Bayesian networks repository (started by Nir Friedman);
Bayesian networks and Related sites.
University of Fribourg Section of Chemistry's
Useful Chemistry Links and databases.
ICMAS-2000: market game and ICMAS-00 Trading Agent Competition Overview.
Linguistic Data Consortium (LDC): LDC-Online, LDC Catalog(ue),
Obtaining corpora and Search LDC Web site.
Links to
text analysis resources.
Geoff McLachlan and David Peel's "Finite Mixture Models" and data sets.
Australian Antarctic Division (AAD) and Australian Antarctic Data Centre.
Search and Rescue Data collection form (HTML, Word97, postscript, pdf) -
Charles Twardy.
Competitions
Machine Learning Resources - Data Repositories and competitions,
maintained by David Aha.
KDNuggets's Datasets for "Data Mining" and "Data Mining" Competitions.
Rob Hyndman's Time Series Data Library
and CEC2000's Time series prediction competitions.
ICMAS-2000: market game and ICMAS-00 Trading Agent Competition Overview.
KDD Cup 2000, e-mail:
kddcup2000@bluemartini.com.
Homepage of The Insurance Company (TIC) Benchmark.
Other data
Some links to
chess and games data.
Sports: Australian Rules football with data since 1993, data since 1998,
other footy statistics and some other sports data.
Medical links (with some Medical data links),
and EEG data (electroencephalograph data) from
UCI KDD Archive (http://kdd.ics.uci.edu).
www.statoo.com:
"the portal to statistics on the internet" (so they say).
Links to Random number generation software
(Pseudo-)random number generation software in Fortran:
uniform (for multinomial), Gaussian (Normal), von Mises (circular)
and Poisson.
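For readers without a Fortran compiler, the same four families of variates can be sketched in Python's standard library. This is only an illustration and is not the linked Fortran software: the helper names (sample_multinomial, sample_poisson, draw_samples), the seed and the parameter values are all assumptions made for the example. The multinomial draw shows why a uniform generator is listed "for multinomial": one uniform variate is enough to pick a category by inverting the cumulative probabilities.

```python
import math
import random


def sample_multinomial(rng, probs):
    """Pick one category index by inverting the CDF with a uniform variate."""
    u = rng.random()  # uniform on [0, 1)
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam, slow for large."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def draw_samples(seed=0, n=1000):
    """Draw n variates from each family (all parameter values are assumed)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return {
        "multinomial": [sample_multinomial(rng, [0.2, 0.5, 0.3]) for _ in range(n)],
        "gaussian": [rng.gauss(0.0, 1.0) for _ in range(n)],
        # von Mises (circular): mean angle 0 radians, concentration kappa = 2
        "von_mises": [rng.vonmisesvariate(0.0, 2.0) for _ in range(n)],
        "poisson": [sample_poisson(rng, 3.0) for _ in range(n)],
    }
```

Calling draw_samples(seed=1) returns a dictionary of four lists; the von Mises values are angles in radians, and the Poisson values are non-negative integer counts.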
Random number generation (and other) publications by Chris Wallace:
TR #89/123 (Feb. 1989), 1990, 1996.
http://www.almaden.ibm.com/cs/quest:
synthetic market-basket dataset
generator.
http://www.almaden.ibm.com/cs/people/bayardo/vinci/maxminer.html:
max-miner algorithm, which generates frequent itemsets against which you can
check your own algorithm's output.
(Use the FINDALL option unless you want only the maximal frequent itemsets.)
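On very small data sets, a brute-force enumeration can serve as a second reference when checking an itemset miner. The sketch below is not the max-miner algorithm and does not call the linked program; it simply counts support for every candidate itemset, and the function names and the toy transactions are assumptions made for illustration. It also shows the distinction behind the FINDALL option: all frequent itemsets versus only the maximal ones.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            # Support = number of transactions containing every item in the set
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                result[itemset] = support
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so no larger one can be frequent
    return result


def maximal_only(frequent):
    """Keep only itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


# Toy market-basket data (hypothetical)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
all_frequent = frequent_itemsets(baskets, min_support=2)
maximal = maximal_only(all_frequent)
```

With a minimum support of 2, all_frequent holds five itemsets (three single items plus {bread, milk} and {bread, butter}), while maximal keeps only the two pairs. The enumeration is exponential in the number of items, so it is suitable only for small test cases.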
http://lib.stat.cmu.edu/DASL/DataArchive.html.
Random number (generator)s and Monte Carlo methods:
Information Servers, Theory, Applications and Software.
Other RNG software:
"C Programming"; "Code Snippets"; "Portable functions and headers";
"Random number functions"; "Rand1.C".
Data analysis and "data mining"
Minimum Message Length (MML), an operational form of Occam's razor
[see also Minimum Description Length, MDL].
Clustering, mixture modelling and unsupervised learning.
Miscellaneous, other, links
Chris Wallace (1933-2004) (developer of MML in 1968),
Wallace, C.S. (2005) [posthumous],
Statistical and Inductive Inference by Minimum Message Length,
Springer (Series: Information Science and Statistics), 2005, XVI,
432 pp., 22 illus., Hardcover, ISBN: 0-387-23795-X
[table of contents, chapter headings and more],
Wallace, C.S. (with D. L. Dowe),
"Minimum Message Length and Kolmogorov complexity",
Comp. J., Vol. 42, No. 4 (1999), pp. 270-283
[this article is the Computer Journal's most downloaded "full text as
.pdf" - see, e.g., here],
Bayesian networks using MML,
clustering and mixture modelling,
decision trees and decision graphs using MML,
"Minimum Message Length, MDL and Generalised Bayesian Networks with Asymmetric
Languages",
by J. W. Comley and
D.L. Dowe;
Chapter 11 (pp. 265-294)
in P. Grünwald, M. A. Pitt and I. J. Myung (eds.),
Advances in Minimum Description Length: Theory and Applications,
M.I.T. Press, April 2005, ISBN 0-262-07262-9.
{This chapter is about Generalised Bayesian nets (including the special case
of hybrid Bayesian nets), which generalise MML Bayesian nets (also called
MML Bayesian networks or MML Bayes nets); it deals with
a mix of both continuous and discrete variables.
(See also Comley and Dowe (2003), .pdf.)}
Occam's razor (Ockham's razor),
Snob (program for MML clustering and mixture modelling),
(econometric) time series using MML,
medical research,
a probabilistic sports prediction competition
(and further reading on probabilistic scoring),
chess and game theory research,
TheHungerSite,
TheRainforestSite,
"do-goody"/"do-goody stuff, improving the world and saving the planet".
Please e-mail me if you would like to know more.
This page,
http://www.csse.monash.edu.au/~dld/datalibrary.html ,
was last updated no earlier than 18th Apr. 2000.
Copyright
David L. Dowe,
Monash University, Australia,
15 March 2000.