Short description:
Hall's CFS is a combinatorial correlation-based feature selection algorithm.
A greedy best-first search strategy is used to identify features with
high correlation to the response variable but low correlation amongst each other
based on the following scoring function:

Merit(S) = (k * r_cf) / sqrt(k + k * (k - 1) * r_ff)

where S is the selected subset with k features, r_cf is the average
feature-class correlation and r_ff is the average feature-feature correlation.
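For illustration, the merit score can be computed as in the following minimal R sketch (plain Pearson correlations are used here as a simple stand-in for the correlation measures of Hall's original formulation; X is assumed to be a sample-by-feature matrix and y a numeric outcome vector):

    # Minimal sketch of the CFS merit score; Pearson correlations stand in
    # for the correlation measures used in Hall's original formulation.
    cfs_merit <- function(X, y) {
      k    <- ncol(X)                             # number of selected features
      r_cf <- mean(abs(cor(X, y)))                # average feature-class correlation
      r_ff <- if (k > 1) mean(abs(cor(X)[upper.tri(diag(k))])) else 0
      (k * r_cf) / sqrt(k + k * (k - 1) * r_ff)   # merit of the subset S
    }

A greedy best-first search would repeatedly add (or remove) the feature that most improves this merit.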
References:
- Hall, M.A., Correlation-based feature selection for discrete and numeric class machine learning,
Proceedings of the Seventeenth International Conference on Machine Learning (2000), p. 359-366
PLS-CV - Partial Least Squares Cross-Validation
Short description:
The importance of features is estimated based on the magnitudes of the coefficients obtained
from training a Partial Least Squares classifier. The
number of PLS-components n is selected based on the cross-validation accuracies
for 20 random 2/3-partitions of the data for all possible values of n.
We use the PLS-implementation in R by Boulesteix et al.
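For illustration, the coefficient-based ranking can be sketched in R with the generic pls package as a stand-in for the Boulesteix et al. implementation (X: sample-by-gene matrix, y: numeric 0/1 class labels; the selection of n via 20 random 2/3-partitions is omitted for brevity):

    library(pls)
    # Rank features by the magnitude of their PLS regression coefficients.
    rank_pls <- function(X, y, ncomp = 3) {
      fit <- plsr(y ~ X, ncomp = ncomp)    # PLS regression with n components
      w   <- abs(drop(coef(fit, ncomp)))   # coefficient magnitudes as importance
      order(w, decreasing = TRUE)          # feature indices, most important first
    }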
References:
- Boulesteix, A.-L. and Strimmer, K., Partial least squares: a versatile tool for the analysis of high-dimensional genomic data,
Briefings in Bioinformatics (2007) 8(1), p. 32-44
Significance analysis of microarrays (SAM)
Short description:
SAM (Tusher et al., 2001) is a method to detect differentially expressed genes
that uses permutations of the measurements to assign significance values to selected genes.
Based on the expression level change in relation to the standard deviation across
the measurements a score is calculated for each gene and the genes are filtered
according to a user-adjustable threshold (delta). The False Discovery Rate (FDR),
i.e. the percentage of genes selected by chance, is then estimated from multiple
permutations of the measurements. We use the standard SAM-implementation from the samr-package (v1.25).
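A basic two-class call following this description might look as follows (X: gene-by-sample matrix of log2 expression values, y: class labels coded 1 and 2; the delta value of 1.0 is only an illustration, not a recommended setting):

    library(samr)
    dat <- list(x = X, y = y, geneid = rownames(X),
                genenames = rownames(X), logged2 = TRUE)
    sam.obj     <- samr(dat, resp.type = "Two class unpaired", nperms = 100)
    delta.table <- samr.compute.delta.table(sam.obj)   # estimated FDR per delta
    siggenes    <- samr.compute.siggenes.table(sam.obj, del = 1.0,
                                               dat, delta.table)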
References:
- Tusher, V., Tibshirani, R. and Chu, G.: Significance analysis of microarrays applied to the ionizing radiation response, PNAS 2001 (98), p. 5116-5121
Empirical Bayes moderated t-test (eBayes)
Short description:
The empirical Bayes moderated t-statistic (eBayes; Loennstedt and Speed, 2002) ranks genes by
testing whether all pairwise contrasts between different outcome-classes are zero.
An empirical Bayes method is used to shrink the probe-wise sample-variances towards
a common value and to augment the degrees of freedom for the individual variances (Smyth, 2004).
For multiclass problems the F-statistic is computed as an overall test from the t-statistics
for every genetic probe.
We use the eBayes-implementation in the R-package limma (v2.12).
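In limma this corresponds to a short standard pipeline, sketched below (exprs: gene-by-sample matrix of normalized values and groups: factor of outcome classes are assumed inputs):

    library(limma)
    design <- model.matrix(~ groups)   # intercept + class contrasts
    fit    <- lmFit(exprs, design)     # probe-wise linear models
    fit    <- eBayes(fit)              # moderated t-/F-statistics
    topTable(fit, coef = 2:ncol(design), number = 20)  # top-ranked probes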
References:
- Loennstedt, I. and Speed, T. P. (2002). Replicated microarray data. Statistica Sinica 12, 31-46.
- Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, No. 1, Article 3
RF-MDA - Random Forest Feature Selection
Short description:
A random forest (RF) classifier with 200 trees is applied and
the importance of features is estimated by means of the
mean decrease in accuracy (MDA) for the out-of-bag samples.
We use the RF implementation from the "randomForest" R-package
based on L. Breiman's random forest algorithm.
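A corresponding call in R might look like this (X: sample-by-feature matrix and y: factor of class labels are assumed inputs):

    library(randomForest)
    rf  <- randomForest(x = X, y = y, ntree = 200, importance = TRUE)
    mda <- importance(rf, type = 1)          # type 1 = mean decrease in accuracy
    head(sort(mda[, 1], decreasing = TRUE))  # top-ranked features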
References:
- Breiman, L. (2001), Random Forests, Machine Learning 45(1), p. 5-32
ENSEMBLE - Ensemble Feature Selection
Short description:
This selection method combines three univariate filters
into an ensemble feature ranking.
The used filters are a correlation filter based on the
absolute Pearson correlation between a feature vector
and the outcome vector, a signal-to-noise-ratio (SNR) filter
which extends the SNR-measure to multiple classes based on the
pairwise SNRs, and an F-score filter.
All filters receive the same weight and the final ranking
is obtained as the sum of the individual ranks.
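For illustration, a minimal two-class R sketch of this rank aggregation (X: sample-by-feature matrix, y: 0/1 class labels; the multi-class SNR extension via pairwise SNRs is omitted for brevity):

    # Equal-weight rank-sum ensemble of three univariate filters.
    ensemble_rank <- function(X, y) {
      cor.score <- abs(apply(X, 2, cor, y))          # absolute Pearson correlation
      m <- function(g) colMeans(X[y == g, , drop = FALSE])
      s <- function(g) apply(X[y == g, , drop = FALSE], 2, sd)
      snr.score <- abs(m(1) - m(0)) / (s(1) + s(0))  # two-class signal-to-noise ratio
      f.score   <- apply(X, 2, function(v)           # F-score (one-way ANOVA)
        summary(aov(v ~ factor(y)))[[1]]$`F value`[1])
      total <- rank(-cor.score) + rank(-snr.score) + rank(-f.score)
      order(total)                                   # final ensemble ranking
    }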
Golub et al. (1999) Leukemia data set
Short description:
Analysis of patients with acute lymphoblastic leukemia (ALL, 1) or acute myeloid leukemia (AML, 0).
Sample types: ALL, AML
No. of genes: 7129
No. of samples: 72 (class 0: 25, class 1: 47)
Normalization: VSN (Huber et al., 2002)
References:
- Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science (1999) 286, p. 531-537
- Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104
van't Veer et al. (2002) Breast cancer data set
Short description:
Samples from breast cancer patients were subdivided into a "good prognosis" (0) and a "poor prognosis" (1) group depending on the occurrence of distant metastases within 5 years.
The data set is pre-processed as described in the original paper and was obtained from the R package "DENMARKLAB" (Fridlyand and Yang, 2004).
Sample types: good prognosis, poor prognosis
No. of genes: 4348 (pre-processed)
No. of samples: 97 (class 0: 51, class 1: 46)
Normalization: see reference (van't Veer et al., 2002)
References:
- van't Veer et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature (2002), 415, p. 530-536
- Fridlyand, J. and Yang, J.Y.H. (2004) Advanced microarray data analysis: class discovery and class prediction (http://genome.cbs.dtu.dk/courses/norfa2004/Extras/DENMARKLAB.zip)
Yeoh et al. (2002) Leukemia multi-class data set
Short description:
A multi-class data set for the prediction of the disease subtype in pediatric acute lymphoblastic leukemia (ALL).
No. of genes: 12625
No. of samples: 327
Normalization: VSN (Huber et al., 2002)
References:
- Yeoh et al., Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell (2002) 1, p. 133-143
- Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104
Alon et al. (1999) Colon cancer data set
Short description:
Analysis of colon cancer tissues (1) and normal colon tissues (0).
Sample types: tumour, healthy
No. of genes: 2000
No. of samples: 62 (class 1: 40, class 0: 22)
Normalization: VSN (Huber et al., 2002)
References:
- U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and
A. Levine, Broad patterns of gene expression revealed by clustering
analysis of tumor and normal colon tissues probed by oligonucleotide
arrays, Proceedings of the National Academy of Sciences (1999), vol. 96, pp. 6745-6750
- Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104
Singh et al. (2002) Prostate cancer data set
Short description:
Analysis of prostate cancer tissues (1) and normal tissues (0).
Sample types: tumour, healthy
No. of genes: 2135 (pre-processed)
No. of samples: 102 (class 1: 52, class 0: 50)
Normalization: GeneChip RMA (GCRMA)
References:
- D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D'Amico, J.P. Richie, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2): pp. 203-209, 2002
- Z. Wu and R.A. Irizarry. Stochastic Models Inspired by Hybridization Theory for Short Oligonucleotide Arrays. Journal of Computational Biology, 12(6): pp. 882-893, 2005
Shipp et al. (2002) B-Cell Lymphoma data set
Short description:
Analysis of Diffuse Large B-Cell lymphoma samples (1) and follicular B-Cell lymphoma samples (0).
Sample types: DLBCL, follicular
No. of genes: 2647 (pre-processed)
No. of samples: 77 (class 1: 58, class 0: 19)
Normalization: VSN (Huber et al., 2002)
References:
- M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C.T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1): pp. 68-74, 2002
- Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104
Shin et al. (2007) T-Cell Lymphoma data set
Short description:
Analysis of cutaneous T-Cell lymphoma (CTCL) samples from lesional skin biopsies. Samples are divided into lower-stage (stages IA and IB, 0) and higher-stage (stages IIB and III, 1) CTCL.
Sample types: lower_stage, higher_stage
No. of genes: 2922 (pre-processed)
No. of samples: 63 (class 1: 20, class 0: 43)
Normalization: VSN (Huber et al., 2002)
References:
- J. Shin, S. Monti, D. J. Aires, M. Duvic, T. Golub, D. A. Jones and T. S. Kupper, Lesional gene expression profiling in cutaneous T-cell lymphoma reveals natural
clusters associated with disease outcome. Blood, 110(8): pp. 3015, 2007
- Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104
Armstrong et al. (2002) Leukemia data set
Short description:
Comparison of three classes of leukemia samples: acute lymphoblastic leukemia (ALL, 0), acute myelogenous leukemia (AML, 1) and ALL with mixed-lineage leukemia gene translocation (MLL, 2).
Sample types: ALL, AML, MLL
No. of genes: 8560 (pre-processed)
No. of samples: 72 (class 0: 24, class 1: 28, class 2: 20)
Normalization: VSN (Huber et al., 2002)
References:
- S.A. Armstrong, J.E. Staunton, L.B. Silverman, R. Pieters, M.L. den Boer, M.D. Minden, S.E. Sallan, E.S. Lander, T.R. Golub, S.J. Korsmeyer; MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30(1): pp. 41-47, 2002
- Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104
SVM - Support Vector Machine
Short description:
Support vector machines (SVMs) are among the most popular
methods in microarray sample classification. The SVM classifier
differs from other learning algorithms in that it selects
the separating hyperplane with the maximum distance to the closest
samples (the maximum margin hyperplane).
Extensions to the linear SVM like the "soft margin" and the "kernel
trick" allow the classifier to deal with mislabelled samples
and to separate non-linearly separable data.
We use the linear kernel C-SVM from the e1071-package,
which is a wrapper for the well-known LibSVM library.
Parameter-optimization is performed via grid-search
in a nested cross-validation routine.
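A simplified version of this setup with the e1071 package (X, y: training data, X.test: held-out samples, all assumed; the cost grid is illustrative, not the server's exact settings):

    library(e1071)
    # Grid-search over the cost parameter C with internal cross-validation.
    tuned <- tune(svm, train.x = X, train.y = y, kernel = "linear",
                  ranges = list(cost = 10^(-2:2)),
                  tunecontrol = tune.control(cross = 5))
    pred  <- predict(tuned$best.model, X.test)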
References:
- Dimitriadou, E. and Hornik, K. and Leisch, F. and Meyer, D. and Weingessel, A. and Leisch, M.F. (2005),
Misc functions of the department of statistics (e1071), TU Wien
- C.-C. Chang and C.-J. Lin (2001), LIBSVM: a library for support vector machines,
http://www.csie.ntu.edu.tw/~cjlin/libsvm
BioHEL
Short description:
BioHEL (Bioinformatics-oriented hierarchical evolutionary learning; J. Bacardit, 2006)
is a rule-based machine learning system using the concept of evolutionary
learning within an iterative rule learning (IRL) framework.
The generated models consist of structured
classification rule sets commonly known as "decision lists". These rule sets
are built by iteratively learning new rules based on an almost standard generational Genetic
Algorithm until the combination of rules covers all observations. Each time
a new rule has been learned, the matching samples are removed from the search space.
For a more detailed description of BioHEL, please consult the references.
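Schematically, the IRL covering loop can be sketched as follows (learn_rule is a hypothetical stand-in for BioHEL's GA-based rule induction and not part of any released package):

    # Schematic iterative rule learning (IRL) covering loop.
    learn_decision_list <- function(data, learn_rule) {
      rules <- list()
      while (nrow(data) > 0) {
        rule    <- learn_rule(data)       # evolve one rule (hypothetical GA step)
        rules   <- c(rules, list(rule))   # append it to the decision list
        covered <- rule$matches(data)     # samples matched by the new rule
        data    <- data[!covered, , drop = FALSE]  # remove them from the search space
      }
      rules                               # ordered rule set ("decision list")
    }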
References:
- Bacardit J, Krasnogor N: BioHEL: Bioinformatics-oriented Hierarchical Evolutionary Learning 2006. [eprints.nottingham.ac.uk]
- Bacardit J, Burke E, Krasnogor N: Improving the scalability of rule-based evolutionary learning. Memetic Computing (to appear)
PAM - Prediction Analysis for Microarrays
Short description:
The Prediction Analysis for Microarrays (PAM; Tibshirani et al., 2002)
method uses the nearest shrunken centroid approach to classify
microarray samples. For each class the centroid is calculated
and shrunken towards the overall centroid for all classes
by a certain amount (depending on a user-defined threshold parameter).
New samples are assigned to the class of the nearest
shrunken centroid. The shrinkage reduces the effect
of noisy genes and removes genes from the selection that are shrunken
to zero for all classes.
We use the standard implementation in the pamr-package and choose
the threshold parameter automatically based on nested cross-validation.
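With the pamr package this corresponds to the following sketch (X: gene-by-sample training matrix, y: class labels, X.test: new samples, all assumed inputs):

    library(pamr)
    dat  <- list(x = X, y = y, geneid = rownames(X))
    fit  <- pamr.train(dat)                    # shrunken centroids over a threshold grid
    cv   <- pamr.cv(fit, dat)                  # cross-validated error per threshold
    best <- cv$threshold[which.min(cv$error)]  # threshold with the lowest CV error
    pred <- pamr.predict(fit, X.test, threshold = best)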
References:
- Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 99, p. 6567-6572
RF - Random Forest classifier
Short description:
Breiman's random forest (RF) classifier uses an ensemble
of unpruned decision trees to make predictions.
Binary partition-trees are grown partially at random
by selecting different random sub-samples of the training data
for each tree and different random sub-samples of features (size m) at
each node split.
Although the single trees typically have only weak predictive power,
the ensemble often provides high accuracies, profiting from
the diversity among trees introduced by the bootstrap sampling routine.
We use the RF implementation from the "randomForest" R-package,
set parameter m to the square root of the total number of predictors and
use an ensemble of 500 trees.
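The corresponding call might look like this (X: sample-by-feature training matrix, y: factor of class labels, X.test: held-out samples, all assumed inputs):

    library(randomForest)
    rf   <- randomForest(x = X, y = y,
                         ntree = 500,                   # ensemble of 500 trees
                         mtry  = floor(sqrt(ncol(X))))  # m = sqrt(no. of predictors)
    pred <- predict(rf, X.test)                         # majority vote of the trees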
References:
- Breiman, L. (2001), Random Forests, Machine Learning 45(1), p. 5-32
kNN - k-Nearest Neighbor classifier
Short description:
The k-nearest neighbor (kNN) algorithm is one of the simplest machine learning
algorithms, but often performs well in microarray sample classification
tasks. Every new sample is assigned to the majority class of its k nearest
neighbors in the training set (usually based on the Euclidean distance metric).
By means of the parameter k the bias/variance tradeoff can be controlled.
Since kNN does not require a training phase, the algorithm has low runtime
requirements even for high-dimensional data sets.
We determine the parameter k automatically using nested cross-validation.
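A minimal call with the class package (X.train, X.test: sample-by-feature matrices and y.train: factor of class labels are assumed; k = 5 is only a placeholder for the cross-validated choice):

    library(class)
    pred <- knn(train = X.train, test = X.test, cl = y.train, k = 5)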
Help
Features: The class assignment module allows
researchers to assess the prediction accuracy that common
machine learning methods can reach for sample classification
on specific types of in-house microarray
data. Please note that trained models are not applicable to
external data from other chip platforms and experimental procedures
(we are planning to integrate cross-study normalization methods
in the future).
The user can choose between four prediction methods (SVM, PAM,
RF, kNN - see the info boxes for each method by clicking
on the question marks) and an ensemble that combines all
algorithms. The prediction accuracy for these
methods is evaluated based on external two-level cross-validation
(Wood et al., 2007), using nested cross-validation for
automatic parameter optimization and including the feature
selection in the cross-validation scheme. All feature selection
methods from the Gene selection module are available and
can be freely combined with different prediction methods.
Users can either upload their own microarray data (see uploading your own data) or use one of the
pre-processed example data sets.
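The two-level cross-validation scheme described above can be sketched schematically in R (select_features and tune_and_train are hypothetical stand-ins for the server's feature selection and inner-loop parameter tuning; X, y are assumed inputs):

    # Schematic two-level cross-validation: feature selection and parameter
    # tuning are repeated inside every outer training fold.
    folds <- split(sample(nrow(X)), rep(1:10, length.out = nrow(X)))
    acc <- sapply(folds, function(test.idx) {
      train.idx <- setdiff(seq_len(nrow(X)), test.idx)
      feats <- select_features(X[train.idx, ], y[train.idx])      # hypothetical
      model <- tune_and_train(X[train.idx, feats], y[train.idx])  # hypothetical
      mean(predict(model, X[test.idx, feats]) == y[test.idx])     # outer-fold accuracy
    })
    mean(acc)   # overall accuracy estimate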
Settings: The only parameter to be chosen by the user
is the number of cross-validation cycles - all other parameters
are automatically chosen by our prediction server.
Output: The prediction analysis report file for a
submitted analysis contains multiple evaluation statistics to
prevent users from drawing false conclusions based on single
performance measures. Microarray sample prediction accuracies
often exhibit high variance across the cross-validation cycles
and in many cases the input data suffers from class imbalances;
thus, in addition to the average classification accuracy we
provide standard deviations, 95% confidence limits, sensitivity
and specificity, and additional statistics like the Matthews
correlation coefficient, Cohen's kappa and a classification p-value
(Huberty, 1994). Please note that in many cases (also for
some of our example data sets) the sample size will be too small
to obtain good performance estimates (for the future we are
planning to integrate additional biological data into the
predictions to overcome some of these common limitations). If
a feature selection method has been chosen by the user in
combination with a classifier, the most frequently selected genes
will be shown in the output as a ranked list based on a Z-score
calculation (see Zhu, 2007). A plot of the Z-scores of frequently
selected genes will help the user to estimate the number of genes
that should be included in a classification model (if only a few
genes with significant Z-scores are available, smaller numbers of
genes should be used in prediction models to avoid including
irrelevant genes). Finally, a heatmap of the expression
values of the top-ranked genes indicates whether the selected
genes are relatively up- or down-regulated in different groups of
samples - for a cancer data set this analysis can help the user
to differentiate between potential tumour suppressor genes and
oncogenes in the data (please note that this is not an
unsupervised analysis).
Uploading your own data: In order to use
ArrayMining.net with your own data there are two possibilities:
Option 1: You can
upload a tab- or
space-delimited text file containing pre-normalized microarray
data in the following simple matrix format (see Fig. 1):
[Fig. 1: gene expression matrix format]
You can download an example data file here (use right-click and "Save as"). The
columns must correspond to the samples and the rows to the genes.
The first column contains the gene identifiers (a unique label
per gene) and the last row the class information for the samples
(multiple samples can have the same class label). The rest of the
matrix should contain normalized expression values obtained using
any of the common microarray normalization methods (e.g. VSN,
RMA, GCRMA, MAS, dChip, etc.). The gene identifiers can be
any one of the following: Affymetrix ID, ENTREZ ID, GENBANK ID.
You can also use your own identifiers; however, in this case you
won't obtain any links to functional annotation databases. The
class labels can be any alphanumeric strings or symbols
(e.g. "tumour" and "healthy", or "1","2", "3", or "leukemia1",
"leukemia2", "leukemia3", etc.). Samples belonging to the same
class need to have exactly the same class label. The last row
containing the class labels has to begin with a user-defined
"sample type"-label, e.g. "phenotypes", "tumours" or just
"labels". Optionally, unique IDs per sample can be specified
in the first row (if this line is missing, the samples
will be numbered consecutively).
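For illustration, a tiny input file with four samples, two genes and made-up expression values could look like this:

    sampleID  S1      S2      S3      S4
    gene_1    5.12    4.87    6.03    5.95
    gene_2    7.45    7.21    3.10    3.38
    labels    tumour  tumour  healthy healthy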
Option 2: You can upload
a compressed ZIP-archive containing Affymetrix CEL-files
and a txt-file containing tab-delimited numerical sample labels (specifying
replicates by the same number, e.g. "1 1 1 2 2 2" for an experiment
with 6 samples, two classes and three samples each in class 1 and class 2).
Please contact us should you
experience any problems when uploading or analyzing your
data.
ENSEMBLE
Short description:
The ensemble predictor combines the SVM, PAM, RF and kNN algorithms
to obtain a more robust sample classification.
Samples are assigned to the class that receives the majority of votes
across the algorithms.
For each of the used methods the parameter selection is performed
automatically (see the descriptions of the corresponding algorithms).
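A minimal sketch of the vote (svm.pred, pam.pred, rf.pred and knn.pred are assumed vectors of predicted class labels for the same samples):

    # Majority vote across the four base classifiers.
    votes <- cbind(as.character(svm.pred), as.character(pam.pred),
                   as.character(rf.pred),  as.character(knn.pred))
    ensemble.pred <- apply(votes, 1, function(v)
      names(which.max(table(v))))   # most frequent label per sample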
Terms and Conditions
This is ArrayMining version 1.0
This service is free for academic and non-commercial use.
If you intend to use results obtained with ArrayMining.net for a publication, please acknowledge/cite our paper:
E. Glaab, J. Garibaldi, N. Krasnogor
ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization
BMC Bioinformatics, Vol. 10, No. 1. (2009), 358.
We cannot guarantee that ArrayMining is free of errors or bugs, but we perform integrity checks and provide validation measures for each analysis module.
If you have any comments or experience problems accessing the server, please do not hesitate to contact us.
We take reasonable measures to protect your data, which includes your dataset, tasks, results and all further information that you provide us, and we will not make them available to any third party.
We collect some data for statistical purposes, which includes your IP address and the tasks performed. This data is never forwarded to third parties.
All your data, apart from data that we keep for statistical purposes, will be deleted after the expiration time.
Arraymining.net - Newsletter
Stay informed about updates and new features on our website
by joining our newsletter. Your email address remains strictly confidential
and will only be used to inform you about major updates of our web-service (<= 1 email per month).
You can unsubscribe at any time by clicking on the unsubscribe link at the bottom of our e-mails.
Arraymining.net - Newsletter
Thank you for subscribing. A confirmation message will be sent to you soon.
ArrayMining - Online Microarray Data Mining
Ensemble and Consensus Analysis Methods for Gene Expression Data