ArrayMining - Online Microarray Data Mining
Ensemble and Consensus Analysis Methods for Gene Expression Data
Contact
The project and the website are maintained by Enrico Glaab, School of Computer Science, Nottingham University, UK (webmaster@arraymining.net).

Golub et al. (1999) Leukemia data set
Short description: Analysis of patients with acute lymphoblastic leukemia (ALL, 1) or acute myeloid leukemia (AML, 0).
Sample types: ALL, AML
No. of genes: 7129
No. of samples: 72 (class 0: 25, class 1: 47)
Normalization: VSN (Huber et al., 2002)
References:
Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science (1999), 531–537
Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl. 1, 96–104
van't Veer et al. (2002) Breast cancer data set
Short description: Samples from breast cancer patients were subdivided into a "good prognosis" (0) and a "poor prognosis" (1) group depending on the occurrence of distant metastases within 5 years. The data set was preprocessed as described in the original paper and obtained from the R package "DENMARKLAB" (Fridlyand and Yang, 2004).
Sample types: good prognosis, poor prognosis
No. of genes: 4348 (preprocessed)
No. of samples: 97 (class 0: 51, class 1: 46)
Normalization: see reference (van't Veer et al., 2002)
References:
van't Veer et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature (2002), 415, pp. 530–536
Fridlyand, J. and Yang, J.Y.H. (2004) Advanced microarray data analysis: class discovery and class prediction (http://genome.cbs.dtu.dk/courses/norfa2004/Extras/DENMARKLAB.zip)
Yeoh et al. (2002) Leukemia multiclass data set
Short description: A multiclass data set for the prediction of the disease subtype in pediatric acute lymphoblastic leukemia (ALL).
No. of samples: 327
Normalization: VSN (Huber et al., 2002)
References:
Yeoh et al., Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell, March 2002, 1: 133–143
Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl. 1, 96–104
Alon et al. (1999) Colon cancer data set
Short description: Analysis of colon cancer tissues (1) and normal colon tissues (0).
Sample types: tumour, healthy
No. of genes: 2000
No. of samples: 62 (class 1: 40, class 0: 22)
Normalization: VSN (Huber et al., 2002)
References:
U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences (1999), vol. 96, pp. 6745–6750
Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl. 1, 96–104
Singh et al. (2002) Prostate cancer data set
Short description: Analysis of prostate cancer tissues (1) and normal tissues (0).
Sample types: tumour, healthy
No. of genes: 2135 (preprocessed)
No. of samples: 102 (class 1: 52, class 0: 50)
Normalization: GeneChip RMA (GCRMA)
References:
D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D’Amico, J.P. Richie, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2): pp. 203–209, 2002
Z. Wu and R.A. Irizarry. Stochastic Models Inspired by Hybridization Theory for Short Oligonucleotide Arrays. Journal of Computational Biology, 12(6): pp. 882–893, 2005
Shipp et al. (2002) B-Cell Lymphoma data set
Short description: Analysis of diffuse large B-cell lymphoma samples (1) and follicular B-cell lymphoma samples (0).
Sample types: DLBCL, follicular
No. of genes: 2647 (preprocessed)
No. of samples: 77 (class 1: 58, class 0: 19)
Normalization: VSN (Huber et al., 2002)
References:
M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C.T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1): pp. 68–74, 2002
Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl. 1, 96–104
Shin et al. (2007) T-Cell Lymphoma data set
Short description: Analysis of cutaneous T-cell lymphoma (CTCL) samples from lesional skin biopsies. Samples are divided into lower-stage (stages IA and IB, 0) and higher-stage (stages IIB and III, 1) CTCL.
Sample types: lower_stage, higher_stage
No. of genes: 2922 (preprocessed)
No. of samples: 63 (class 1: 20, class 0: 43)
Normalization: VSN (Huber et al., 2002)
References:
J. Shin, S. Monti, D. J. Aires, M. Duvic, T. Golub, D. A. Jones and T. S. Kupper, Lesional gene expression profiling in cutaneous T-cell lymphoma reveals natural clusters associated with disease outcome. Blood, 110(8): p. 3015, 2007
Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl. 1, 96–104
Armstrong et al. (2002) Leukemia data set
Short description: Comparison of three classes of leukemia samples: acute lymphoblastic leukemia (ALL, 0), acute myelogenous leukemia (AML, 1) and ALL with mixed-lineage leukemia gene translocation (MLL, 2).
Sample types: ALL, AML, MLL
No. of genes: 8560 (preprocessed)
No. of samples: 72 (class 0: 24, class 1: 28, class 2: 20)
Normalization: VSN (Huber et al., 2002)
References:
S.A. Armstrong, J.E. Staunton, L.B. Silverman, R. Pieters, M.L. den Boer, M.D. Minden, S.E. Sallan, E.S. Lander, T.R. Golub, S.J. Korsmeyer; MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30(1): pp. 41–47, 2002
Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl. 1, 96–104
K-Means clustering algorithm
Short description: The k-means clustering algorithm (Hartigan and Wong, 1979) is one of the simplest partition-based clustering methods. The data points are partitioned into a given number of k clusters such that the sum of squares from the points to the closest cluster centre is minimized. The algorithm starts with a random cluster centre assignment and then iteratively repeats the following two-step procedure: first, all data points are assigned to the closest cluster centre; then the position of each cluster centre is updated by calculating the centroid of the data points in that cluster.
References:
J. A. Hartigan, M. A. Wong, A K-means clustering algorithm, J. R. Stat. Soc. Ser. C Appl. Stat., 28, pp. 100–108, 1979
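The two-step procedure described above can be illustrated with a minimal Python sketch. It is not the ArrayMining implementation and it is the plain assign-and-update iteration rather than the Hartigan-Wong variant cited; the function name `kmeans`, its parameters and the toy data are assumptions made for illustration.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain two-step k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: initial centres are k data points drawn at random.
    centres = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Step 1: assign every point to its closest centre
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres would not move
        labels = new_labels
        # Step 2: move each centre to the centroid of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centres

# Toy usage: two well-separated groups on a line.
toy = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
toy_labels, toy_centres = kmeans(toy, k=2)
```

Because the result depends on the random start, running the function with several seeds and keeping the solution with the smallest sum of squares is a common precaution.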
Partitioning Around Medoids
Short description: The "Partitioning Around Medoids" algorithm (PAM) is a robust alternative to the partition-based k-means clustering algorithm. PAM searches for k representative objects ("medoids") instead of cluster centres and minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances. In contrast to k-means, the algorithm also accepts a dissimilarity matrix as input. For a complete description of the PAM algorithm see chapter 2 of Kaufman and Rousseeuw (1990).
References:
L. Kaufman and P.J. Rousseeuw, Finding groups in data: An introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, New York, 1990
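The idea of searching for medoids on a dissimilarity matrix can be sketched as follows. This is a simplified build-and-swap search written for illustration, not the optimized procedure from Kaufman and Rousseeuw and not the code used by the website; the names `total_cost` and `pam` are hypothetical.

```python
def total_cost(dist, medoids):
    """Sum over all objects of the dissimilarity to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)

def pam(dist, k):
    """Simplified PAM on a square dissimilarity matrix (list of lists)."""
    n = len(dist)
    # BUILD: greedily add the object that lowers the total cost the most.
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates,
                           key=lambda i: total_cost(dist, medoids + [i])))
    # SWAP: exchange a medoid for a non-medoid while the cost decreases.
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids):
                    medoids, improved = trial, True
    # Each object is labelled with the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: row[medoids[j]]) for row in dist]
    return medoids, labels

# Toy usage: absolute differences between six values on a line.
values = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in values] for a in values]
toy_medoids, toy_labels = pam(toy_dist, k=2)
```

Only the matrix `dist` is needed, so any dissimilarity (for example one minus a correlation) can be plugged in without changing the function.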
Self-Organizing Maps
Short description: Kohonen's Self-Organizing Maps (SOM) is a partition-based clustering algorithm which combines clustering and dimension reduction. The data points are mapped into a low-dimensional grid based on their closeness to "prototype" grid points. A SOM can be generated in an online algorithm by iteratively moving prototype grid points that are close to a randomly chosen data point closer towards it based on a given learning rate parameter. A detailed explanation of the algorithm can be found in Kohonen's book "Self-Organizing Maps" (1995).
References:
Kohonen, T., Self-Organizing Maps. Springer-Verlag, 1995
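The online update described above can be sketched in a few lines of Python. The sketch assumes a one-dimensional grid, a Gaussian neighbourhood of fixed width and a linearly decaying step size; these choices, and the names `train_som` and `map_to_grid`, are illustrative assumptions rather than the settings used by ArrayMining.

```python
import math
import random

def train_som(data, grid_size, epochs=200, learning_rate=0.5,
              radius=1.0, seed=0):
    """Online training of a one-dimensional SOM; returns the prototypes."""
    rng = random.Random(seed)
    # Assumption: prototypes start at randomly chosen data points.
    protos = [list(rng.choice(data)) for _ in range(grid_size)]
    steps = epochs * len(data)
    for t in range(steps):
        x = rng.choice(data)            # randomly chosen data point
        decay = 1.0 - t / steps         # step size shrinks over time
        # Best-matching unit: the prototype closest to x in data space.
        bmu = min(range(grid_size),
                  key=lambda j: sum((a - b) ** 2
                                    for a, b in zip(x, protos[j])))
        for j in range(grid_size):
            # Neighbourhood weight uses the distance on the grid.
            h = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
            for d in range(len(x)):
                protos[j][d] += learning_rate * decay * h * (x[d] - protos[j][d])
    return protos

def map_to_grid(data, protos):
    """Index of the closest prototype for every data point."""
    return [min(range(len(protos)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, protos[j])))
            for x in data]
```

Calling `map_to_grid(data, train_som(data, grid_size=3))` yields one grid index per sample, which serves as the cluster label.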
Self-Organizing Tree Algorithm
Short description: The Self-Organizing Tree Algorithm (SOTA) builds an unsupervised neural network with a binary tree topology. The algorithm iteratively splits the node with the largest diversity into two child nodes ("cells") until the maximum number of clusters is reached. SOTA can be understood as a combination of Kohonen's Self-Organizing Maps (SOM) and hierarchical clustering, which enables the algorithm to capture a hierarchical structure in the data. Further details about the algorithm can be found in the reference paper (Herrero et al., 2001).
References:
Herrero, J., Valencia, A., and Dopazo, J. (2001). A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17, 126–136
Divisive Analysis Clustering
Short description: Divisive Analysis Clustering (DIANA) is a divisive hierarchical clustering algorithm, starting with a single cluster for all observations and iteratively dividing it. In each step the cluster with the largest diameter is chosen and divided by identifying the most disparate observation, moving it to a newly created cluster and reassigning the observations that are closer to this new cluster than to their old one. A complete description is available in chapter 6 of Kaufman and Rousseeuw (1990).
References:
L. Kaufman and P.J. Rousseeuw, Finding groups in data: An introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, New York, 1990
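A minimal sketch of the splitting step described above, assuming that "most disparate" means the largest average dissimilarity to the rest of the cluster and that "closer" is judged by average dissimilarity. The function names `diana_split` and `diana` are hypothetical, and the sketch stops at a requested number of clusters instead of building the full hierarchy.

```python
def diana_split(dist, cluster):
    """Split one cluster (list of indices) into (remaining, splinter)."""
    def avg(i, group):
        others = [j for j in group if j != i]
        return sum(dist[i][j] for j in others) / len(others) if others else 0.0

    remaining = list(cluster)
    # Seed the new cluster with the most disparate observation.
    seed = max(remaining, key=lambda i: avg(i, remaining))
    remaining.remove(seed)
    splinter = [seed]
    moved = True
    while moved and len(remaining) > 1:
        moved = False
        # Positive gain: on average closer to the new cluster than the old.
        gain = {i: avg(i, remaining) - avg(i, splinter) for i in remaining}
        best = max(gain, key=gain.get)
        if gain[best] > 0:
            remaining.remove(best)
            splinter.append(best)
            moved = True
    return remaining, splinter

def diana(dist, k):
    """Divide top-down until k clusters exist (or nothing can be split)."""
    clusters = [list(range(len(dist)))]
    while len(clusters) < k:
        # Diameter = largest dissimilarity between two members.
        target = max(clusters,
                     key=lambda c: max(dist[i][j] for i in c for j in c))
        if len(target) < 2:
            break
        clusters.remove(target)
        clusters.extend(diana_split(dist, target))
    return clusters
```

For six values on a line, `[0, 1, 2, 10, 11, 12]`, with absolute differences as dissimilarities, `diana(dist, 2)` separates the three small values from the three large ones.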
Hybrid Hierarchical Clustering
Short description: Hybrid Hierarchical Clustering is a combination of agglomerative and divisive hierarchical clustering. The algorithm first uses a bottom-up hierarchical clustering to identify maximum "mutual clusters", where a mutual cluster is a group of data points such that the largest distance between any pair among them is shorter than the shortest distance to any point outside the group. In the second step, top-down clustering is applied to the data, limited by the constraint that mutual clusters may not be broken. A detailed description of the algorithm is given in Chipman and Tibshirani's paper (2006).
References:
H. Chipman, R. Tibshirani, Hybrid Hierarchical Clustering with Applications to Microarray Data, Biostatistics, 7, pp. 302–317
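The definition of a mutual cluster given above translates directly into a check on a distance matrix. The sketch below tests the definition and, for tiny inputs only, enumerates the maximal mutual clusters by brute force; the paper derives them from a bottom-up clustering instead, so the enumeration and the function names are assumptions made for illustration. Groups with fewer than two points, or groups covering the whole data set, are treated as non-mutual by convention here.

```python
from itertools import combinations

def is_mutual_cluster(dist, group):
    """True if the group's largest internal distance is smaller than
    its smallest distance to any point outside the group."""
    group = set(group)
    outside = [j for j in range(len(dist)) if j not in group]
    if len(group) < 2 or not outside:
        return False  # convention of this sketch
    diameter = max(dist[i][j] for i, j in combinations(group, 2))
    gap = min(dist[i][j] for i in group for j in outside)
    return diameter < gap

def maximal_mutual_clusters(dist):
    """Brute-force enumeration; exponential, for very small inputs only."""
    n = len(dist)
    found = [set(g)
             for size in range(2, n)
             for g in combinations(range(n), size)
             if is_mutual_cluster(dist, g)]
    # Keep only groups that are not contained in a larger mutual cluster.
    return [g for g in found if not any(g < other for other in found)]

# Toy usage: absolute differences between six values on a line.
values = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in values] for a in values]
toy_mutual = maximal_mutual_clusters(toy_dist)
```

In the second, top-down step, each group found this way would be treated as an indivisible unit.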
Hierarchical clustering
Short description: Hierarchical clustering (HCL) is one of the most popular and traditional clustering methods and provides easily interpretable tree visualizations (dendrograms) of the clustering results. Agglomerative variants of the HCL algorithm iteratively combine the clusters with the smallest between-cluster dissimilarity (BCD), whereas divisive variants of HCL iteratively divide the cluster with maximum BCD. In contrast to most partition-based clustering algorithms, HCL does not require the user to specify the number of clusters as an input parameter, but the method is limited by the assumption that the data has a hierarchical structure. For more details about HCL and different BCD measures see Hartigan, "Clustering Algorithms" (1975).
References:
J. A. Hartigan (1975). Clustering Algorithms. New York: Wiley
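The agglomerative variant can be sketched as repeated merging of the two clusters with the smallest BCD. In this illustration the BCD is computed from the pairwise dissimilarities by a `linkage` function (`max` for complete linkage, `min` for single linkage); that parameter, the function name `agglomerative` and the stopping rule at `k` clusters are assumptions of the sketch, not part of the source.

```python
def agglomerative(dist, k=1, linkage=max):
    """Merge clusters bottom-up until k remain.

    Returns the final clusters and the merge history
    (left cluster, right cluster, BCD at which they were merged).
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # BCD between clusters a and b under the chosen linkage.
                bcd = linkage(dist[i][j]
                              for i in clusters[a] for j in clusters[b])
                if best is None or bcd < best[0]:
                    best = (bcd, a, b)
        bcd, a, b = best
        merges.append((clusters[a], clusters[b], bcd))
        clusters[a] = clusters[a] + clusters[b]   # merged cluster
        del clusters[b]
    return clusters, merges

# Toy usage: absolute differences between six values on a line.
values = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in values] for a in values]
toy_clusters, toy_merges = agglomerative(toy_dist, k=2)
```

With `k=1` the merge history describes the full dendrogram, and the BCD values in it are the heights at which branches join.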
Combination of all methods
Short description: The "ALL" option applies all implemented clustering algorithms (k-Means, PAM, SOM, SOTA, HCL, DIANA and AGNES) individually to the selected data set. It provides tables and plots for different cluster validity measures to compare the results for different clustering algorithms and different numbers of clusters. This option can help the user to identify the optimal number of clusters and choose the most suitable clustering method for a specific data set. Please note that the runtime for the "ALL" option is longer than for any single clustering algorithm.
References: (see references for single methods)
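To illustrate how a validity measure can rank competing clusterings, the sketch below computes the mean silhouette width for a given labelling. The text above does not name the measures used by the "ALL" option, so the choice of the silhouette width and the function name `mean_silhouette` are assumptions made for illustration.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette width for a labelling with at least two clusters.

    dist is a square dissimilarity matrix, labels a list of cluster ids.
    """
    n = len(dist)
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue  # singleton clusters contribute 0 by convention
        # a: average dissimilarity to the other members of the own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest average dissimilarity to any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Toy usage: compare two candidate labellings of the same six samples.
values = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in values] for a in values]
score_good = mean_silhouette(toy_dist, [0, 0, 0, 1, 1, 1])
score_poor = mean_silhouette(toy_dist, [0, 1, 0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, so the labelling (and number of clusters) with the highest score would be preferred under this measure.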
Arraymining.net - Newsletter
Stay informed about updates and new features on our website by joining our newsletter. Your email address remains strictly confidential and will only be used to inform you about major updates of our web service (at most one email per month). You can unsubscribe at any time by clicking on the unsubscribe link at the bottom of our emails.
Gene Expression Omnibus (GEO) database
Short description: The Gene Expression Omnibus (GEO) database is the largest public microarray database, containing samples from more than 150,000 studies. Every GDS entry in GEO can be analysed on ArrayMining.net based on the corresponding identifier. Please note that this feature is still experimental and is likely to require longer runtimes than the analysis of pre-normalized data sets.
References:
Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. NCBI GEO: mining tens of millions of expression profiles - database and tools update. Nucleic Acids Res, 35, D760, 2007
Class Discovery Analysis (Unsupervised Learning)
The module below allows you to perform a clustering analysis for a pre-normalized input matrix.
To obtain instructions click help.