Abstract
Genomics studies frequently involve clustering of molecular data to identify groups, but common clustering methods such as K-means clustering and hierarchical clustering do not determine the number of clusters. Methods for estimating the number of clusters typically focus on identifying the global structure in the data; however, the discovery of substructures within clusters may also be of great biological interest. We propose a novel method, Partitioning Algorithm based on Recursive Thresholding (PART), that recursively uncovers distinct subgroups in the groups already identified. Outliers are common in high-dimensional genomics data and may mask the presence of substructure within a cluster. A crucial feature of the algorithm is the introduction of tentative splits of clusters to isolate outliers that might otherwise halt the recursion prematurely. The method is demonstrated on simulated data and on a wide range of real gene expression microarray data sets for which the correct clusters were known in advance. When subclusters are present and the variance is large or varies between the clusters, the proposed method performs better than two established global methods on simulated data. On the real data sets, the overall performance of PART is superior to that of the global methods when used in combination with hierarchical clustering. The method is implemented in the R package clusterGenomics and is freely available from CRAN (The Comprehensive R Archive Network).
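The idea of recursive splitting with tentative splits can be illustrated with a minimal Python sketch. It is not the published PART algorithm or the clusterGenomics API. It assumes one-dimensional data, an exhaustive two-way split, and a hypothetical stopping rule (a minimum cluster size `min_size` and a sum-of-squares ratio `threshold`), all of which stand in for the clustering method and thresholding criterion used in the paper. The parameter `max_tentative` is likewise a made-up name for the number of non-significant splits the recursion may look past.

```python
from statistics import mean


def sse(xs):
    """Within-group sum of squared deviations from the group mean."""
    if not xs:
        return 0.0
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def best_two_split(xs):
    """Split sorted 1-D values in two so that the total within-group SSE is minimal."""
    xs = sorted(xs)
    cut = min(range(1, len(xs)), key=lambda c: sse(xs[:c]) + sse(xs[c:]))
    return xs[:cut], xs[cut:]


def is_significant(parent, a, b, min_size, threshold):
    """Hypothetical stopping rule (an assumption, not the rule from the paper):
    both parts must be large enough and the split must reduce the SSE sharply."""
    if min(len(a), len(b)) < min_size:
        return False
    total = sse(parent)
    return total > 0 and (sse(a) + sse(b)) / total < threshold


def partition(xs, min_size=3, threshold=0.2, max_tentative=2):
    """Recursively partition xs and return a list of clusters (lists of values)."""
    if len(xs) < 2 * min_size:
        return [sorted(xs)]
    a, b = best_two_split(xs)
    if is_significant(xs, a, b, min_size, threshold):
        return (partition(a, min_size, threshold, max_tentative)
                + partition(b, min_size, threshold, max_tentative))
    if max_tentative > 0:
        # Tentative split: set the smaller part aside (possible outliers)
        # and check whether the larger part contains significant substructure.
        small, large = sorted((a, b), key=len)
        sub = partition(large, min_size, threshold, max_tentative - 1)
        if len(sub) > 1:
            # Substructure was found, so the tentative split is kept.
            return [small] + sub
    # No substructure found: discard any tentative split and stop here.
    return [sorted(xs)]


# Two tight groups plus one extreme value.
data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3, 100.0]
with_tentative = partition(data)
without_tentative = partition(data, max_tentative=0)
```

In this toy example the best first split only isolates the extreme value 100.0, which the stopping rule rejects because one part is too small. With `max_tentative=0` the recursion stops there and returns a single cluster. With tentative splits enabled, the extreme value is set aside, the remaining eight values split into two groups of four, and the tentative split is retained.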
©2013 by Walter de Gruyter Berlin Boston
Articles in the same Issue
- Masthead
 - Masthead
 - Research Articles
 - Simultaneous inference and clustering of transcriptional dynamics in gene regulatory networks
 - Markov chain Monte Carlo sampling of gene genealogies conditional on unphased SNP genotype data
 - Performance and estimation of the true error rate of classification rules built with additional information. An application to a cancer trial
 - Optimizing threshold-schedules for sequential approximate Bayesian computation: applications to molecular systems
 - Model selection for prognostic time-to-event gene signature discovery with applications in early breast cancer data
 - Identifying clusters in genomics data by recursive partitioning
 