Identifying clusters in genomics data by recursive partitioning

  • Gro Nilsen, Ørnulf Borgan, Knut Liestøl and Ole Christian Lingjærde
Published/Copyright: August 13, 2013

Abstract

Genomics studies frequently involve clustering of molecular data to identify groups, but common clustering methods such as K-means clustering and hierarchical clustering do not determine the number of clusters. Methods for estimating the number of clusters typically focus on identifying the global structure in the data; however, the discovery of substructures within clusters may also be of great biological interest. We propose a novel method, Partitioning Algorithm based on Recursive Thresholding (PART), that recursively uncovers distinct subgroups in the groups already identified. Outliers are common in high-dimensional genomics data and may mask the presence of substructure within a cluster. A crucial feature of the algorithm is the introduction of tentative splits of clusters to isolate outliers that might otherwise halt the recursion prematurely. The method is demonstrated on simulated data as well as on a wide range of real data sets from gene expression microarrays in which the correct clusters were known in advance. When subclusters are present and the variance is large or varies between the clusters, the proposed method performs better than two established global methods on simulated data. On the real data sets, the overall performance of PART is superior to that of the global methods when used in combination with hierarchical clustering. The method is implemented in the R package clusterGenomics and is freely available from CRAN (The Comprehensive R Archive Network).
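
To make the idea of recursive partitioning with outlier isolation concrete, the following is a minimal Python sketch. It is a toy illustration only, not the PART algorithm and not the clusterGenomics API. The abstract does not specify the stopping rule, so a hypothetical threshold `min_gain` on the relative reduction in within-cluster sum of squares stands in for it, a simple 2-means split stands in for the underlying clustering method, and a child smaller than the hypothetical `min_size` is simply set aside as outliers before the recursion continues on the remainder. All function and parameter names are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def wss(points):
    """Within-cluster sum of squares around the centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def two_means(points, iters=20):
    """Split points in two with 2-means, using a deterministic start."""
    c = centroid(points)
    a = max(points, key=lambda p: sq_dist(p, c))   # farthest from the centroid
    b = max(points, key=lambda p: sq_dist(p, a))   # farthest from that point
    centers = [a, b]
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearer_first = sq_dist(p, centers[0]) <= sq_dist(p, centers[1])
            groups[0 if nearer_first else 1].append(p)
        if not groups[0] or not groups[1]:
            break
        new_centers = [centroid(groups[0]), centroid(groups[1])]
        if new_centers == centers:
            break
        centers = new_centers
    return groups

def recursive_partition(points, min_size=3, min_gain=0.5):
    """Return (clusters, outliers). Stand-in stopping rule, not PART's."""
    if len(points) < 2 * min_size:
        return [points], []
    left, right = two_means(points)
    small, large = sorted((left, right), key=len)
    if not small:
        return [points], []
    if len(small) < min_size:
        # Tentative split: set the tiny group aside and keep searching the rest.
        clusters, outliers = recursive_partition(large, min_size, min_gain)
        return clusters, small + outliers
    total = wss(points)
    gain = 1 - (wss(left) + wss(right)) / total if total > 0 else 0.0
    if gain < min_gain:
        return [points], []          # split not convincing enough: stop here
    clusters, outliers = [], []
    for child in (left, right):
        c, o = recursive_partition(child, min_size, min_gain)
        clusters += c
        outliers += o
    return clusters, outliers

# Made-up example data: two tight groups and one distant point.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
clusters, outliers = recursive_partition(group_a + group_b + [(100, 100)])
print(len(clusters), outliers)
```

In this made-up example the first split separates only the distant point. Without the tentative split, a size-based rule would stop immediately and report a single cluster; setting the point aside lets the recursion go on to find the two groups.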


Corresponding author: Ole Christian Lingjærde, Biomedical Informatics, Department of Informatics, University of Oslo, Postboks 1080 Blindern, 0316 Oslo, Norway; Centre for Cancer Biomedicine, University of Oslo, Norway; and K.G. Jebsen Centre for Breast Cancer Research, Oslo University Hospital, Oslo, Norway

Published Online: 2013-08-13
Published in Print: 2013-10-01

©2013 by Walter de Gruyter Berlin Boston

Downloaded on 4.11.2025 from https://www.degruyterbrill.com/document/doi/10.1515/sagmb-2013-0016/pdf