Recursively partitioned mixture model clustering of DNA methylation data using biologically informed correlation structures

Devin C. Koestler; Brock C. Christensen; Carmen J. Marsit; Karl T. Kelsey; E. Andres Houseman

doi:10.1515/sagmb-2012-0068

Published/Copyright: March 5, 2013

From the journal Statistical Applications in Genetics and Molecular Biology, Volume 12, Issue 2

Abstract

DNA methylation is a well-recognized epigenetic mechanism that has been the subject of a growing body of literature, typically focused on identifying and studying profiles of DNA methylation and their association with human diseases and exposures. In recent years, a number of unsupervised clustering algorithms, both parametric and non-parametric, have been proposed for clustering large-scale DNA methylation data. However, most of these approaches do not incorporate known biological relationships among the measured features and, in some cases, rely on unrealistic assumptions regarding the nature of DNA methylation. Here, we propose a modified version of a recursively partitioned mixture model (RPMM) that integrates information on the proximity of CpG loci within the genome to inform the correlation structures on which subsequent clustering analysis is based. Using simulations and four methylation data sets, we demonstrate that integrating biologically informative correlation structures within RPMM resulted in improved goodness-of-fit, clustering consistency, and ability to detect biologically meaningful clusters compared to methods that ignore such correlation. Integrating biologically informed correlation structures to enhance modeling techniques is motivated by the rapid increase in resolution of DNA methylation microarrays and the increasing understanding of the biology of this epigenetic mechanism.

Keywords: finite mixture models; epigenetics; genomic data; model-based clustering

Corresponding author: Devin C. Koestler, Department of Community and Family Medicine, Geisel School of Medicine at Dartmouth, 1 Medical Center Dr., Lebanon, NH 03756, USA, Tel.: +1 7166736961

Appendix

The formula below makes explicit two facts about the Euclidean metric: (1) it remains unaffected by autocorrelated loci (since its expectation depends on the variance-covariance matrix only through the diagonal); and (2) it is influenced by all loci, including those that are non-informative and possibly noisy (with noisy loci contributing the most, even if they are not informative).

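For independent random vectors Y₁ and Y₂ with mean vectors μ₁ and μ₂ and variance-covariance matrices Σ₁ and Σ₂, E‖Y₁ − Y₂‖² = ‖μ₁ − μ₂‖² + tr(Σ₁) + tr(Σ₂).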
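The two facts about the Euclidean metric can be checked numerically. The sketch below is an illustration only, not the authors' code: it assumes Gaussian profiles with a hypothetical AR(1)-type correlation across loci, and all names and parameter values (`MU1`, `MU2`, `SD`, `rho`) are invented for the example. It compares a Monte Carlo estimate of the expected squared distance with the standard closed form ‖μ₁ − μ₂‖² + tr(Σ₁) + tr(Σ₂), once with uncorrelated loci and once with strongly autocorrelated loci.

```python
import math
import random


def closed_form(mu1, mu2, sd1, sd2):
    """||mu1 - mu2||^2 + tr(Sigma1) + tr(Sigma2); only diagonal variances enter."""
    mean_part = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    trace_part = sum(s ** 2 for s in sd1) + sum(s ** 2 for s in sd2)
    return mean_part + trace_part


def draw_profile(mu, sd, rho, rng):
    """One profile whose loci have correlation rho**|i-j| (assumed AR(1) form)."""
    z = rng.gauss(0.0, 1.0)
    profile = []
    for m, s in zip(mu, sd):
        profile.append(m + s * z)
        # Next standardized value: keeps unit variance, correlation rho with z.
        z = rho * z + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return profile


def monte_carlo(mu1, mu2, sd1, sd2, rho, n_draws, seed):
    """Average squared Euclidean distance between independent profile pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y1 = draw_profile(mu1, sd1, rho, rng)
        y2 = draw_profile(mu2, sd2, rho, rng)
        total += sum((a - b) ** 2 for a, b in zip(y1, y2))
    return total / n_draws


# Hypothetical setup: two informative loci, two quiet non-informative loci,
# and one noisy non-informative locus (last entry of SD).
MU1 = [0.8, 0.8, 0.5, 0.5, 0.5]
MU2 = [0.5, 0.5, 0.5, 0.5, 0.5]
SD = [0.1, 0.1, 0.1, 0.1, 0.3]

EXPECTED = closed_form(MU1, MU2, SD, SD)
MC_INDEPENDENT = monte_carlo(MU1, MU2, SD, SD, rho=0.0, n_draws=20000, seed=1)
MC_CORRELATED = monte_carlo(MU1, MU2, SD, SD, rho=0.9, n_draws=20000, seed=2)

print(EXPECTED, MC_INDEPENDENT, MC_CORRELATED)
```

In this setup the single noisy locus contributes 2 × 0.3² = 0.18 to the expected squared distance, as much as the two informative loci combined, even though it carries no information about group membership; changing `rho` leaves the expectation unchanged.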

With δ_j = 1(θ_{1j} = θ_{2j}), the following equations make clear that in correctly specified mixture models, non-informative loci have no influence on classification (via posterior class membership probability):

Consequently, the terms for loci with δ_j = 1 are common to both classes and factor out of the empirical Bayes formula for classification via posterior class membership probability.
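A minimal numerical sketch of this cancellation follows. It is not the authors' implementation: it assumes a two-class mixture with beta-distributed loci that are independent given class membership, and the function names, parameter values, and observed profile are hypothetical. Posterior class membership probabilities are computed once from the informative loci alone and once with additional non-informative loci whose parameters are identical in both classes.

```python
import math


def beta_logpdf(y, a, b):
    """Log-density of a Beta(a, b) distribution at y in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y) - log_norm


def posterior(y, class_params, priors):
    """Posterior class probabilities, assuming independent loci given class."""
    log_post = []
    for params, prior in zip(class_params, priors):
        loglik = sum(beta_logpdf(v, a, b) for v, (a, b) in zip(y, params))
        log_post.append(math.log(prior) + loglik)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


PRIORS = [0.4, 0.6]

# Informative loci: beta parameters differ between the two classes.
INFORMATIVE = [
    [(8.0, 2.0), (2.0, 8.0)],  # class 1
    [(2.0, 8.0), (8.0, 2.0)],  # class 2
]
# Non-informative loci: identical, diffuse parameters in both classes.
SHARED = [(1.5, 1.5), (1.5, 1.5), (1.5, 1.5)]
FULL = [params + SHARED for params in INFORMATIVE]

Y_INFORMATIVE = [0.7, 0.2]
Y_FULL = Y_INFORMATIVE + [0.05, 0.95, 0.5]  # extreme values at shared loci

POST_INFORMATIVE = posterior(Y_INFORMATIVE, INFORMATIVE, PRIORS)
POST_FULL = posterior(Y_FULL, FULL, PRIORS)

print(POST_INFORMATIVE, POST_FULL)
```

The shared loci add the same constant to each class's log-likelihood, so the two posterior vectors agree up to floating-point rounding regardless of how extreme the values observed at those loci are.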

Code for implementing the proposed methods was written in the R statistical language (http://cran.r-project.org/) and can be found on the first author’s website (http://bio-epi.hitchcock.org/faculty/koestler.html). Instructions for downloading and usage are provided there.

Published Online: 2013-03-05
