Recursively partitioned mixture model clustering of DNA methylation data using biologically informed correlation structures
Abstract
DNA methylation is a well-recognized epigenetic mechanism that has been the subject of a growing body of literature, typically focused on identifying and studying profiles of DNA methylation and their association with human diseases and exposures. In recent years, a number of unsupervised clustering algorithms, both parametric and non-parametric, have been proposed for clustering large-scale DNA methylation data. However, most of these approaches do not incorporate known biological relationships among the measured features and, in some cases, rely on unrealistic assumptions regarding the nature of DNA methylation. Here, we propose a modified version of a recursively partitioned mixture model (RPMM) that integrates information on the proximity of CpG loci within the genome to inform the correlation structures on which subsequent clustering analysis is based. Using simulations and four methylation data sets, we demonstrate that integrating biologically informative correlation structures within RPMM improves goodness-of-fit, clustering consistency, and the ability to detect biologically meaningful clusters compared to methods that ignore such correlation. Integrating biologically informed correlation structures to enhance modeling techniques is motivated by the rapid increase in resolution of DNA methylation microarrays and the increasing understanding of the biology of this epigenetic mechanism.
Appendix
The formula below makes explicit two facts about the Euclidean metric: (1) it remains unaffected by autocorrelated loci (since its expectation depends on the variance-covariance matrix only through the diagonal); and (2) it is influenced by all loci, including those that are non-informative and possibly noisy (with noisy loci contributing the most, even if they are not informative).
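For independent random vectors Y1 and Y2 with mean vectors μ1 and μ2 and variance-covariance matrices Σ1 and Σ2, the expected squared Euclidean distance is

\[
E\|Y_1 - Y_2\|^2 = \|\mu_1 - \mu_2\|^2 + \operatorname{tr}(\Sigma_1) + \operatorname{tr}(\Sigma_2) = \sum_j (\mu_{1j} - \mu_{2j})^2 + \sum_j \left(\sigma^2_{1j} + \sigma^2_{2j}\right),
\]

where σ²kj denotes the j-th diagonal element of Σk.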
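As a numerical illustration of the two facts about the Euclidean metric, the following Python sketch simulates pairs of independent vectors and compares the average squared Euclidean distance with the quantity ||μ1 − μ2||² + tr(Σ1) + tr(Σ2). It is only a sketch under stated assumptions: Gaussian loci, a common exchangeable correlation rho within each vector, and illustrative means and standard deviations; none of these choices, nor the function names, come from the methods described in this paper.

```python
import random


def draw(mu, sd, rho, rng):
    """Draw one vector with per-locus means mu and standard deviations sd.

    A shared Gaussian factor gives every pair of loci correlation rho
    (assumed 0 <= rho < 1) without changing the per-locus variances.
    """
    shared = rng.gauss(0.0, 1.0)
    return [
        m + s * (rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0))
        for m, s in zip(mu, sd)
    ]


def mean_sq_dist(mu1, mu2, sd, rho, n_rep=20000, seed=1):
    """Monte Carlo estimate of E||Y1 - Y2||^2 for independent Y1 and Y2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        y1 = draw(mu1, sd, rho, rng)
        y2 = draw(mu2, sd, rho, rng)
        total += sum((a - b) ** 2 for a, b in zip(y1, y2))
    return total / n_rep


def expected_sq_dist(mu1, mu2, sd):
    """||mu1 - mu2||^2 + tr(Sigma1) + tr(Sigma2), with Sigma1 = Sigma2 here."""
    mean_part = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    return mean_part + 2.0 * sum(s ** 2 for s in sd)


# Illustrative values (assumptions): locus 0 is informative, locus 3 is
# non-informative but noisy, loci 1 and 2 are non-informative and quiet.
mu1 = [0.4, 0.5, 0.5, 0.5]
mu2 = [0.6, 0.5, 0.5, 0.5]
sd = [0.05, 0.05, 0.05, 0.30]

theory = expected_sq_dist(mu1, mu2, sd)
sim_indep = mean_sq_dist(mu1, mu2, sd, rho=0.0)  # uncorrelated loci
sim_corr = mean_sq_dist(mu1, mu2, sd, rho=0.8)   # strongly correlated loci
noise_share = 2.0 * sd[3] ** 2 / theory          # share due to the noisy locus

print(theory, sim_indep, sim_corr, noise_share)
```

Under these assumptions, both simulated averages should lie close to the theoretical value regardless of rho, and the single noisy, non-informative locus should account for most of the expected distance.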
With δj = 1(θ1j = θ2j) indicating that locus j has identical parameters in classes 1 and 2, the following argument makes clear that in correctly specified mixture models, non-informative loci have no influence on classification (via posterior class membership probability). Terms that depend only on loci with δj = 1 are common to both classes. Consequently, these terms factor out of the empirical Bayes formula for classification via posterior class membership probability.
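The factoring argument can be checked numerically with a minimal Python sketch. It assumes a two-class mixture in which loci are conditionally independent within a class and follow beta distributions; the parameter values, the mixing proportion, and the helper names `beta_logpdf` and `posterior_class1` are illustrative assumptions rather than the implementation used in this paper, and the sketch deliberately ignores any within-class correlation between loci.

```python
import math


def beta_logpdf(y, a, b):
    """Log density of a Beta(a, b) distribution at y, for 0 < y < 1."""
    return (
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)
    )


def posterior_class1(y, theta1, theta2, pi1=0.5, loci=None):
    """Posterior probability of class 1 using only the loci listed in `loci`.

    Assumes loci are independent within each class, so each class
    log-likelihood is a sum of per-locus beta log densities.
    """
    if loci is None:
        loci = range(len(y))
    log1 = math.log(pi1) + sum(beta_logpdf(y[j], *theta1[j]) for j in loci)
    log2 = math.log(1.0 - pi1) + sum(beta_logpdf(y[j], *theta2[j]) for j in loci)
    m = max(log1, log2)  # subtract the maximum for numerical stability
    w1, w2 = math.exp(log1 - m), math.exp(log2 - m)
    return w1 / (w1 + w2)


# Illustrative (a, b) parameters per locus: only locus 0 differs by class.
theta1 = [(2.0, 8.0), (5.0, 5.0), (1.5, 1.5)]
theta2 = [(8.0, 2.0), (5.0, 5.0), (1.5, 1.5)]

# delta_j = 1 when locus j has the same parameters in both classes.
delta = [int(t1 == t2) for t1, t2 in zip(theta1, theta2)]
informative = [j for j, d in enumerate(delta) if d == 0]

y = [0.35, 0.90, 0.02]  # one observed methylation profile (assumed values)
p_all = posterior_class1(y, theta1, theta2)
p_informative = posterior_class1(y, theta1, theta2, loci=informative)

print(delta, p_all, p_informative)
```

Because the densities at loci with δj = 1 are identical in both classes, they multiply the numerator and denominator of the posterior by the same factor, so the two posterior probabilities agree up to floating-point error even though the values observed at the non-informative loci are extreme.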
Code for implementing the proposed methods was written in the R statistical language (http://cran.r-project.org/) and can be found on the first author’s website (http://bio-epi.hitchcock.org/faculty/koestler.html). Instructions for downloading and usage are provided there.
©2013 by Walter de Gruyter Berlin Boston
Articles in this issue
- Approximate Bayesian computation (ABC) gives exact results under the assumption of model error
- Modeling the DNA copy number aberration patterns in observational high-throughput cancer data
- Exploring the sampling universe of RNA-seq
- Detection of epigenetic changes using ANOVA with spatially varying coefficients
- Robustness of chemometrics-based feature selection methods in early cancer detection and biomarker discovery
- Recursively partitioned mixture model clustering of DNA methylation data using biologically informed correlation structures
- A novel method for analyzing genetic association with longitudinal phenotypes
- Two optimization strategies of multi-stage design in clinical proteomic studies