Abstract
Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with the high dimension of the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not yet been fully evaluated. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, their statistical performance was slightly lower than, though comparable to, that of non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departure from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
Funding source: Agence Nationale de la Recherche
Award Identifier / Grant number: ANR-11-LABX-0025-01
Award Identifier / Grant number: ANR-15-IDEX-02
Award Identifier / Grant number: ANR-18-CE36-0005
Acknowledgments
This article was developed in the framework of the Grenoble Alpes Data Institute, supported by the French National Research Agency under the Investissements d’Avenir program (ANR-15-IDEX-02). It received support from LabEx PERSYVAL Lab, ANR-11-LABX-0025-01, and from the French National Research Agency (Agence Nationale de la Recherche) ETAPE, ANR-18-CE36-0005. The EDEN mother-child study was supported by Foundation for medical research (FRM), National Agency for Research (ANR), National Institute for Research in Public health (IRESP), French Ministry of Health (DGS), French Ministry of Research, INSERM Bone and Joint Diseases National Research (PRO-A), and Human Nutrition National Research Programs, Nestlé, French National Institute for Population Health Surveillance (InVS), French National Institute for Health Education (INPES), the European Union FP7 programmes (FP7/2007–2013, HELIX, ESCAPE, ENRIECO, Medall projects), Diabetes National Research Program, French Agency for Environmental Health Safety (ANSES), Mutuelle Générale de l’Education Nationale (MGEN), French national agency for food security, French-speaking association for the study of diabetes and metabolism (ALFEDIAM). We thank all the participants and members of the EDEN mother-child cohort study group.
- Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
- Research funding: None declared.
- Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
Supplementary Material
The online version of this article offers supplementary material (https://doi.org/10.1515/sagmb-2021-0035).
© 2022 Walter de Gruyter GmbH, Berlin/Boston