Sparse latent factor regression models for genome-wide and epigenome-wide association studies

Basile Jumentier, Kevin Caye, Barbara Heude, Johanna Lepeule and Olivier François
Published/Copyright: March 7, 2022

Abstract

Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with the high dimensionality of the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not yet been fully evaluated. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower than, though comparable to, that of non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departures from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
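To make the idea of jointly estimating sparse effect sizes and confounding factors concrete, the following is a minimal illustrative sketch and not the authors' algorithm, which is not reproduced on this page. It assumes a single variable of interest x, a marker matrix Y, an L1 penalty on the effect sizes and a fixed number of latent factors, and it alternates a soft-thresholded least-squares update with a truncated singular value decomposition. The function names, the objective and the parameters `lam` and `n_factors` are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(z, lam):
    """Elementwise soft-thresholding, the proximal operator of lam * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def objective(Y, x, b, W, lam):
    """Assumed objective: 0.5 * ||Y - x b^T - W||_F^2 + lam * ||b||_1."""
    resid = Y - np.outer(x, b) - W
    return 0.5 * np.sum(resid ** 2) + lam * np.sum(np.abs(b))


def sparse_latent_factor_fit(Y, x, n_factors, lam, n_iter=20):
    """Alternate a sparse effect-size update and a low-rank latent update.

    Y: (n, p) marker matrix; x: (n,) variable of interest.
    Returns effect sizes b (p,), latent matrix W (n, p) of rank <= n_factors,
    and the objective value recorded after each full iteration.
    """
    n, p = Y.shape
    b = np.zeros(p)
    W = np.zeros((n, p))
    xtx = float(x @ x)
    history = []
    for _ in range(n_iter):
        # Effect sizes: exact L1-penalized least squares for one covariate.
        b = soft_threshold(x @ (Y - W), lam) / xtx
        # Latent part: best rank-K approximation of the residual
        # (truncated SVD, by the Eckart-Young theorem).
        U, s, Vt = np.linalg.svd(Y - np.outer(x, b), full_matrices=False)
        W = (U[:, :n_factors] * s[:n_factors]) @ Vt[:n_factors]
        history.append(objective(Y, x, b, W, lam))
    return b, W, history


if __name__ == "__main__":
    # Simulated data: 5 markers with non-null effects, 2 hidden confounders.
    rng = np.random.default_rng(0)
    n, p, K = 50, 200, 2
    x = rng.normal(size=n)
    b_true = np.zeros(p)
    b_true[:5] = 2.0
    confounding = rng.normal(size=(n, K)) @ rng.normal(size=(K, p))
    Y = np.outer(x, b_true) + confounding + rng.normal(size=(n, p))
    b_hat, W_hat, hist = sparse_latent_factor_fit(Y, x, n_factors=K, lam=10.0)
    print("non-zero effect sizes:", int(np.count_nonzero(b_hat)))
```

Because each block update is an exact minimizer of the assumed objective with the other block held fixed, the recorded objective values cannot increase from one iteration to the next. In practice the penalty level and the number of factors would need to be chosen from the data, which this sketch does not attempt.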


Corresponding authors: Johanna Lepeule, Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institute for Advanced Biosciences, INSERM U 1209, CNRS UMR 5309, Université Grenoble-Alpes, Grenoble, 38000, France; and Olivier François, Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, Université Grenoble-Alpes, Grenoble, 38000, France; and Inria Grenoble, Equipe Statify, Laboratoire Jean Kuntzmann, Rhône-Alpes Inovallée 655 Avenue de l’Europe - CS 90051, Montbonnot, 38334, France

Award Identifier / Grant number: ANR-11-LABX-0025-01

Award Identifier / Grant number: ANR-15-IDEX-02

Award Identifier / Grant number: ANR-18-CE36-0005

Acknowledgments

This article was developed in the framework of the Grenoble Alpes Data Institute, supported by the French National Research Agency under the Investissements d’Avenir program (ANR-15-IDEX-02). It received support from LabEx PERSYVAL Lab, ANR-11-LABX-0025-01, and from the French National Research Agency (Agence Nationale pour la Recherche) ETAPE, ANR-18-CE36-0005. The EDEN mother-child study was supported by Foundation for medical research (FRM), National Agency for Research (ANR), National Institute for Research in Public health (IRESP), French Ministry of Health (DGS), French Ministry of Research, INSERM Bone and Joint Diseases National Research (PRO-A), and Human Nutrition National Research Programs, Nestlé, French National Institute for Population Health Surveillance (InVS), French National Institute for Health Education (INPES), the European Union FP7 programmes (FP7/2007–2013, HELIX, ESCAPE, ENRIECO, Medall projects), Diabetes National Research Program, French Agency for Environmental Health Safety (ANSES), Mutuelle Générale de l’Education Nationale (MGEN), French national agency for food security, French-speaking association for the study of diabetes and metabolism (ALFEDIAM). We thank all the participants and members of the EDEN mother-child cohort study group.

  1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: None declared.

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.


Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/sagmb-2021-0035).


Received: 2021-05-04
Revised: 2022-01-20
Accepted: 2022-02-06
Published Online: 2022-03-07

© 2022 Walter de Gruyter GmbH, Berlin/Boston
