Abstract
Population stratification (PS) is a major source of confounding in both single nucleotide polymorphism (SNP) and haplotype association studies. To address PS, principal component regression (PCR) and the linear mixed model (LMM) are the current standards for SNP associations, and both are commonly borrowed for haplotype studies. However, the underfitting and overfitting problems introduced by PCR and LMM, respectively, have yet to be addressed. Furthermore, few theoretical approaches have been proposed to address PS specifically for haplotypes. In this paper, we propose a new method under the Bayesian LASSO framework, QBLstrat, to account for PS in identifying rare and common haplotypes associated with a continuous trait of interest. QBLstrat utilizes a large number of principal components (PCs) with appropriate priors to sufficiently correct for PS, while shrinking the estimates of unassociated haplotypes and PCs. We compare the performance of QBLstrat with the Bayesian counterparts of PCR and LMM and with a current method, haplo.stats. Extensive simulation studies and real data analyses show that QBLstrat is superior in controlling false positives while maintaining competitive power for identifying true positives under PS.
Funding source: National Institutes of Health
Award Identifier / Grant number: R01GM114142
Funding source: National Center for Advancing Translational Sciences
Award Identifier / Grant number: Unassigned
Funding source: National Heart, Lung, and Blood Institute
Award Identifier / Grant number: Unassigned
Acknowledgment
The authors would like to acknowledge the Hoffman Family Center in Genetics and Epidemiology and the National Center for Advancing Translational Sciences (NCATS) for supporting the DHS study, and the National Heart, Lung, and Blood Institute (NHLBI) for the MESA data collection.
- Research ethics: Not applicable.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: The authors state no conflict of interest.
- Research funding: Research on this project is supported in part by National Institutes of Health (NIH) grant R01GM114142.
- Data availability: The raw data can be obtained on request from the corresponding author.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/sagmb-2022-0034).
© 2024 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Frontmatter
- Research Articles
- Empirically adjusted fixed-effects meta-analysis methods in genomic studies
- A CNN-CBAM-BIGRU model for protein function prediction
- A heavy-tailed model for analyzing miRNA-seq raw read counts
- Flexible model-based non-negative matrix factorization with application to mutational signatures
- Choice of baseline hazards in joint modeling of longitudinal and time-to-event cancer survival data
- Assessing the feasibility of statistical inference using synthetic antibody-antigen datasets
- A global test of hybrid ancestry from genome-scale data
- Integrative pathway analysis with gene expression, miRNA, methylation and copy number variation for breast cancer subtypes
- Bayesian LASSO for population stratification correction in rare haplotype association studies