Data-adaptive multi-locus association testing in subjects with arbitrary genealogical relationships

Gail Gong; Wei Wang; Chih-Lin Hsieh; David J. Van Den Berg; Christopher Haiman; Ingrid Oakley-Girvan; Alice S. Whittemore

doi:10.1515/sagmb-2018-0030

Published/Copyright: April 8, 2019

From the journal Statistical Applications in Genetics and Molecular Biology Volume 18 Issue 3

Abstract

Genome-wide sequencing enables evaluation of associations between traits and combinations of variants in genes and pathways. But such evaluation requires multi-locus association tests with good power, regardless of the variant and trait characteristics. And since analyzing families may yield more power than analyzing unrelated individuals, we need multi-locus tests applicable to both related and unrelated individuals. Here we describe such tests, and we introduce SKAT-X, a new test statistic that uses genome-wide data obtained from related or unrelated subjects to optimize power for the specific data at hand. Simulations show that: a) SKAT-X performs well regardless of variant and trait characteristics; and b) for binary traits, analyzing affected relatives brings more power than analyzing unrelated individuals, consistent with previous findings for single-locus tests. We illustrate the methods by application to rare unclassified missense variants in the tumor suppressor gene BRCA2, as applied to combined data from prostate cancer families and unrelated prostate cancer cases and controls in the Multi-ethnic Cohort (MEC). The methods can be implemented using open-source code for public use as the R-package GATARS (Genetic Association Tests for Arbitrarily Related Subjects) <https://gailg.github.io/gatars/>.

Keywords: data-adaptive tests; multi-locus kernel tests; related subjects

Funding source: NIH

Award Identifier / Grant number: R01CA179011 and R01CA094069

Funding statement: This work was supported by the NIH (Funder Id: http://dx.doi.org/10.13039/100000002), grants R01CA179011 and R01CA094069.

Acknowledgment

We are grateful to the Multi-ethnic Cohort (MEC) investigators who made the MEC prostate cancer data available for analysis.

Appendices

A Alternate definitions of Q_B, Q_S and Q_T

Straightforward matrix multiplication based on equations (4–6) gives the following expressions for the three basic test statistics:

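$$
Q_B = \left[\sum_{m=1}^{M} w_m \sum_{n=1}^{N} (y_n - \mu_n)\, g_{nm}\right]^2, \qquad
Q_S = \sum_{m=1}^{M} \left[ w_m \left( \sum_{n=1}^{N} (y_n - \mu_n)\, g_{nm} \right) \right]^2, \qquad
Q_T = \sum_{m=1}^{M} \left[ \left( w_m \sum_{n=1}^{N} y_n\, g_{nm} \right)^2 - \left( w_m \sum_{n=1}^{N} \mu_n\, g_{nm} \right)^2 \right]. \tag{10}
$$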

These expressions show that Q_B sums phenotype-genotype products over markers before squaring the summations. Thus, as is well known (e.g. Schaid et al., 2013), Q_B is vulnerable to power loss for target sets containing both positively and negatively trait-associated markers. Moreover, Q_B and Q_S contrast genotype similarities in pairs of subjects who share positive (or negative) trait residuals y_n − μ_n against similarities in residual-discordant pairs. Thus they are vulnerable to power loss when applied to marker sets whose trait-associated variants are all associated in the same direction. For example, for a binary trait with phenotypes y_n = 1 for affected subjects and y_n = 0 for unaffected subjects, expressions (10) show that Q_B and Q_S contrast genotype similarities in pairs of concordantly affected and concordantly unaffected subjects against genotype similarities in phenotype-discordant pairs. Thus they are vulnerable to power loss from genotype sharing in pairs of unaffected subjects, a particular problem for rare deleterious variants, when many pairs of controls share the wild-type allele. In contrast, Q_T compares observed genotype sharing by pairs of affected subjects with its null value based on the null predicted phenotypes μ_n, and can assume either positive or negative values. Simulations (data not shown) support these heuristic power comparisons by suggesting that when all trait-associated variants are positively associated, Q_T outperforms or matches both Q_B and Q_S for both binary and quantitative traits. However, when the trait-associated markers are both positively and negatively associated and the trait is binary, Q_S outperforms Q_T, which in turn outperforms Q_B.
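As a rough illustration of the three expressions, the following Python sketch computes Q_B, Q_S and Q_T directly from the marker-wise sums. It is not the GATARS implementation (which is written in R); the function name `basic_statistics` and the toy data are hypothetical and chosen only to show how Q_B can cancel when markers are associated in opposite directions.

```python
def basic_statistics(y, mu, g, w):
    """Return (Q_B, Q_S, Q_T) computed from marker-wise sums over subjects.

    y  : phenotypes y_n, length N
    mu : null predicted phenotypes mu_n, length N
    g  : genotypes, g[n][m] for subject n and marker m (N rows, M columns)
    w  : marker weights w_m, length M
    """
    n_subjects, n_markers = len(y), len(w)
    subjects, markers = range(n_subjects), range(n_markers)

    # Marker-wise sums over subjects.
    resid = [sum((y[n] - mu[n]) * g[n][m] for n in subjects) for m in markers]
    observed = [sum(y[n] * g[n][m] for n in subjects) for m in markers]
    expected = [sum(mu[n] * g[n][m] for n in subjects) for m in markers]

    # Q_B: sum over markers first, then square.
    q_b = sum(w[m] * resid[m] for m in markers) ** 2
    # Q_S: square each weighted marker sum, then add.
    q_s = sum((w[m] * resid[m]) ** 2 for m in markers)
    # Q_T: squared observed sums minus squared null-expected sums.
    q_t = sum((w[m] * observed[m]) ** 2 - (w[m] * expected[m]) ** 2 for m in markers)
    return q_b, q_s, q_t


# Hypothetical toy data: two cases, two controls, null risk 0.5 for everyone.
# Marker 0 is carried only by cases, marker 1 only by controls.
y = [1, 1, 0, 0]
mu = [0.5, 0.5, 0.5, 0.5]
g = [[1, 0], [1, 0], [0, 1], [0, 1]]
w = [1, 1]

q_b, q_s, q_t = basic_statistics(y, mu, g, w)
print(q_b, q_s, q_t)
```

In this toy example the two marker-wise residual sums are +1 and −1, so they cancel inside Q_B before squaring, whereas Q_S squares each sum first and is unaffected by the opposite signs.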

B Optimizing the data-adaptive statistics

Here we describe the procedure used to determine the vector α_* = (α_*B, α_*S, α_*T) that maximizes the value X_α (G) = –log₁₀P(α;G), where P(α;G) is the nominal p-value of Q_α(G). The three pair-wise statistics Q_BS, Q_BT and Q_ST (which lie along the three edges of the triangle ℑ of Figure 1) can be expressed as weighted sums of the appropriate pair of basic statistics, with weights between zero and one. To optimize the weights for these statistics, GATARS uses a combination of golden section search and successive parabolic interpolation (Brent, 1973, p. 10), implemented using the R function optimize. The SKAT-X statistic Q_BST is optimized using the box constraint method of Byrd et al. (1995). This quasi-Newton method is implemented using the R function optim with option “L-BFGS-B”. The weight vector is parameterized as α = (α_B, α_S, α_T) = (cos²θ₁, (sinθ₁ cosθ₂)², (sinθ₁ sinθ₂)²), with 0 ≤ θ₁ ≤ π/2 and 0 ≤ θ₂ ≤ π/2, so that the weights are non-negative and sum to one while the search itself is subject only to box constraints on (θ₁, θ₂).
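The angle parameterization and the one-dimensional edge search can be sketched in Python as follows. This is only an illustration under stated assumptions: `alpha_from_angles` and `golden_section_max` are hypothetical names, the objective is a toy unimodal function rather than –log₁₀P(α;G), and plain golden section search is used in place of the combined golden section and parabolic interpolation method that the R function optimize provides.

```python
import math


def alpha_from_angles(theta1, theta2):
    """Map angles in [0, pi/2] x [0, pi/2] to weights (alpha_B, alpha_S, alpha_T)."""
    alpha_b = math.cos(theta1) ** 2
    alpha_s = (math.sin(theta1) * math.cos(theta2)) ** 2
    alpha_t = (math.sin(theta1) * math.sin(theta2)) ** 2
    return alpha_b, alpha_s, alpha_t


def golden_section_max(f, lo=0.0, hi=1.0, tol=1e-8):
    """Maximize a unimodal function f on [lo, hi] by plain golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0


# Any pair of angles yields non-negative weights that sum to one.
weights = alpha_from_angles(0.7, 1.1)
weight_total = sum(weights)

# Toy stand-in for an edge statistic's objective: peaked at weight 0.3.
best_weight = golden_section_max(lambda t: -(t - 0.3) ** 2)
print(weights, weight_total, best_weight)
```

Under this parameterization θ₁ = 0 gives the corner α = (1, 0, 0), while θ₁ = π/2 sets α_B to zero (up to rounding) and moves along the edge joining Q_S and Q_T as θ₂ varies.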

C Null distribution of statistics Q_α for fixed α

Under the null hypothesis, when α is fixed and the test statistic Q_α of (4) is conditioned on subjects’ phenotypes and covariates, it is asymptotically equivalent to a quadratic form in a Gaussian vector. The distribution of such quadratic forms is known and can be described as follows. Dropping the subscript α and focusing on the autosome, we rewrite (4) as

$$
Q = \tilde{z}^{T} \tilde{A}\, \tilde{z}, \tag{11}
$$

with $\tilde{z} = V^{-1/2} z$ and $\tilde{A} = V^{1/2} A V^{1/2}$, where V is the nonsingular covariance matrix of z given by equation (8). We perform a spectral (eigenvalue) decomposition of $\tilde{A}$ as

$$
\tilde{A} = U^{T} \Lambda U, \tag{12}
$$

where U is the (2M)×(2M) orthonormal matrix of eigenvectors of $\tilde{A}$ and Λ is the (2M)×(2M) diagonal matrix whose diagonal entries are the eigenvalues $\lambda_\nu$ of $\tilde{A}$, ν = 1, …, 2M. Substituting (12) into (11) yields

$$
Q = (U\tilde{z})^{T} \Lambda\, (U\tilde{z}) \equiv x^{T} \Lambda x = \sum_{\nu=1}^{2M} \lambda_\nu x_\nu^{2}, \tag{13}
$$

where $x = U\tilde{z} = U V^{-1/2} z$ has covariance equal to the identity matrix of dimension 2M. Thus under the null hypothesis, Q is a linear combination (mixture) of the independent variables $x_\nu^{2}$, ν = 1, …, 2M, each with a limiting chi-squared distribution with one degree of freedom. We determined significance levels and critical points of its null distribution using the Davies exact method (Davies, 1980). We note that for the SKAT statistic (given by Q_S = Q_α with α = (α_B, α_S, α_T) = (0, 1, 0)), this limiting null distribution differs from that obtained when conditioning on genotypes and covariates and testing the null hypothesis (1); in this case the limiting null distribution of Q_S is the sum of N independent chi-squared distributions (Wu et al., 2011). The present method for obtaining significance levels also differs from that of Schaid et al. (2013), who, although testing the null hypothesis (2), estimate the limiting distribution of Q by a scaled chi-square distribution, with scale and degrees of freedom estimated by the (known) first two moments of Q.
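To make the mixture representation concrete, the following Python sketch estimates the tail probability of a weighted sum of independent one-degree-of-freedom chi-squared variables by simple Monte Carlo simulation. This is a deliberately swapped-in technique for illustration only: it is not the Davies method used in the paper, and the function name `mixture_tail_probability` and the eigenvalues in the example are hypothetical.

```python
import random


def mixture_tail_probability(eigenvalues, q_observed, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(sum_v lambda_v * x_v^2 >= q_observed).

    The x_v are independent standard normal draws, so each x_v^2 is
    chi-squared with one degree of freedom. Eigenvalues may be negative,
    as can happen for statistics that take negative values.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        q = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in eigenvalues)
        if q >= q_observed:
            exceed += 1
    return exceed / n_draws


# Sanity check: a single eigenvalue of 1 reduces to a chi-squared(1) tail,
# whose upper 5% critical point is approximately 3.8415.
p_single = mixture_tail_probability([1.0], 3.841459)

# Hypothetical mixed-sign eigenvalues: the distribution is symmetric about 0.
p_mixed = mixture_tail_probability([1.0, -1.0], 0.0)
print(p_single, p_mixed)
```

Simulation of this kind is too imprecise for the very small p-values typical of genome-wide testing, which is one reason a numerical method such as that of Davies (1980) is preferable in practice.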

D Selecting the null marker sets

In the kth of K iterations, we construct a set of null markers by randomly sampling one marker from each of M sampling sets S₁, …, S_M containing diallelic autosomal markers. Each sampling set S_m is chosen to contain at least 200 markers with empirical MAFs “matching” that of target marker m, i.e. lying in the interval (π_m(1 – ε), π_m(1 + ε)) for some user-specified value of ε, with 0 < ε ≪ 1 (we chose ε = 0.01). In addition, S_m contains no target markers or markers known to be associated with the trait. To construct these M sampling sets, GATARS uses the recombination hotspots identified by Myers et al. (2005) to partition the human autosome into 12,327 chromosome segments, with linkage equilibrium between pairs of markers in distinct segments. GATARS then deletes all segments containing target markers or markers known to be associated with the trait. The sampling sets S_m are obtained by choosing all markers on the remaining segments with frequencies matching the frequency π_m of marker m, m = 1, …, M.
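The construction of the sampling sets and the drawing of one null marker set can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the GATARS code: the function names, the tuple layout `(marker_id, segment_id, maf)` and the toy candidates are all hypothetical, and the sketch does not address whether the same marker may be drawn for two targets with matching frequencies.

```python
import random


def build_sampling_sets(target_mafs, candidates, excluded_segments,
                        eps=0.01, min_size=200):
    """Build one sampling set of frequency-matched markers per target marker.

    target_mafs       : MAFs pi_m of the M target markers
    candidates        : iterable of (marker_id, segment_id, maf) tuples
    excluded_segments : segments containing target or trait-associated markers
    """
    eligible = [c for c in candidates if c[1] not in excluded_segments]
    sampling_sets = []
    for pi_m in target_mafs:
        lower, upper = pi_m * (1.0 - eps), pi_m * (1.0 + eps)
        matches = [marker for marker, _, maf in eligible if lower < maf < upper]
        if len(matches) < min_size:
            raise ValueError(f"only {len(matches)} markers match MAF {pi_m}")
        sampling_sets.append(matches)
    return sampling_sets


def draw_null_marker_set(sampling_sets, rng):
    """Draw one marker at random from each sampling set (one iteration k)."""
    return [rng.choice(markers) for markers in sampling_sets]


# Hypothetical toy input; min_size is lowered so the example is small.
candidates = [
    ("a", 1, 0.1000),
    ("b", 1, 0.1005),
    ("c", 2, 0.1002),  # segment 2 is excluded below
    ("d", 3, 0.0999),
    ("e", 3, 0.2000),  # frequency does not match
]
sampling_sets = build_sampling_sets([0.1], candidates, {2}, eps=0.01, min_size=2)
null_set = draw_null_marker_set(sampling_sets, random.Random(1))
print(sampling_sets, null_set)
```

Excluding whole segments, rather than individual markers, follows the description above: it removes every marker that might be in linkage disequilibrium with a target or trait-associated marker.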

For the data application, we identified 8515 chromosome segments without target markers or markers known to be associated with prostate cancer, and containing at least one marker with frequency matching one or more of the M = 24 target markers. These segments yielded potential sampling sets ranging in size from 225 to 31,562 markers; but for computational efficiency we limited their sizes to at most 1000 markers. For a given replication of a given simulation, we created a sampling set for each of the M = 20 target markers by randomly selecting 200 markers with matching MAFs from the simulated chromosomal segments without target markers.

References

Abecasis, G. R., W. O. Cookson and L. R. Cardon (2001): “The power to detect linkage disequilibrium with quantitative traits in selected samples,” Am. J. Hum. Genet., 68, 1463–1474. doi:10.1086/320590

Basu, S. and W. Pan (2011): “Comparison of statistical tests for disease association with rare variants,” Genet. Epidemiol., 35, 606–619. doi:10.1002/gepi.20609

Brent, R. (1973): Algorithms for minimization without derivatives, Prentice Hall, Englewood Cliffs, New Jersey, p. 10.

Byrd, R. H., P. Lu, J. Nocedal and C. Zhu (1995): “A limited memory algorithm for bound constrained optimization,” SIAM J. Sci. Comput., 16, 1190–1208. doi:10.2172/204262

Chen, H., C. Wang, M. P. Conomos, A. M. Stilp, Z. Li, T. Sofer, A. A. Szpiro, W. Chen, J. M. Brehm, J. C. Celedon, S. Redline, G. J. Papanicolaou, T. A. Thornton, C. C. Laurie, K. Rice and X. Lin (2016): “Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models,” Am. J. Hum. Genet., 98, 653–666. doi:10.1016/j.ajhg.2016.02.012

Davies, R. B. (1980): “Algorithm AS 155: The distribution of a linear combination of chi-square random variables,” J. Royal Stat. Soc. Series C, 29, 323–333. doi:10.2307/2346911

Gallagher, D. J., M. M. Gaudet, P. Pal, T. Kirchhoff, L. Balistreri, K. Vora, J. Bhatia, Z. Stadler, S. W. Fine, V. Reuter, M. Zelefsky, M. J. Morris, H. I. Scher, R. J. Klein, L. Norton, J. A. Eastham, P. T. Scardino, M. E. Robson and K. Offit (2010): “Germline BRCA mutations denote a clinicopathologic subset of prostate cancer,” Clin. Cancer Res., 16, 2115–2121. doi:10.1158/1078-0432.CCR-09-2871

Haiman, C. A., Y. Han, Y. Feng, L. Xia, C. Hsu, X. Sheng, L. C. Pooler, Y. Patel, L. N. Kolonel, E. Carter, K. Park, L. Le Marchand, D. Van Den Berg, B. E. Henderson and D. O. Stram (2013): “Genome-wide testing of putative functional exonic variants in relationship with breast and prostate cancer risk in a multiethnic population,” PLoS Genet., 9, e1003419. doi:10.1371/journal.pgen.1003419

Hsieh, C. L., I. Oakley-Girvan, R. R. Balise, J. Halpern, R. P. Gallagher, A. H. Wu, L. N. Kolonel, L. E. O’Brien, I. G. Lin, D. J. Van Den Berg, C. Z. Teh, D. W. West and A. S. Whittemore (2001): “A genome screen of families with multiple cases of prostate cancer: evidence of genetic heterogeneity,” Am. J. Hum. Genet., 69, 148–158. doi:10.1086/321281

Ioannidis, N. M., J. H. Rothstein, V. Pejaver, S. Middha, S. K. McDonnell, S. Baheti, A. Musolf, Q. Li, E. Holzinger, D. Karyadi, L. A. Cannon-Albright, C. C. Teerlink, J. L. Stanford, W. B. Isaacs, J. Xu, K. A. Cooney, E. M. Lange, J. Schleutker, J. D. Carpten, I. J. Powell, O. Cussenot, G. Cancel-Tassin, G. G. Giles, R. J. MacInnis, C. Maier, C. L. Hsieh, F. Wiklund, W. J. Catalona, W. D. Foulkes, D. Mandal, R. A. Eeles, Z. Kote-Jarai, C. D. Bustamante, D. J. Schaid, T. Hastie, E. A. Ostrander, J. E. Bailey-Wilson, P. Radivojac, S. N. Thibodeau, A. S. Whittemore and W. Sieh (2016): “REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants,” Am. J. Hum. Genet., 99, 877–885. doi:10.1016/j.ajhg.2016.08.016

Kote-Jarai, Z., D. Leongamornlert, E. Saunders, M. Tymrakiewicz, E. Castro, N. Mahmud, M. Guy, S. Edwards, L. O’Brien, E. Sawyer, A. Hall, R. Wilkinson, T. Dadaev, C. Goh, D. Easton, UKGPCS Collaborators, D. Goldgar and R. Eeles (2011): “BRCA2 is a moderate penetrance gene contributing to young-onset prostate cancer: implications for genetic testing in prostate cancer patients,” Br. J. Cancer, 105, 1230–1234. doi:10.1038/bjc.2011.383

Kryukov, G. V., L. A. Pennacchio and S. R. Sunyaev (2007): “Most rare missense alleles are deleterious in humans: implications for complex disease and association studies,” Am. J. Hum. Genet., 80, 727–739. doi:10.1086/513473

Lee, S., M. J. Emond, M. J. Bamshad, K. C. Barnes, M. J. Rieder, D. A. Nickerson and X. Lin (2012a): “Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies,” Am. J. Hum. Genet., 91, 224–237. doi:10.1016/j.ajhg.2012.06.007

Lee, S., M. C. Wu and X. Lin (2012b): “Optimal tests for rare variant effects in sequencing association studies,” Biostatistics, 13, 762–775. doi:10.1093/biostatistics/kxs014

Li, B. and S. M. Leal (2008): “Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data,” Am. J. Hum. Genet., 83, 311–321. doi:10.1016/j.ajhg.2008.06.024

Liu, L., J. Lei and K. Roeder (2015): “Network assisted analysis to reveal the genetic basis of autism,” Ann. Appl. Stat., 9, 1571–1600. doi:10.1214/15-AOAS844

Madsen, B. E. and S. R. Browning (2009): “A groupwise association test for rare mutations using a weighted sum statistic,” PLoS Genet., 5, e1000384. doi:10.1371/journal.pgen.1000384

Makinen, V. P., M. Civelek, Q. Meng, B. Zhang, J. Zhu, C. Levian, T. Huan, A. V. Segrè, S. Ghosh, J. Vivar, M. Nikpay, A. F. Stewart, C. P. Nelson, C. Willenborg, J. Erdmann, S. Blakenberg, C. J. O’Donnell, W. März, R. Laaksonen, S. E. Epstein, S. Kathiresan, S. H. Shah, S. L. Hazen, M. P. Reilly, Coronary ARtery DIsease Genome-Wide Replication And Meta-Analysis (CARDIoGRAM) Consortium, A. J. Lusis, N. J. Samani, H. Schunkert, T. Quertermous, R. McPherson, X. Yang and T. L. Assimes (2014): “Integrative genomics reveals novel molecular pathways and gene networks for coronary artery disease,” PLoS Genet., 10, e1004502. doi:10.1371/journal.pgen.1004502

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale and W. M. Chen (2010): “Robust relationship inference in genome-wide association studies,” Bioinformatics, 26, 2867–2873. doi:10.1093/bioinformatics/btq559

Morris, A. P. and E. Zeggini (2010): “An evaluation of statistical approaches to rare variant analysis in genetic association studies,” Genet. Epidemiol., 34, 188–193. doi:10.1002/gepi.20450

Myers, S., L. Bottolo, C. Freeman, G. McVean and P. Donnelly (2005): “A fine-scale map of recombination rates and hotspots across the human genome,” Science, 310, 321–324. doi:10.1126/science.1117196

Neale, B. M., M. A. Rivas, B. F. Voight, D. Altshuler, B. Devlin, M. Orho-Melander, S. Kathiresan, S. M. Purcell, K. Roeder and M. J. Daly (2011): “Testing for an unusual distribution of rare variants,” PLoS Genet., 7, e1001322. doi:10.1371/journal.pgen.1001322

Nelson, M. R., D. Wegmann, M. G. Ehm, D. Kessner, St P. Jean, C. Verzilli, J. Shen, Z. Tang, S. A. Bacanu, D. Fraser, L. Warren, J. Aponte, M. Zawistowski, X. Liu, H. Zhang, Y. Zhang, J. Li, Y. Li, L. Li, P. Woollard, S. Topp, M. D. Hall, K. Nangle, J. Wang, G. Abecasis, L. R. Cardon, S. Zöllner, J. C. Whittaker, S. L. Chissoe, J. Novembre and V. Mooser (2012): “An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people,” Science, 337, 100–104. doi:10.1126/science.1217876

Park, J. Y., C. Wu, S. Basu, M. McGue and W. Pan (2018): “Adaptive SNP-set association testing in generalized linear mixed models with application to family studies,” Behavior Genet., 48, 55–66. doi:10.1007/s10519-017-9883-x

Petralia, F., W. M. Song, Z. Tu and P. Wang (2016): “New method for joint network analysis reveals common and different coexpression patterns among genes and proteins in breast cancer,” J. Proteome Res., 15, 743–754. doi:10.1021/acs.jproteome.5b00925

Price, A. L., G. V. Kryukov, P. I. de Bakker, S. M. Purcell, J. Staples, L. J. Wei and S. R. Sunyaev (2010): “Pooled association tests for rare variants in exon-resequencing studies,” Am. J. Hum. Genet., 86, 832–838. doi:10.1016/j.ajhg.2010.04.005

Pritchard, C. C., J. Mateo, M. F. Walsh, N. De Sarkar, W. Abida, H. Beltran, A. Garofalo, R. Gulati, S. Carreira, R. Eeles, O. Elemento, M. A. Rubin, D. Robinson, R. Lonigro, M. Hussain, A. Chinnaiyan, J. Vinson, J. Filipenko, L. Garraway, M. E. Taplin, S. AlDubayan, G. C. Han, M. Beightol, C. Morrissey, B. Nghiem, H. H. Cheng, B. Montgomery, T. Walsh, S. Casadei, M. Berger, L. Zhang, A. Zehir, J. Vijai, H. I. Scher, C. Sawyers, N. Schultz, P. W. Kantoff, D. Solit, M. Robson, E. M. Van Allen, K. Offit, J. de Bono and M. D. Nelson (2016): “Inherited DNA-Repair Gene Mutations in Men with Metastatic Prostate Cancer,” N. Engl. J. Med., 375, 443–453. doi:10.1056/NEJMoa1603144

Schaffner, S. F., C. Foo, S. Gabriel, D. Reich, M. J. Daly and D. Altshuler (2005): “Calibrating a coalescent simulation of human genome sequence variation,” Genome Res., 15, 1576–1583. doi:10.1101/gr.3709305

Schaid, D. J., S. K. McDonnell, J. P. Sinnwell and S. N. Thibodeau (2013): “Multiple genetic variant association testing by collapsing and kernel methods with pedigree or population structured data,” Genet. Epidemiol., 37, 409–418. doi:10.1002/gepi.21727

Stroup, W. W. (2012): “Generalized linear mixed models: modern concepts, methods and applications,” CRC Press, Boca Raton, Florida.

Teng, J. and N. Risch (1999): “The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. II. Individual genotyping,” Genome Res., 9, 234–241. doi:10.1101/gr.9.3.234

Tennessen, J. A., A. W. Bigham, T. D. O’Connor, W. Fu, E. E. Kenny, S. Gravel, S. McGee, R. Do, X. Liu, G. Jun, H. M. Kang, D. Jordan, S. M. Leal, S. Gabriel, M. J. Rieder, G. Abecasis, D. Altshuler, D. A. Nickerson, E. Boerwinkle, S. Sunyaev, C. D. Bustamante, M. J. Bamshad, J. M. Akey, G. O. Broad and G. O. Seattle (2012): “Evolution and functional impact of rare coding variation from deep sequencing of human exomes,” Science, 337, 64–69. doi:10.1126/science.1219240

Thornton, T. and M. S. McPeek (2010): “ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure,” Am. J. Hum. Genet., 86, 172–184. doi:10.1016/j.ajhg.2010.01.001

Wang, K. (2016): “Boosting the Power of the Sequence Kernel Association Test by Properly Estimating Its Null Distribution,” Am. J. Hum. Genet., 99, 104–114. doi:10.1016/j.ajhg.2016.05.011

Wu, M. C., S. Lee, T. Cai, Y. Li, M. Boehnke and X. Lin (2011): “Rare-variant association testing for sequencing data with the sequence kernel association test,” Am. J. Hum. Genet., 89, 82–93. doi:10.1016/j.ajhg.2011.05.029

Yang, J., S. H. Lee, M. E. Godard and P. M. Visscher (2011): “GCTA: a tool for genome-wide complex trait analysis,” Am. J. Hum. Genet., 88, 76–82. doi:10.1016/j.ajhg.2010.11.011

Zhang, Q., L. Wang, I. B. Boreki and M. A. Province (2014): “Adjusting family relatedness in data-driven burden tests of rare variants,” Genet. Epidemiol., 38, 722–727. doi:10.1002/gepi.21848

Zhu, Y. and M. Xiong (2012): “Family-based association studies for next-generation sequencing,” Am. J. Hum. Genet., 90, 1028–1045. doi:10.1016/j.ajhg.2012.04.022

Published Online: 2019-04-08
