Additive varying-coefficient model for nonlinear gene-environment interactions

Cen Wu; Ping-Shou Zhong; Yuehua Cui

doi:10.1515/sagmb-2017-0008

Article Open Access

Additive varying-coefficient model for nonlinear gene-environment interactions

Cen Wu , Ping-Shou Zhong and Yuehua Cui

Published/Copyright: February 8, 2018

Published by

Become an author with De Gruyter Brill

Submit Manuscript Author Information

From the journal Statistical Applications in Genetics and Molecular Biology Volume 17 Issue 2

Abstract

Gene-environment (G×E) interaction plays a pivotal role in understanding the genetic basis of complex disease. When environmental factors are measured continuously, one can assess the genetic sensitivity over different environmental conditions on a disease trait. Motivated by the increasing awareness of gene set based association analysis over single variant based approaches, we proposed an additive varying-coefficient model to jointly model variants in a genetic system. The model allows us to examine how variants in a gene set are moderated by an environment factor to affect a disease phenotype. We approached the problem from a variable selection perspective. In particular, we select variants with varying, constant and zero coefficients, which correspond to cases of G×E interaction, no G×E interaction and no genetic effect, respectively. The procedure was implemented through a two-stage iterative estimation algorithm via the smoothly clipped absolute deviation penalty function. Under certain regularity conditions, we established the consistency property in variable selection as well as effect separation of the two stage iterative estimators, and showed the optimal convergence rates of the estimates for varying effects. In addition, we showed that the estimate of non-zero constant coefficients enjoy the oracle property. The utility of our procedure was demonstrated through simulation studies and real data analysis.

Keywords: B-spline; gene-set analysis; local quadratic approximation; SCAD penalty; variable selection

1 Introduction

Complex human diseases are determined not only by genetic variants, but may also be affected by environmental factors and the interplay between them. Changes in gene expression under different environmental conditions reveal the interaction between genes and the environment. These changes are less likely due to changes in the gene sequence itself, but to structural changes such as DNA methylation or histone modification that consequently play a regulatory role and modulate gene expression. Such epigenetic changes have been increasing recognized as the epigenetic basis of gene-environment (G×E) interaction (Liu, Li & Tollefsbol, 2008). Identification of G×E interaction could shed novel insights into the phenotypic plasticity of complex disease phenotypes (Feinberg, 2004).

In a typical G×E interaction study, the environmental factor can be either discrete or continuous. For example, smoking can be a discrete variable when evaluating the risk of asthma. When environmental variables are measured on a continuous scale, a clearer picture of the interaction can be assessed since the varying patterns of genetic effects responsive to environmental changes can be traced, leading to a better understanding of the genetic heterogeneity under different environmental stimuli (Ma et al., 2011; Wu & Cui, 2013). As illustrated in Wu and Cui (2013), one can assess the nonlinear G×E interaction when an environmental factor is measured in a continuous scale. For example, individual obesity can be a factor when evaluating the risk of hypertension. One can assess the nonlinear effect of a genetic factor on the risk of hypertension considering the heterogeneity of individual obese conditions in a population, leading to a better understanding of disease heterogeneity.

When assessing G×E interactions, investigators have focused predominantly on single variant based analysis, such as the parametric methods in Guo (2000), the semi-parametric methods in Chatterjee and Carroll (2005), Chen, Chatterjee, and Carroll (2013), and Maity et al. (2009), and the non-parametric methods in Ma et al. (2011) and Wu and Cui (2013). Recently, there has been a significant increase in set-based genetic association studies focusing on a set of variants, for example, the gene-centric analysis of Cui et al. (2008), the gene-set analysis of Schaid et al. (2012) and Efron and Tibshirani (2007), and the pathway-based analysis of Wang, Li, and Hakonarson (2011). By assessing the joint function of multiple variants in a set, one can obtain a better interpretation of the disease signals and gain novel insights into disease etiology. Motivated by these set-based association studies, we propose a set-based framework to investigate how variants in a gene-set moderated by an environment factor affect disease and in what form.

In a typical set-based association study, the number of variants d within a genetic system can be relatively large compared to the sample size, which makes the regression coefficients estimation instable. The problem can be approached from the perspective of variable selection. In this work, we extend our previous work on nonlinear gene-environment interaction from a single variant based analysis to a multiple variant based analysis under a penalized regression framework. We include variants that belong to a particular gene-set or pathway which potentially interact with one or multiple environment factors through an additive varying-coefficient model. We propose to select genetic variants with coefficient functions that are varying, non-zero constant and zero corresponding to cases with G×E interactions, no G×E interactions and no genetic effects, respectively. Our approach employs the power and merits of variable selection by simultaneously fitting all the variants in a genetic system into a regression model, therefore avoiding the limitation of multiple testing corrections, especially when the data dimension is large.

This paper is organized as follows. In Section 2, we describe the penalized least square estimation procedure via B-spline basis expansion and smoothly clipped absolute deviation (SCAD) penalty, as well as the computational algorithms. We also present the theoretical results including consistency in variable selection and show the optimal convergence rates of the estimates of varying effects. We show that the estimates of non-zero constant coefficients enjoy the oracle property in the sense that the asymptotic distribution of the non-zero constant coefficient function is the same as that when the true model is known a priori. The merit of the proposed method is demonstrated through extensive simulation studies in Section 3 and real data analysis in Section 4. The technical proofs are relegated to the A.

2 Methods

2.1 Additive varying-coefficient model with SCAD penalty

Throughout this paper, we assume an environment variable (Z) is continuously measured through which we can model the nonlinear interaction effect. For simplicity, we start the presentation with one environmental factor. Extension to multiple environmental factors are given in the end. Let (Xi,Yi,Zi), i = 1, …, n be independent and identically distributed (i.i.d.) random vectors, then the varying coefficient (VC) model, initially proposed by Hastie and Tibshirani (1993), has the form

(1) Y i = ∑ j = 0 d β j ( Z i ) X i j + ε i ,

where X_ij is the jth component of (d+1)-dimensional genetic vector X_i with the first component X_i0 being 1, βj(⋅)’s are unknown varying-coefficient functions, Z_i is the environmental variable, and ε_i is the random error such that E(εi|X,Z)=0 and Var(εi|X,Z)=σ2<∞. In the model, we assume there are a total of d genetic variants which are moderated by a common environmental factor Z.

The smooth functions {βj(⋅)}j=0d in (1) can be approximated by polynomial splines. Without loss of generality, suppose that Z ∈ [0, 1]. Let w_k be a partition of the interval [0,1], with k_n uniform interior knots

w k = { 0 = w k , 0 < w k , 1 < … < w k , k n < w k , k n + 1 = 1 } , for k = 0 , ⋯ , d .

Let Fn be a collection of functions on [0,1] satisfying: (1) the function is a polynomial of degree p or less on subintervals Is=[wk,s,wk,s+1),s=0,…,kn−1 and Ikn=[wj,kn,wj,kn+1); and (2) the functions are p − 1 times continuously differentiable on [0,1]. Let B¯(⋅)={B¯jl(⋅)}l=1Lj be a set of normalized B spline basis of Fn. Then for j = 0, …, d, the VC functions can be approximated by basis functions βj(Z)≈∑l=1Ljγ¯jlB¯jl(Z), where L_j is the number of basis functions in approximating the function βj(Z). By changing the equivalent basis, the basis expansion can be reexpressed as

β j ( ⋅ ) ≈ ∑ l = 1 L j γ j , l B j , l ( ⋅ ) ≐ γ j , 1 + B ~ j T ( ⋅ ) γ j , ∗ ,

where the spline coefficient vector γj=(γj,1,γj∗T)T, and B~j(⋅)=(Bj2(⋅),…,BjLj(⋅))T; γ_j,1 and γj∗ correspond to the constant and varying part of the coefficient function, respectively (Schumaker, 1981). We treat γj∗ as a group. If ‖γj∗‖2 = 0, then the jth predictor only has a non-zero constant effect; if γ_j,1 = 0, then the predictor is redundant.

To carry out variable selection separating the varying, non-zero constant, and zero effects, we minimize the penalized least square function,

(2) Q ( γ ) = 1 n ∑ i = 1 n [ Y i − ∑ j = 0 d ∑ l = 1 L γ j , l X i j B j l ( Z i ) ] 2 + ∑ j = 1 d p λ 1 ( ‖ γ j ∗ ‖ 2 ) + ∑ j = 1 d p λ 2 ( | γ j , 1 | ) I ( ‖ γ j ∗ ‖ 2 = 0 ) ,

where λ₁ and λ₂ are the penalization parameters, pλ(⋅) is the SCAD penalty function, defined as

(3) p λ ( u ) = { λ u if 0 ≤ u ≤ λ − ( u 2 − 2 a λ u + λ 2 ) 2 ( a − 1 ) if λ < u ≤ a λ ( a + 1 ) λ 2 2 if u > a λ . .

In matrix notation, (2) can be reexpressed as,

(4) Q ( γ ) = ( Y − U γ ) T ( Y − U γ ) / n + ∑ j = 1 d p λ 1 ( ‖ γ j ∗ ‖ 2 ) + ∑ j = 1 d p λ 2 ( | γ j , 1 | ) I ( ‖ γ j ∗ ‖ 2 = 0 ) ,

where Y=(Y1,…,Yn)T, γ=(γ0T,…,γdT)T, and U:=U(X,Z)=(U1T,…,UnT)T with Ui=(Xi0B(Zi)T, …,XidB(Zi)T)T. The first penalty function in (4) is to separate the varying and constant effects by penalizing the L₂ norm of the varying part of the coefficient functions. The indicator function in the 2nd penalty term helps to penalize the variables of the constant effects. Both γ_j,1 and γj∗ will be shrunk to zero if predictor X_j has no genetic effect. Since the indicator function in Q(γ) leads to much difficulty in optimizing the penalized loss function, we resort to the two stage iterative framework of great computational convenience described in 2.2. It can be shown that the estimator from the iterative procedure is asymptotically equivalent to the minimizer in (2) by the arguments in the proof of Theorem 1 and 2 in the A.

2.2 Computation algorithm

The SCAD penalty function is singular at the origin, and does not have continuous 2nd order derivatives, therefore the regular gradient-based optimization cannot be applied. In this section, we develop an iterative two-stage algorithm to minimize the penalized loss function using local quadratic approximation (LQA) to the SCAD penalty. The two-stage strategy was adopted in Tang et al. (2012) for penalized quantile regression with adaptive LASSO penalty. As in Fan and Li (2001), in a neighborhood of a given positive u0∈R+,

p λ ( u ) ≈ p λ ( u 0 ) + p λ ′ ( u 0 ) 2 u 0 ( u 2 − u 0 2 ) ,

where pλ′(u)=λ{I(u⩽λ)+(aλ−u)+(a−1)λI(u>λ)} for u>0 and a = 3.7. Here we use a similar quadratic approximation by substituting u with ‖γj∗‖2 and |γj1| in LQA, for j = 1, …, d. Given an initial value of γj0 such that ‖γj∗‖2≠0 and |γj1|≠0, we have

(5) p λ ( ‖ γ j ∗ ‖ 2 ) ≈ p λ ( ‖ γ j ∗ 0 ‖ 2 ) + p λ ′ ( ‖ γ j ∗ 0 ‖ 2 ) 2 ‖ γ j ∗ 0 ‖ 2 ( ‖ γ j ∗ ‖ 2 2 − ‖ γ j ∗ 0 ‖ 2 2 )

and

(6) p λ ( | γ j , 1 | ) ≈ p λ ( | γ j , 1 0 | ) + p λ ′ ( | γ j , 1 0 | ) 2 | γ j , 1 0 | ( | γ j , 1 | 2 − | γ j , 1 0 | 2 ) .

The sets of predictors with varying, non-zero constant, and zero effects are denoted by 𝒱, 𝒞 and 𝒵 respectively. We implement the iterative algorithm in the following two-stage procedure. In stage 1, using the LQA (5) and dropping the irrelevant constant terms, we minimize

(7) Q 1 ( γ ) = ( Y − U γ ) T ( Y − U γ ) + n 2 γ T Ω λ 1 ( γ 0 ) γ ,

where the initial spline vector 𝜸₀ is the unpenalized estimator, Ωλ1(γ0) = diag{Ω0,Ω1,…,Ωd}, where Ω0=0L, Ωj={0,pλ1T(‖γj∗0‖2)‖γj∗0‖2,…,pλ1T(‖γj∗0‖2)‖γj∗0‖2}L for j = 1, …, d. Hence the estimator can be iteratively obtained as

(8) γ ^ V C ( m ) = { U T U + n 2 Ω λ 1 ( γ ^ V C ( m − 1 ) ) } − 1 U T Y .

If all the predictors are in 𝒱 at the beginning, then the jth predictor will be moved to 𝒞 if ‖γ^j∗VC(m)‖2 = 0, otherwise it will stay in 𝒱.

In stage 2, using the LQA (6) and dropping the irrelevant constant terms, we minimize the following penalized loss only for the predictors in 𝒞,

(9) Q 2 ( γ ) = ( Y − U γ ) T ( Y − U γ ) + n 2 γ T Ω λ 2 ( γ ^ V C ) γ ,

where Ωλ2(γ^VC) = diag{Ω0,Ω1,…,Ωd} with Ω0=0L, Ωj={pλ2T(|γ^j,1VC|)|γ^j,1VC|I(‖γ^j∗VC‖2=0),0,…,0}L. The estimator can be iteratively obtained as

(10) γ ^ C Z ( m ) = { U T U + n 2 Ω λ 2 ( γ ^ C Z ( m − 1 ) ) } − 1 U T Y .

If the jth predictor is in 𝒞, then it will be moved to 𝒵 if |γ^k,1CZ|=0, otherwise it stays in 𝒞.

We can obtain the estimator γ^ at convergence from the iterative procedure between the two stages above, and the estimated coefficient function in (1) as β^j(z)=BT(z)γ^j. β^j(z) will be a varying function, non-zero constant and zero if γ^j is in 𝒱, 𝒞 and 𝒵 correspondingly.

2.3 Choosing the tuning parameters

We choose the number of interior knots k_n, the degree of the spline basis p, and the tuning parameters λ₁ and λ₂ by a data driven procedure. Here p and k_n control the smoothness of the coefficient functions, while λ₁ and λ₂ determine the threshold for variable selection. The Schwarz BIC criterion (1978) was used to choose k_n and p. Due to heavy computational costs, it becomes infeasible to simultaneously select p and k_n for each varying-coefficient function. Thus, we assume the same p and k_n for the varying-coefficient functions. The range for k_n is [max(⌊0.5n1(2p+3)⌋,1),⌊1.5n1(2p+3)⌋], where ⌊x⌋ denotes the integer part of x. The optimal pair of k_n and p can be selected via a two-dimensional grid search, according to the following criterion:

BIC k n , p = log(RSS k n , p ) + ( k n + p + 1 ) n log ( n ) ,

where RSSkn,p=(Y−Uγ^)T(Y−Uγ^), γ^=(γ^0T,0T,…,0T)T. Conditional on the selected k_n and p, λ₁ is the minimizer of

BIC λ 1 = log(RSS λ 1 ) + d f λ 1 n log ( n ) ,

where RSSλ1=(Y−Uγ^λ1)T(Y−Uγ^λ1), γ^λ1 is the minimizer of (7), and df_λ₁ is the effective degree of freedom, defined as the total number of predictors in 𝒱 and 𝒞.

Conditional on γ^λ1, λ₂ is the minimizer of

BIC λ 2 = log(RSS λ 2 ) + d f λ 2 n log ( n ) ,

where RSSλ2=(Y−Uγ^λ2)T(Y−Uγ^λ2), γ^λ2 is the minimizer of (9), and df_λ₂ is the effective degree of freedom, defined similarly as df_λ₁.

2.4 Asymptotic results

Here we establish the asymptotic properties of the penalized least square estimators. Without loss of generality, we assume there are v varying coefficients as βj(⋅)≡βj(z),j=1,…,v, (c − v) non-zero constant coefficients as βj(⋅)≡βj>0, j=v+1,…,c, and (d − c) zero coefficients as βj(⋅)≡0, j=(c+1),…,d. Our asymptotic results are based on the following assumptions.

Let Hr be the collection of all functions on the compact support [0,1] such that the r₁th order derivatives of the functions are Hölder of order r₂ with r = r₁ + r₂, i.e. |hr1(z1)−hr1(z2)|≤C0|z1−z2|r2 where 0≤z1,z2≤1 and C₀ is a finite positive constant. Then βj(z)∈Hr, j = 0, 1, …, v, for some r≥32.
The density function of the index variable Z, f(z), is continuous and bounded away from 0 and infinity on [0,1], i.e. there exist finite positive constants C₁ and C₂ such that C1≤f(z)≤C2 for all z ∈ [0, 1].
Let λ~0≤…≤λ~d be the eigenvalues of E[XXT|Z=z]. Assume that λ~_j (k = 0, …, d) are uniformly bounded away from 0 and infinity in probability. In addition, the random design vectors are bounded in probability.
For w_j, the partition of the compact interval [0,1] defined as {0=wj,0<wj,1<…<wj,kn<wj,kn+1=1}, j = 0, …, d, there exists a finite positive constant C₃ such that

max ( w j , k + 1 − w j , k , k = 0 , … , k n ) min ( w j , k + 1 − w j , k , k = 0 , … , k n ) ≤ C 3 .
The tuning parameters satisfy kn12max{λ1,λ2}→0 and n12kn−1min{λ1,λ2}→∞.
bn:=maxj{|pλ1′′(‖γ~j∗‖)|,|pλ2′′(|γ~j,1|)|:γ~j∗≠0,γ~j,1≠0}→0 as n → ∞, where γ~j is defined in the A.
lim infn→∞lim infθ→0+λ1−1pλ1′(θ)>0 and lim infn→∞lim infθ→0+λ2−1pλ2′(θ)>0

The above assumptions are commonly used in the literature on polynomial splines and variable selections. An assumption similar to (A1) is found in Kim (2007) and Tang et al. (2012). (A1) guarantees certain degrees of smoothness of the true coefficient function in order to improve goodness of approximation. (A2) and (A3) are similar to those in Huang et al. (2002, 2004) and Wang, Li, and Huang (2008). (A4) suggests that the knot sequence is quasi-uniform on [0,1], as in Schumaker (1981). (A5–A7) are conditions on tuning parameters, of which (A5) was reported by Tang et al. (2012) while (A6) and (A7) are similar to those in Fan and Li (2001) and Wang, Li, and Huang (2008).

Theorem 1

Under the assumptions (A1–A7) and suppose kn=O(n12r+1), then we have

(1) β^j(z) are nonzero constant, j=v+1,…,c and β^j(z)=0, j=c+1,…,d, with probability approaching 1;

(2) |β^j(z)−βj(z)|=Op(n−r2r+1), j = 0, …, v for any fixed z.

The proof can be found in A. Denote β∗=(βv+1,…,βc)T as the vector of true nonzero constant coefficients. The following theorem establishes the asymptotic normality of β^∗.

Theorem 2

Under the assumptions (A1–A7) and suppose kn=O(n12r+1), then as n → ∞,

n Σ 1 2 ( β ^ ∗ − β ∗ ) → d N ( 0 , σ 2 I c − v ) ,

where Σ is defined as 22 in A, and σ2=E(εi2).

3 Simulation

The performance of the proposed method was demonstrated through extensive simulation studies. We used the percentage of choosing the true model out of total R replicates, defined as the oracle percentage, to evaluate the accuracy of variable selection by identifying varying, non-zero constant and zero effects. The precision of estimation was assessed by integrated mean squared error (IMSE). Let β^j(r) be the estimator of a nonparametric function β_j in the rth (1 ⩽ r ⩽ R) replication, and {zm}m=1ngrid be the grid points where β^j(r) was evaluated. We used the integrated mean squared error (IMSE) of β^k(z), defined as

I M S E ( β ^ j ( z ) ) = 1 R ∑ r = 1 R 1 n grid ∑ m = 1 n grid { β ^ k ( r ) ( z m ) − β j ( z m ) } 2 ,

to evaluate the estimation accuracy of coefficient β_j, and the total integrated mean squared error (TIMSE) of all the d coefficients, defined as TIMSE=∑j=1dβ^j(z), to evaluate the overall estimation accuracy. Note that IMSE(β^_j) is reduced to MSE(β^_j) when β^_j is a constant. The percentage of correctly selecting each individual true functions (defined as the selection ratio) was used to evaluate the selection performance.

We considered multiple genetic factors X obtained from a gene-set or pathway, with the following additive VC model,

Y i = β 0 ( Z i ) + ∑ j = 1 d β j ( Z i ) X i j + ε i ,

where SNP X_i’s were coded with 3 categories (1, 0, −1) for genotypes (AA, Aa, aa) respectively. We simulated the SNP genotype data based on the pairwise linkage disequilibrium(LD) structure. Suppose the two risk alleles A and B of two adjacent SNPs have the minor allele frequencies (MAFs) p_A and p_B, respectively, with LD denoted as δ. Then the frequencies of four haplotypes can be expressed as pab=(1−pA)(1−pB)+δ, pAb=pA(1−pB)−δ, paB=(1−pA)pB−δ, and pAB=pApB+δ. Assuming Hardy-Weinberg equilibrium, the SNP genotype at locus 1 can be simulated assuming a multinomial distribution with frequencies pA2, 2pA(1−pA) and (1−pA)2 for genotypes AA, Aa, aa, respectively. We can then simulate genotype for locus 2 based on the conditional probability. For example, P(BB|AA)=pAB2/pAA, P(Bb|AA)=pABpAb/pAA and P(bb|AA)=pab2/pAA. So conditional on genotype AA at locus 1, the genotype at locus 2 can be generated according to a multinomial distribution with the derived probabilities. The advantage of this simulation is that we can control the pairwise LD structure between adjacent SNPs. We assumed pairwise correlation of r = 0.5 which leads to δ=r(pA(1−pA)pB(1−pB)). To save space, we omitted the detailed simulation information which can be found in Cui et al. (2008). The coefficient functions were set as: β1(z)=sin⁡(2πz), β2(z)=2−3cos⁡{(6z−5)π/3}, β3(z)=3(2z−1)3, β4(z)=2, β5(z)=2.5, and βj(z)=0 for j > 5. We evaluated the performance under n = 500 with 500 replicates. Better performance results for large samples (n > 500) were observed, but were omitted to save space.

Figure 1 shows the selection ratio when d = 10, under different combinations of MAF and error distribution. The height of the bars represents the selection percentage out of 500 replicates. The selection performance is better under the normal error distribution, with relatively higher selection rate for the first five true functions and lower false selection ratio for the rest, compared to the results obtained under the t(3) error. In genetic association studies, model performance generally improves as the MAF increases. The same trend is observed under our variable selection framework. For example, a higher false selection ratio was observed under the t(3) error when p = 0.1. The false selection ratio decreases as MAF increases to 0.3. The result for d = 50 is presented in Figure 2, which shows a very similar pattern. The results demonstrate the stable performance of the proposed variable selection method.

Figure 1:

The selection ratio under different error distributions for different coefficient functions when d = 10. The horizontal axis represents the SNPs.

Table 1 lists the oracle percentage (%) of choosing the true model out of all the simulation replicates, the IMSE (inside the panel), and TIMSE (the last row) for the case with d = 10. In general, the model selection performance improves as the MAF increases from 0.1 to 0.5. For example, the oracle percentage increases from 0.72 to 0.91 under the t(3) error with SCAD penalty, when the MAF increases from 0.1 to 0.3. We observed dramatic reduction on the IMSE and TIMSE as the MAF increases. Under the normal error, the TIMSE is 0.4205 which reduces to 0.2007 when the MAF increases to 0.3 and further reduces to 0.1895 when p = 0.5. This result is consistent with the general observation in a genetic association study in which typically a model performs better as the MAF increases. It is worth mentioning that we observed dramatic improvement in model performance when the MAF increases from 0.1 to 0.3, compared to the improvement when the MAF increases from 0.3 to 0.5. For example, the IMSE for β2(u) reduces from 0.3285 to 0.1600, a 51% reduction when p increases from 0.1 to 0.3, while there is only a 1% reduction when p increases from 0.3 to 0.5 under the t(3) error distribution for the SCAD penalty. This empirical observation shows the stable performance of the model under moderate allele frequency.

Table 1:

List of IMSE, TIMSE, and Oracle percentage (%) under 𝒩(0, 1) and t(3) error distributions when d = 10.

	p = 0.1				p = 0.3				p = 0.5
	𝒩(0,1) error		t(3) error		𝒩(0,1) error		t(3) error		𝒩(0,1) error		t(3) error
	SCAD	Oracle²	SCAD	Oracle	SCAD	Oracle	SCAD	Oracle	SCAD	Oracle	SCAD	Oracle
Oracle %¹	0.976	1	0.72	1	0.992	1	0.91	1	0.98	1	0.894	1
β 1 ( u )	0.0863	0.0891	0.3078	0.2247	0.0268	0.0273	0.0607	0.0601	0.0213	0.0214	0.0431	0.0451
β 2 ( u )	0.1611	0.1667	0.3285	0.3557	0.1071	0.1174	0.1600	0.1746	0.1044	0.1106	0.1581	0.1725
β 3 ( u )	0.1264	0.1238	0.4890	0.2932	0.0561	0.0637	0.1360	0.1320	0.0497	0.0604	0.1101	0.1170
β 4 ( u )	0.0270	0.0192	1.3307	0.0643	0.0086	0.0084	0.1111	0.0237	0.0077	0.0077	0.0439	0.0192
β 5 ( u )	0.0191	0.0174	0.2943	0.0475	0.0066	0.0065	0.0443	0.0222	0.0063	0.0063	0.0240	0.0135
TIMSE	0.4205	0.4162	2.9342	0.9855	0.2007	0.2233	0.5311	0.4126	0.1895	0.206	0.4072	0.3673

¹Oracle % refers to the percentage of selecting all variables that are used to generate the phenotype Y;
²Oracle refers to the oracle IMSE, that is, the IMSE calculated assuming that we know the true regression model.

Another observation from the simulation is that the model performs better under the normal error than under the t(3) error. We observed a larger oracle percentage, smaller IMSE and TIMSE for the coefficient functions under the normal error compared to the t(3) error. For example, the TIMSE for the SCAD penalty is 0.4205 under the normal error, while it is 2.9342 under the t(3) error for fixed p = 0.1. In addition, the oracle percentage, IMSE and TIMSE under the normal error are all quite similar as those obtained as if the truth were known (the oracle) in all cases, demonstrating the stable selection performance of the SCAD penalty.

A similar pattern was observed when the data dimension increases to 50 (Table 2). As the MAF increases from 0.1 to 0.3, we observed sharply decreased IMSE and TIMSE. Compared to the low dimensional case when d = 10, the performance under p = 0.1 is relatively unstable. For example, the TIMSE for the SCAD method is 3.3644 when d = 50, compared to 0.4205 when d = 10 under the normal error and p = 0.1. However, we observed dramatic reduction in TIMSE when the MAF increases to 0.3 under d = 50. Thus, one has to be very careful about the interpretation of the selection result under low MAF in real data analysis. We did additional simulations when the sample size increases to 1000 and observed consistently improved results under different scenarios (data not shown). In summary, the SCAD penalty function shows consistently good performance and can separate varying, constant and zero effects under moderate allele frequencies. Coupled with the results shown in Figure 1 and Figure 2, the proposed variable selection method shows relatively stable performance to assess gene-environment interactions.

Table 2:

List of IMSE, TIMSE, and Oracle percentage (%) under 𝒩(0, 1) and t(3) error distributions when d = 50.

	p = 0.1				p = 0.3				p = 0.5
	𝒩(0,1) error		t(3) error		𝒩(0,1) error		t(3) error		𝒩(0,1) error		t(3) error
	SCAD	Oracle	SCAD	Oracle	SCAD	Oracle	SCAD	Oracle	SCAD	Oracle	SCAD	Oracle
Oracle %	0.908	1	0.435	1	0.986	1	0.745	1	0.988	1	0.87	1
β 1 ( u )	0.1929	0.0884	0.5687	0.2209	0.0289	0.0278	0.0860	0.0599	0.0215	0.0216	0.0450	0.0434
β 2 ( u )	0.2064	0.1684	0.3851	0.3340	0.1107	0.1137	0.1858	0.1742	0.1048	0.1123	0.1551	0.1608
β 3 ( u )	0.5235	0.1218	0.6934	0.2614	0.0817	0.0646	0.2205	0.1301	0.0608	0.0579	0.1754	0.1085
β 4 ( u )	2.0918	0.0196	2.4522	0.0484	0.1083	0.0075	0.3865	0.0254	0.0470	0.0078	0.1681	0.0167
β 5 ( u )	0.3475	0.0158	0.5996	0.0445	0.0229	0.0068	0.0840	0.0220	0.0120	0.0053	0.0480	0.0190
TIMSE	3.3644	0.4140	5.7021	0.9092	0.3526	0.2204	1.2288	0.4117	0.2461	0.2050	0.6492	0.3484

Figure 2:

The selection ratio under different error distributions for different coefficient functions when d = 50. The horizontal axis represents the SNPs.

To further assess the false positive controls of the proposed method, we generated the response from the intercept only model, i.e. Yi=β0(Zi)+εi. There are no main and interaction effects associated with the disease phenotype. The average number of false positive effects for (d,error)=(10,N(0,1)),(10,t(3)),(50,N(0,1)),(50,t(3))) setups are 0.004, 0.042, 0.002 and 0.036, respectively. Overall, the proposed method achieves satisfactory false positive controls under the null model.

4 Case study

The body mass index of the mother (MBMI) is often used as a measure of the mothers’ body shape and degree of obesity. Since the baby resides inside its mother’s womb, its environment is defined through its mother. Increasing evidence indicates that both pre-pregnant weight (BMI) and weight gain in pregnancy have a major influence on babies’ birth weight (Stamnes Koepp et al., 2012). Due to the complicated interaction between the genes of the fetus and the mother’s level of obesity, the birth weight might be different for a fetus with the same genes but under different environment conditions. Thus, variation in birth weight could be totally or partially explained by the underlying genetic machinery and how those genes respond to the mother’s obesity to affect birth weight.

We applied the method to a real dataset from a study conducted in the Department of Obstetrics and Gynecology at Sotero del Rio Hospital in Puente Alto, Chile. The initial objective of the study was to pinpoint genetic variants associated with a binary response indicating large for gestational age (LGA) or small for gestational age (SGA) infants based on the birthweight of new born babies. After data cleaning by removing SNPs with MAF less than 0.05 or deviation from the Hardy-Weinberg equilibrium, the dataset contains 1536 new born babies genotyped with 189 candidate genes covering 660 single nucleotide polymorphisms (SNPs).

Genes were mapped to the KEGG pathway using the GATHER software which can be accessed at http://gather.genome.duke.edu. A total 30 pathways based on 189 candidate genes were retrieved. We treated the mother’s BMI as the environmental factor and the baby’s birth weight as the response variable; this was standardized before fitting to the model. Since some genes were mapped to multiple pathways, we did the variable selection for each pathway separately. Table 3 shows the selection results with SNP ID, the gene and pathway name the SNP(s) belong(s) to and the selected effect. Two SNPs in gene IL2 were mapped to two pathways and both SNPs consistently show varying effects in the two pathways. SNP rs2069762 in gene IL2 was previously reported to be associated with preterm birth and low birthweight in a Japanese population study (Sata F et al., 2009). Several other SNPs in gene IL1B were also reported to be associated with low birthweight in that paper. In addition, one SNP in gene IL1B in the Toll-like receptor signalling pathway was selected as a varying effect. Two SNPs in gene COL1A2 were mapped to two pathways and both were selected as varying effects. SNP rs997049 in gene IL1R1 was selected as a constant effect in two different pathways.

Table 3:

List of selected SNPs in each pathway with constant and varying coefficients.

Pathway (# of genes)(# of SNPs)	SNP ID	Gene	Selected Effect
Cytokine-cytokine receptor interaction(45)(123)	rs2069762	IL2	varying
	rs2069772	IL2	varying
	rs997049	IL1R1	constant
Complement and coagulation cascades(18)(53)	rs2053044	ADRB2	constant
Jak-STAT signaling pathway(24)(65)	rs2069762	IL2	varying
	rs2069772	IL2	varying
ECM-receptor interaction pathway(15)(95)	rs2301643	COL1A2	varying
	rs13240759	COL1A2	varying
Toll-like receptor signaling pathway(15)(21)	rs3136558	IL1B	constant
Focal adhesion(21)(109)	rs2301643	COL1A2	varying
	rs13240759	COL1A2	varying
Apoptosis(8)(20)	rs997049	IL1R1	constant
Glycolysis/Gluconeogenesis(1)(2)	rs10891315	DLAT	constant
Pyruvate metabolism(1)(2)	rs10891315	DLAT	constant

Figure 3 plots the varying coefficient function for two SNPs, SNP rs2039762 in the Cytokine-cytokine receptor interaction pathway and SNP rs2301643 in the ECM-receptor interaction pathway. The varying pattern of the function over mother’s BMI indicates the nonlinear interaction of the SNPs with mother’s BMI condition to affect birth weight. When fitting a linear interaction model, no SNPs show significant interaction with mother’s BMI (data not shown).

Figure 3:

The estimated varying coefficient function for SNP rs2039762 in the Cytokine-cytokine receptor interaction pathway and SNP rs2301643 in the ECM-receptor interaction pathway.

5 Discussion

The significance of G×E interactions in complex human disease traits has stimulated widespread discussion. As reviewed in Cornelis et al. (2011), a number of statistical models have been proposed to assess gene effect under different environmental exposures. The success of gene set based association analysis, as shown in Wang, Li, and Hakonarson (2011), Cui et al. (2008), Wu and Cui (2013), and Schaid et al. (2012), motivated us to propose a high dimensional variable selection approach to understand the mechanism of G×E interactions associated with complex diseases. We adopted a penalized regression method within the VC model framework to investigate how multiple variants within a genetic system are moderated by environmental factors to influence the phenotypic response.

Within the model-based regression framework, most G×E interactions are modeled via a product term between a G and an E variable (Hutter et al., 2013), so the contribution of a genetic variant to the phenotypic variation is considered as a linear function in the environmental factor. Any non-linear interaction can be pursued to relax the linearity assumption (Ma et al., 2011; Wu & Cui, 2013). As pointed out by one reviewer, statistical interactions introduced by R.A. Fisher, are defined as deviation from a generalized linear model, which implicitly suggests a nonlinear relationship and is more general. To avoid confusion with the nonlinear G×E interaction presented in this work, we make it clear that our nonlinear G×E interaction refers to the effect of a genetic variant assessed as a nonlinear function of an environment variable.

In a G×E study, people are typically interested in assessing variants which are sensitive to environment changes and those that are not. We can determine if a particular genetic variant is sensitive to environmental stimuli by examining the status of the coefficient function. Varying-coefficients and constants can be separated through B-spline basis expansions under a penalized framework. The varying coefficients correspond to G×E effects and the constant effects correspond to no interaction effects. Through another penalty function, we can further shrink the constant effect into zero if the corresponding SNP has no genetic effect. We developed a two-stage iterative estimation procedure with double SCAD penalty functions. Although the two-stage strategy has been adopted for regularized quantile regression with adaptive LASSO in Tang et al. (2012), our work significantly differs in that we focused on the regularized least square regression and rigorously establish the asymptotic properties of the nonconvex double SCAD estimator under suitable regularity conditions. The potential of non-convex penalty functions in investigating G×E interactions is far from fully understood or explored. As a representative non-convex penalty function, the SCAD is adopted mainly due to its nice oracle properties as stepping stones for building statistically sound and practically useful models to accommodate more complex data structures. It is worth to mention that our method is fundamentally different from the work of Xue and Qu (2012) and Antoniadis, Gijbels, and Lambert-Lacroix (2014) in which the authors developed a variable selection framework under the additive VC model to distinguish zero vs varying coefficients. They did not distinguish non-zero constant vs varying coefficients which is one of the key objects in understanding the mechanisms of G×E interaction. Identification of the constant coefficients in the varying coefficient models is closely related to the estimation of linear part in additive models. This line of work includes Hu and Xia (2012) and Zhang, Cheng, and Liu (2011). None of the existing studies closely explore the automatic structure identification and separation of different effects under the G×E framework.

The current work only demonstrates the case with one environmental factor. It is broadly recognized that the etiology of many complex disease is less likely to be affected by one environmental factor but is more likely to be heterogeneous. When multiple continuously measured environmental factors (say K₁) are measured (denoted as Z₁), we can extend the current model to a more general case formulated as follows,

Y = ∑ j = 0 d { ∑ k = 1 K 1 β k j ( Z 1 k ) } X j + ε ,

where X₀ = 1. The same estimation and variable selection framework can be applied to select important genetic players that show sensitivity to different environmental stimuli. When discrete environmental variables such as smoking status are also available, denote Z₂ as a collection of K₂ such variables, then we can fit the following model

Y = ∑ j = 0 d { ∑ k = 1 K 1 β k j ( Z 1 k ) + ∑ l = 1 K 2 α l j Z 2 l } X j + ε ,

the partial linear varying-coefficient model. In addition to the two penalty functions specified in this work, an additional penalty function should be imposed for {α}lj to select important variants showing interaction with Z₂. In case of a binary response, we are interested in modeling E[Y|Z1,Z2,X]=∑j=0d{∑k=1K1βkj(Z1k)+∑l=1K2αljZ2l}Xj. We will investigate this in future studies.

The proposed method is not only restricted to quantitative phenotypes and can be extended to other types of phenotypes. For example, in cancer prognostic studies, it can be modified as the Accelerated failure time (AFT) model to accommodate the survival outcomes. Binary phenotypes significantly differ from quantitative and survival outcomes in that they contain much less information, hence the accuracy of estimating nonlinear interactions might be sacrificed. Nevertheless, extension to the binary case can be done by developing a coordinate descent (CD) based iteratively reweighted least squares (IRLS) algorithm under the regularized logistic regression framework. The CD based IRLS algorithm have been extensively used to extend regularized variable selection methods from continuous phenotypes to binary phenotypes such as in case control studies.

In the model, we did not include any covariates. However, the proposed varying coefficient model can be readily modified to allow for covariate effects. Typically, the covariates included in the model are predetermined as important ones and are in low dimension, so their effects are not subject to penalization. Assuming there are no interactions between genes and those covariates (those with interactions will be included in the model), one can fit a regression model by regressing Y against those covariates only, assuming either linear on nonlinear effects, Then focusing the obtained residuals (after removing the covariates effects) to do the rest of the analysis by fitting the models described in this paper. It is also worth mentioning that the real data analysis in this work does not take other covariates (e.g. gender and mother’s gestational age) effects into account, which may lead to biased results. Due to this limitation, readers should be cautious when interpreting the real data analysis results.

Acknowledgements

The authors wish to thank two anonymous referees for their constructive comments that greatly improved the manuscript. We would also like to thank Dr. L. Wang for insightful discussions on parameter estimation and Dr. R. Romero for sharing the birthweight data. This work was supported in part by grants from National Natural Science Foundation of China (31371336) and National Science Foundation (IOS-1237969), and by an Innovative Research Award from the Johnson Cancer Research Center at Kansas State University.

Appendix

Technical Proofs

Useful notations and lemmas

For convenience, the following notations are adopted :

γ(v)=(γ0T,…,γvT)T, γ(c)=(γv+1T,…,γcT)T,

γ(d)=(γv+1,1T,…,γd,1T)T, γ~(v)=(γ~0T,…,γ~vT)T,

γ~(c)=(γ~v+1T,…,γ~cT)T, γ~(d)=(γv+1,1,…,γd,1)T,

Gn=(B(z1),…,B(zn))(B(z1),…,B(zn))T,

ε=(ε1,…,εn)T, Φn=n−1∑i=1nU(v)iU(v)iT,

Ψn=n−1∑i=1nU(v)iU(c)iT, Λi=U(c)i−ΨnTΦn−1U(v)i,

where U(v) and U(c) are the sub design matrices corresponding to the predictors with varying and nonzero constant coefficients respectively. We use ‖∙‖ to denote the L₂ norm ‖∙‖2 in the A.

We first provide several lemmas necessary for the proofs of Theorems 1 and 2. Lemma 1 follows directly from the proof of Lemma A.3 in Huang, Wu, and Zhou (2004), and Lemma 2 follows from Corollary 6.21 of Schumaker (1981).

Lemma 1

Under assumptions (A1–A3), there exists finite positive constants C₁ and C₂ such that all the eigenvalues of (kn/n)Gn fall between C₁ and C₂, and therefore, G_n is invertible.

Lemma 2

Under assumptions (A1–A3), for some finite constant C₃, there exists γ~=(γ~0T,…,γ~dT)T satisfying

‖γ~j∗‖>C3, j = 0, …, v; γ~j1=βj, ‖γ~j∗‖=0, j=v+1,…,c; γ~j=0, j=c+1,…,d;
supz∈[0,1]|βj(z)−B(z)Tγ~j|=O(kn−r), j = 0, …, d, where γ~j=(γ~j,1,γ~j∗T)T;
sup(z,x)∈[0,1]×Rd+1|xTβ(z)−U(x,z)Tγ~|=O(kn−r).

Proofs of Theorem 1

(I) Proof of Theorem 1(1), part 1

Here we first show β^j(z) is constant for j=v+1,…,d with probability approaching 1 as n → ∞, which amounts to demonstrating ‖γ^j∗vc‖=0, j=v+1,…,d with probability tending to 1, as n → ∞. To this end, we first show that a minimizer γ^VC of Q1(γ) exists in a neighborhood of γ~ where

(11) Q 1 ( γ ) = ∑ i = 1 n ( Y i − U i T γ ) 2 + n ∑ j = 1 d p λ 1 ( ‖ γ j ∗ ‖ ) .

Let αn=n−12kn+an, where an:=maxj{|pλ1′(‖γ~j∗‖)|,|pλ2′(|γ~j,1|)|:γ~j∗≠0,γ~j,1≠0}. The property of SCAD penalty function implies that if max{λ1,λ2}→0, a_n = 0. We show that for any given ε > 0, there exists a large constant C such that

(12) P { inf ‖ δ ‖ = C Q 1 ( γ ^ V C ) ≥ Q 1 ( γ ~ ) } ≥ 1 − ε ,

where γ^vc=γ~+αnδ. This suggests that with probability at least 1 − ε there exists a local minimum in the ball {γ~+αnδ:‖δ‖≤C}. Hence, there exists a local minimizer such that ‖γ^vc−γ~‖=Op(αn). A direct computation yields

D n ( δ ) = Q 1 ( γ ^ v c ) − Q 1 ( γ ~ ) = − 2 α n ∑ i = 1 n [ ε i + X i T r ( z i ) ] U i T δ + α n 2 ∑ i = 1 n U i T δ δ T U i + n ∑ j = 1 d [ p λ 1 ( ‖ γ ^ j ∗ v c ‖ ) − p λ 1 ( ‖ γ ~ j ∗ ‖ ) ] := Δ 1 + Δ 2 + Δ 3

where rj(z)=B(z)Tγ~j−βj(z), j = 1, …, d and r(z)=(r1(z),…,rd(z))T. By the fact E(εi|Ui,zi)=0, we obtain that

1 n ∑ i = 1 n ε i U i T δ = O p ( ‖ δ ‖ ) .

Recall Lemma 2, then

1 n ∑ i = 1 n X i T r ( z i ) U i T δ = O p ( k n − r ‖ δ ‖ ) .

Therefore

Δ 1 = O p ( n α n ‖ δ ‖ ) + O p ( n k n − r α n ‖ δ ‖ ) = O p ( n k n − r α n ) ‖ δ ‖ .

We can also show that Δ2=Op(nαn2)‖δ‖2. Then, by choosing a sufficiently large C, Δ₁ is dominated by Δ₂ uniformly in ‖δ‖=C. It follows from Taylor expansion that

Δ 3 ≤ n ∑ j = 1 d [ α n p λ 1 ′ ( ‖ γ ~ j ∗ ‖ ) γ ~ j ∗ ‖ γ ~ j ∗ ‖ ‖ δ j ∗ ‖ + α n 2 p λ 1 ′ ′ ( ‖ γ ~ j ∗ ‖ ) ‖ δ j ∗ ‖ 2 ( 1 + o p ( 1 ) ) ] ≤ n d α n a n ‖ δ ‖ + n b n α n 2 ‖ δ ‖ 2 .

With assumption (A6), we can prove that Δ₂ dominates Δ₃ uniformly in ‖δ‖=C. Therefore, (12) holds for sufficiently large C, and we have ‖γ^vc−γ~‖=Op(αn).

In order to prove β^j(z) is constant for j=v+1,…,d in probability, it is sufficient to demonstrate that γ^j∗vc=0, j=v+1,…,d. Note that when max{λ1,λ2}→0, a_n = 0 for large n. Then we need to show that with probability approaching 1 as n → ∞, for any γ^vc satisfying ‖γ^vc−γ~‖=Op(n−12kn) and some small εn=Cn−12kn, we have

∂ Q 1 ( γ ) ∂ γ j , ∗ < 0 , for − ε n < γ j , ∗ < 0 , j = v + 1 , … , d ; > 0 , for 0 < γ j , ∗ < ε n , j = v + 1 , … , d .

where γj,∗ denotes the individual component of γj∗. It can be shown that,

∂ Q 1 ( γ ^ v c ) ∂ γ ^ j , ∗ v c = − 2 ∑ i = 1 n U i j [ Y i − U i T γ ^ v c ] + n p λ 1 ′ ( | γ ^ j , ∗ | ) sgn ( γ ^ j , ∗ ) = − 2 ∑ i = 1 n U i j [ ε i + X i T r ( z i ) ] − 2 ∑ i = 1 n U i j U i T [ γ ~ − γ ^ v c ] + n p λ 1 ′ ( | γ ^ j , ∗ | ) sgn ( γ ^ j , ∗ v c ) = n λ 1 [ O p ( λ 1 − 1 n − r + 1 / 2 2 r + 1 ) + λ 1 − 1 p λ ′ ( | γ ^ j , ∗ | ) sgn ( γ ^ j , ∗ v c ) ] .

By assumption (A5), λ1−1n−r+1/22r+1→0. Then it follows from assumption (A7) that the sign of the derivative is completely determined by that of γ^j,∗vc. Therefore, γ^vc, the minimizer of Q₁, is achieved at γ^j∗vc=0, j=v+1,…,d. This completes the proof of Theorem 1(1), part 1. □

(II) Proof of Theorem 1(2)

Next we establish the consistency of the varying coefficient estimators. Let αn=n−12kn+an, γ^(v)=γ~(v)+αnδv, γ^(d)=γ~(d)+αnδd, δ=(δvT,δdT)T, and

(13) Q 2 ( γ ( v ) , γ ( d ) ) = ∑ i = 1 n ( Y i − U ( v ) i T γ ( v ) − U ( d ) i T γ ( d ) ) 2 + n ∑ j = v + 1 d p λ 2 ( | γ j , 1 | ) .

We first show that there exists a local minimizer of Q2(γ(v),γ(d)). It suffices to show that for any given ε > 0, there exists a large constant C such that

(14) P { inf ‖ δ ‖ = C Q 2 ( γ ^ ( v ) , γ ^ ( d ) ) ≥ Q 2 ( γ ~ ( v ) , γ ~ ( d ) ) } ≥ 1 − ε .

which implies that with probability at least 1 − ε there exists a local minimum in the ball {γ~(v)+αnδv:‖δv‖≤C} and {γ~(d)+αnδd:‖δd‖≤C}, respectively. Therefore, there exists local minimizers such that ‖γ^(v)−γ~(v)‖=Op(αn) and ‖γ^(d)−γ~(d)‖=Op(αn). We have

D n ( δ v , δ d ) = Q 2 ( γ ^ ( v ) , γ ^ ( d ) ) − Q 2 ( γ ~ ( v ) , γ ~ ( d ) ) = − 2 α n ∑ i = 1 n [ ε i + X i T r ( z i ) ] [ U ( v ) i T δ ( v ) + U ( d ) i T δ ( d ) ] + α n 2 ∑ i = 1 n [ U ( v ) i T δ ( v ) + U ( d ) i T δ ( d ) ] 2 + n ∑ j = v + 1 d [ p λ 2 ( | γ ^ j , 1 | ) − p λ 2 ( | γ ~ j , 1 ) | ] := Δ 1 + Δ 2 + Δ 3 ,

where r(z)=(r1(z),…,rd(z))T and rj(z)=B(z)Tγ~j−βj(z), j = 1, …, d. Since E(εi|U(v), U(d),zi)=0, we have

(15) 1 n ∑ i = 1 n ε i [ U ( v ) i T δ ( v ) + U ( d ) i T δ ( d ) ] = O p ( ‖ δ ‖ ) .

With Lemma 2 we can show

1 n ∑ i = 1 n X i T r ( z i ) [ U ( v ) i T δ ( v ) + U ( d ) i T δ ( d ) ] = O p ( k n − r ‖ δ ‖ ) .

Combine the above two equations, we can obtain that

Δ 1 = O p ( n 1 2 α n ‖ δ ‖ ) + O p ( n k n − r α n ‖ δ ‖ ) = O p ( n k n − r α n ) ‖ δ ‖ .

Since Δ2=Op(nαn2)‖δ‖2, it can be shown that by choosing a sufficiently large C, Δ₁ is dominated by Δ₂ uniformly in ‖δ‖=C. By Taylor expansion,

Δ 3 ≤ n ∑ j = v + 1 d [ α n p λ 2 ′ ( | γ ~ j , 1 | ) sgn ( γ ~ j , 1 ) | δ j 1 | + α n 2 p λ 2 ′ ′ ( | γ ~ j , 1 | ) δ j 1 2 ( 1 + o ( 1 ) ) ] ≤ ( d − v ) 1 2 n α n a n ‖ δ ‖ + n b n α n 2 ‖ δ ‖ 2 .

Recall assumption A6, then it follows that, by choosing an enough large C, Δ₂ dominates Δ₁ uniformly in ‖δ‖=C. Consequently (14) holds for sufficiently large C, and we have ‖γv^−γv~‖=Op(αn) and ‖γd^−γd~‖=Op(αn). By the definition of γcz, we have γ^(d)cz−γ~(d)=Op(αn). Then for j = 0, …, v

‖ β ^ j ( z i ) − β j ( z ) ‖ 2 = ∫ 0 1 [ β ^ j ( z ) − β j ( z ) ] 2 d z ≤ 2 ∫ 0 1 [ B ( z ) T γ ^ j c z ( z ) − B ( z ) T γ ~ j ] 2 d z + 2 ∫ 0 1 r j 2 ( z ) d z = 2 n ( γ ^ j c z − γ ~ j ) T G n ( γ ^ j c z − γ ~ j ) + 2 ∫ 0 1 r j 2 ( z ) d z := Δ 1 + Δ 2 .

Recall Lemma 1, Lemma 2 and kn=O(n12r+1), we can demonstrate that Δ1=Op(kn−1αn2), Δ2=Op(kn−2r). Δ₁ is dominated by Δ₂, thus we finish the proof of Theorem 1(2). □

(III) Proof of Theorem 1(1), part 2

To show β^j(z)=0 for j=c+1,…,d, it is sufficient to demonstrate that γ^j,1cz=0, since the constancy of βj(z), j=v+1,…,d was already established in (B). By definition, when max{λ1,λ2}→0, a_n = 0 for large n. Then we need to prove that with probability approaching 1 as n → ∞, for any γ^(v) and γ^(d) satisfying ‖γ^(v)−γ~(v)‖=Op(n−12kn), and ‖γ^(d)−γ~(d)‖=Op(n−12kn), as well as some small εn=Cn−12kn, we have

∂ Q 2 ( γ ( v ) , γ ( d ) ) ∂ γ j , 1 < 0 , for − ε n < γ j , 1 < 0 , j = c + 1 , … , d ; > 0 , for 0 < γ j , 1 < ε n , j = c + 1 , … , d .

It can be shown that

∂ Q 2 ( γ ^ ( v ) , γ ^ ( d ) ) ∂ γ ^ j , 1 = − 2 ∑ i = 1 n U ( d ) i j [ Y i − U ( v ) i T γ ^ ( v ) − U ( d ) i T γ ^ ( d ) ] + n p λ ′ ( | γ ^ j , 1 | ) sgn ( γ ^ j , 1 ) = − 2 ∑ i = 1 n U ( d ) i j [ ε i + X i T r ( z i ) ] − 2 ∑ i = 1 n U ( d ) i j U ( v ) i T [ γ ~ v − γ ^ v ] − 2 ∑ i = 1 n U ( d ) i j U ( d ) i T [ γ ~ d − γ ^ d ] + n p λ ′ ( | γ ^ j , 1 | ) sgn ( γ ^ j , 1 ) = n λ 2 [ O p ( λ 2 − 1 n − r + 1 / 2 2 r + 1 ) + λ 2 − 1 p λ ′ ( | γ ^ j , 1 | ) sgn ( γ ^ j , 1 ) ] .

By assumption (A5), λ2−1n−r+1/22r+1→0. Then it follows from assumption (A7) that the sign of the derivative is completely determined by that of γ^j,1. Therefore, γ^cz, the minimizer of Q₂, is achieved at γ^j,1cz=0, j=c+1,…,d. This completes the proof of Theorem 1(1). □

Proof of Theorem 2

In Theorem 1, we showed that both γ^j∗=0, j=v+1,…,c and γ^j=0, j=c+1,…,d, hold with probability approaching 1. Then Q₂ reduces to

(16) Q 2 ( γ ( v ) , γ ( d ) ) = ∑ i = 1 n ( Y i − U ( v ) i T γ ( v ) − U ( c ) i T γ ( c ) ) 2 + n ∑ j = v + 1 c p λ 2 ( | γ j , 1 | ) := Q 2 ( γ ( v ) , γ ( c ) ) .

Since (γ^(v),γ^(c)) is the minimizer of Q2(γ(v),γ(c)), we obtain

∂ Q 2 ( γ ^ ( v ) , γ ^ ( c ) ) ∂ γ ^ ( v ) = − 2 ∑ i = 1 n U ( v ) i [ Y i − U ( v ) i T γ ^ ( v ) − U ( d ) i T γ ^ ( d ) ] = 0 ;

(17) ∂ Q 2 ( γ ^ ( v ) , γ ^ ( c ) ) ∂ γ ^ ( c ) = − 2 ∑ i = 1 n U ( c ) i [ Y i − U ( v ) i T γ ^ ( v ) − U ( c ) i T γ ^ ( c ) ] + n ∑ j = v + 1 c p λ 2 ′ ( | γ ^ j , 1 | ) sgn ( γ ^ j , 1 ) = 0.

By applying Taylor expansion on pλ2′(|γ^j,1|) in (17), we have

p λ 2 ′ ( | γ ^ j , 1 | ) = p λ 2 ′ ( | γ j , 1 | ) + p λ 2 ′ ′ ( | γ j , 1 | ) ( γ ^ j , 1 − γ j , 1 ) [ 1 + o p ( 1 ) ] .

By the fact that pλ2′(|γ^j,1|)=0 as λ₂ → 0, and pλ2′′(|γj,1|)=op(1) from the assumption, it follows that

∑ j = v + 1 c p λ 2 ′ ( | γ ^ j , 1 | ) sgn ( γ ^ j , 1 ) = o p ( γ ^ j , 1 − γ j , 1 ) = o p ( γ ^ ( c ) − γ ( c ) ) .

Consequently, we have

1 n ∑ i = 1 n U ( c ) i [ Y i − U ( v ) i T γ ^ ( v ) − U ( c ) i T γ ^ ( c ) ] + o p ( γ ^ ( c ) − γ ( c ) ) = 0.

Following similar lines of arguments in Theorem 1, we can show

(18) 1 n ∑ i = 1 n U ( c ) i [ ε i + X i T r ( z i ) + U ( v ) i T ( γ ( v ) − γ ^ ( v ) ) + U ( c ) i T ( γ ( c ) − γ ^ ( c ) ) ] + o p ( γ ^ ( c ) − γ ( c ) ) = 0.

Meanwhile, a straightforward calculation yields

(19) 1 n ∑ i = 1 n U ( v ) i [ ε i + X i T r ( u i ) + U ( v ) i T ( γ ( v ) − γ ^ ( v ) ) + U ( c ) i T ( γ ( c ) − γ ^ ( c ) ) ] = 0.

Recall the definition of Φ_n and Ψ_n, (19) is equivalent to

(20) γ ^ ( v ) − γ ( v ) = Φ n − 1 { 1 n ∑ i = 1 n U ( v ) i [ ε i + X i T r ( z i ) ] + Ψ n [ γ ( c ) − γ ^ ( c ) ] } .

Plugging (20) into (18) results in

(21) 1 n ∑ i = 1 n U ( c ) i { ε i + X i T r ( z i ) − U ( v ) i T Φ n − 1 1 n ∑ i = 1 n U ( v ) i × [ ε i + X i T r ( z i ) ] } = 1 n ∑ i = 1 n U ( c ) i [ U ( c ) i − Ψ n T Φ n − 1 U ( v ) i ] T ( γ ^ ( c ) − γ ( c ) ) + o p ( γ ^ ( c ) − γ ( c ) ) .

Together with the facts that

1 n ∑ i = 1 n Ψ n T Φ n − 1 U ( v ) i [ ε i + X i T r ( z i ) − U ( v ) i T Φ n − 1 × 1 n ∑ j = 1 n U ( v ) j [ ε j + X j T r ( z j ) ] ] = 0

and

1 n ∑ i = 1 n Ψ n T Φ n − 1 U ( v ) i [ U ( c ) i − Ψ n T Φ n − 1 U ( v ) i ] T = 0.

and recall the definition of Λ_i, a direct computation from (21) leads to

[ 1 n ∑ i = 1 n Λ i Λ i T + o p ( 1 ) ] n ( γ ( c ) − γ ^ ( c ) ) = 1 n ∑ i = 1 n Λ i ε i + 1 n ∑ i = 1 n Λ i X i T r ( z i ) + 1 n ∑ i = 1 n Λ i U ( v ) i T Φ n − 1 1 n ∑ j = 1 n U ( v ) j [ ε j + X j T r ( z j ) ] := Δ 1 + Δ 2 + Δ 3 .

It follows from the law of large numbers that

1 n ∑ i = 1 n Λ i Λ i T → p Σ

where

(22) Σ = E [ U ( c ) ( I − U ( v ) ( U ( v ) U ( v ) T ) − 1 U ( v ) T ) U ( c ) T ]

Consequently,

Δ 1 → d N ( 0 , σ 2 Σ )

follows from central limit theorem. Because X_i is bounded and ‖r(z)‖=op(1), we have Δ2=op(1). Besides, ∑i=1nΛiU(v)iT=0 implies that Δ₃ = 0. Therefore, by Slutsky theorem, we complete the proof of Theorem 2. □

References

Antoniadis, A., I. Gijbels and S. Lambert-Lacroix (2014): “Penalized estimation in additive varying coefficient models using grouped regularization,” Stat. Pap., 55, 727–750.10.1007/s00362-013-0522-1Search in Google Scholar

Chatterjee, N. and R. J. Carroll (2005): “Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies,” Biometrika, 92, 399–418.10.1093/biomet/92.2.399Search in Google Scholar

Chen, Y.-H., N. Chatterjee and R. J. Carroll (2013): “Using shared genetic controls in studies of gene-environment interactions,” Biometrika, 100, 319–338.10.1093/biomet/ass078Search in Google Scholar PubMed PubMed Central

Cornelis, M. C., E. J. Tchetgen, L. Liang, L. Qi, N. Chatterjee, F. B. Hu and P. Kraft (2011): “Gene-environment interactions in genome-wide association studies: a comparative study of tests applied to empirical studies of type 2 diabetes,” Am. J. Epidemiol., 175, 191–202.10.1093/aje/kwr368Search in Google Scholar PubMed PubMed Central

Cui, Y. H., G. L. Kang, K.L. Sun, M. Qian, R. Romero and W. Fu (2008): “Gene-centric genomewide association study via entropy,” Genetics, 179, 637–650.10.1534/genetics.107.082370Search in Google Scholar PubMed PubMed Central

Efron, B. and R. Tibshirani (2007): “On testing the significance of sets of genes,” Ann. Appl. Stat., 1, 107–129.10.1214/07-AOAS101Search in Google Scholar

Feinberg, A. P. (2004): “Phenotypic plasticity and the epigenetics of human disease,” Nature, 447, 433–440.10.1038/nature05919Search in Google Scholar PubMed

Fan, J. Q. and R. Z. Li (2001): “Variable selection via nonconcave penzlied likelihood and its oracle properties,” J. Am. Stat. Assoc., 96, 1348–1360.10.1198/016214501753382273Search in Google Scholar

Guo, S. W. (2000): “Gene-environment interaction and the mapping of complex traits: some statistical models and their implications,” Hum. Hered., 50, 286–303.10.1159/000022931Search in Google Scholar PubMed

Hastie, T. and R. Tibshirani (1993): “Varying-coefficient models,” J. R. Stat. Soc. B, 55, 757–796.10.1111/j.2517-6161.1993.tb01939.xSearch in Google Scholar

Hu, T. and Y. Xia (2012): “Adaptive semi-varying coefficient model selection,” Stat. Sin., 22, 575–599.10.5705/ss.2010.105Search in Google Scholar

Huang, J. Z., Wu, C. O., and Zhou, L. (2002): “Varying-coefficient models and basis function approximations for the analysis of repeated measurements.” Biometrika, 89, 111–128.10.1093/biomet/89.1.111Search in Google Scholar

Huang, J. H., Wu, C. O., and Zhou L. (2004): “Polynomial spline estimation and inference for varying coefficient models with longitudinal data,” Stat. Sin., 14, 763–788.Search in Google Scholar

Hutter, C. M., L. E. Mechanic, N. Chatterjee, P. Kraft and E. M. Gillanders. (2013): “Gene-environment interactions in cancer epidemiology: a national cancer institute think tank report,” Genet. Epidemiol., 37, 643–657.10.1002/gepi.21756Search in Google Scholar PubMed PubMed Central

Kim, M. O. (2007): “Quantile regression with varying coefficients,” Ann. Stat., 35, 92–108.10.1214/009053606000000966Search in Google Scholar

Liu, L., Y. Li and T. O. Tollefsbol (2008): “Gene-environment interactions and epigenetic basis of human diseases,” Curr. Issues Mol. Biol., 10, 25–36.Search in Google Scholar

Ma, S., L. Yang, R. Romero and Y. Cui (2011): “Varying coefficient model for gene-environment interaction: a non-linear look,” Bioinformatics, 27, 2119–2126.10.1093/bioinformatics/btr318Search in Google Scholar PubMed PubMed Central

Maity, A., R. J. Carrol, E. Mammen and N. Chatterjee (2009): “Testing in semiparametric models with interaction, with applications to gene-environment interactions,” J. R. Stat. Soc. B, 71, 75–96.10.1111/j.1467-9868.2008.00671.xSearch in Google Scholar PubMed PubMed Central

Rawlings, J. S., K. M. Rosler and D. A. Harrison (2004): “The JAK/STAT signaling pathway,” J. Cell Sci., 117, 1281–1283.10.1242/jcs.00963Search in Google Scholar PubMed

Sata F, S. Toya, H. Yamada, K. Suzuki, Y. Saijo, A. Yamazaki, H. Minakami and R. Kishi (2009): “Proinflammatory cytokine polymorphisms and the risk of preterm birth and low birthweight in a Japanese population,” Mol. Hum. Reprod., 15, 121–130.10.1093/molehr/gan078Search in Google Scholar PubMed

Schwarz, G. (1978): “Estimating the dimension of a model,” Ann. Stat., 6, 461–464.10.1214/aos/1176344136Search in Google Scholar

Schaid, D. J., J. P. Sinnwell, G. D. Jenkins, S. K. McDonnell, J. N. Ingle, M. Kubo, P. E. Goss, J. P. Costantino, D. L. Wickerham, and R. M. Weinshilboum (2012): “Using the gene ontology to scan multilevel gene sets for associations in genome wide association studies,” Genet. Epidemiol., 36, 3–16.10.1002/gepi.20632Search in Google Scholar PubMed PubMed Central

Schumaker, L. L. (1981): Spline Functions: basic theory, Wiley, New York.Search in Google Scholar

Stamnes Koepp, U. M., L. F. Andersen, K. Dahl-Joergensen, H. Stigum, O. Nass and W. Nystad (2012): “Maternal pre-pregnant body mass index, maternal weight change and offspring birthweight,” Acta Obstet. Gynecol. Scand., 91, 243–249.10.1111/j.1600-0412.2011.01321.xSearch in Google Scholar PubMed

Tang, Y. L., H. X. Wang, Z. Y. Zhu, X. Song (2012): “A unified variable selection approach for varying coefficient models,” Stat. Sin., 22, 601–628.10.5705/ss.2010.121Search in Google Scholar

Wang, K., M. Li and H. Hakonarson. (2011): “Analysing biological pathways in genome-wide association studies,” Nat. Rev. Genet., 11, 843–854.10.1038/nrg2884Search in Google Scholar PubMed

Wang, L. F., H. Z. Li and J. Z. Huang. (2008): “Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements,” J. Am. Stat. Assoc., 103, 1556–1569.10.1198/016214508000000788Search in Google Scholar PubMed PubMed Central

Wu, C. and Y. Cui (2013): “A novel method for identifying nonlinear gene-environment interactions in case-control association studies,” Hum. Genet., 132, 1413–1425.10.1007/s00439-013-1350-zSearch in Google Scholar PubMed

Wu, C. and Y. Cui (2014): “Boosting signals in gene-based association studies via efficient SNP selection,” Brief. Bioinform., 15, 279–291.10.1093/bib/bbs087Search in Google Scholar PubMed

Xue, L. and A. Qu (2012): “Variable selection in high-dimensional varying coefficient models with global optimality,” J. Mach. Learn. Res., 13, 1973–1998.Search in Google Scholar

Zhang, H. H., G. Cheng and Y. Liu (2011): “Linear or nonlinear? Automatic structure discovery for partially linear models,” J. Am. Stat. Assoc., 106, 1099–1112.10.1198/jasa.2011.tm10281Search in Google Scholar PubMed PubMed Central

Published Online: 2018-02-08

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Articles in the same Issue

https://doi.org/10.1515/sagmb-2017-0008

Keywords for this article

B-spline; gene-set analysis; local quadratic approximation; SCAD penalty; variable selection

Creative Commons

BY-NC-ND 4.0