A post-hoc Unweighted Analysis of Counter-Matched Case-Control Data

Cyril Rakovski; Bryan Langholz

doi:10.1515/ijb-2014-0018

Published/Copyright: August 26, 2015

From the journal The International Journal of Biostatistics Volume 11 Issue 2

Abstract

Informative sampling based on counter-matching risk set subjects on an exposure correlated with a variable of interest has been shown to be an efficient alternative to simple random sampling; however, the opposite is true when the two covariates are uncorrelated. Thus, the counter-matching design yields substantial gains in statistical efficiency over simple random sampling in first-stage analyses, which by design focus on variables correlated with the counter-matching variable, but loses efficiency in second-stage analyses aimed at variables that are independent of the counter-matching variable and were not conceived as part of the initial study. To recover efficiency in such second-stage analyses, we consider a naive analysis of the effect of a dichotomous covariate on the disease rates in the population that ignores the underlying counter-matching sampling design. We derive analytical expressions for the bias and variance and show that, when the counter-matching variable and the new dichotomous variable of interest are uncorrelated and a multiplicative main effects model holds, such an analysis is advantageous over the standard “weighted” approach, especially when the counter-matching variable is rare; in such scenarios the efficiency gains exceed 80%. Moreover, we consider violations of each of the required assumptions and show that moderate departures from them lead to negligible levels of bias; numerical values for the bias under common scenarios are provided. The method is illustrated via an analysis of BRCA1/2 deleterious mutations in the WECARE study of second breast cancer, which was counter-matched on radiation treatment.

Keywords: efficiency; partial likelihood; proportional hazards; sampling

1 Introduction

The counter-matching design is an exposure-stratified nested case-control study method that incorporates exposure information, available on all cohort subjects, into the case-control sample. Every counter-matching design, say CM($m_0$:$m_1$) with a dichotomous stratifying variable R taking values 0 and 1, is based on samples from the appropriate risk sets such that each sampled set contains $m_0$ subjects with R=0 and $m_1$ subjects with R=1. Counter-matching thus differs from the classical technique of matching in two respects. First, it diversifies the values of the stratifying variable within each case-control set instead of keeping them constant, which allows the effect of that variable to be estimated. Second, it is defined by the numbers of subjects in each set with R values of 0 and 1 rather than by the numbers of cases and controls.
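The sampling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_counter_matched_set`, the use of NumPy, and the representation of the risk set as a 0/1 array are choices made here, not part of the original study software.

```python
import numpy as np

def sample_counter_matched_set(r_values, case_index, m0, m1, rng):
    """Return the indices of one CM(m0:m1) set, with the case listed first.

    r_values   : 0/1 array with the counter-matching variable R for the risk set
    case_index : position of the case in the risk set
    m0, m1     : required numbers of subjects with R=0 and R=1 in the set
    """
    r_values = np.asarray(r_values)
    needed = {0: m0, 1: m1}
    # The case already fills one slot in its own R-stratum.
    needed[int(r_values[case_index])] -= 1
    chosen = [int(case_index)]
    for stratum, count in needed.items():
        # Controls are a simple random sample from the stratum, excluding the case.
        pool = np.flatnonzero(r_values == stratum)
        pool = pool[pool != case_index]
        chosen.extend(rng.choice(pool, size=count, replace=False).tolist())
    return chosen

rng = np.random.default_rng(0)
risk_set_r = rng.binomial(1, 0.2, size=500)   # hypothetical risk set
cm_set = sample_counter_matched_set(risk_set_r, case_index=0, m0=1, m1=2, rng=rng)
```

Whatever the R value of the case, the returned set always contains exactly `m0` subjects with R=0 and `m1` with R=1, which is what distinguishes the design from sampling a fixed number of controls.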

Previous studies have shown that counter-matching is generally more statistically efficient than simple random sampling of controls for estimating the effects of factors well correlated with the counter-matching variable, but less efficient for variables that are not well correlated with it [1–3]. For example, in the WECARE study of radiation, genes, and asynchronous contralateral breast cancer (CBC), unilateral breast cancer (UBC) controls were 1:2 counter-matched, CM(1:2), on radiation treatment status to each CBC case. It was demonstrated that the counter-matched design was more efficient than simple random sampling of two controls, SRS(3), for estimation of radiation main effects and gene by radiation interactions [4], but that there is some loss in efficiency for the main effect of gene. Let R and G be indicator variables for radiation treatment and gene carrier status, with R being the counter-matching variable and G being assessed only on the sample. Then, in standard proportional hazards models, counter-matching is more efficient than random sampling for assessing the main effect of R and R×G interaction effects, but less efficient for estimating the main effect of G [1, 2]. In fact, the worst situation for estimation of G effects is when G and R are independent, in which case counter-matching can be quite inefficient relative to simple random sampling, depending on E(R) [3].

Several other important studies besides WECARE, such as the Crystalline Silica Exposure and Silicosis in Gold Miners study [5], the Radiation, Hormones, and Breast Cancer of Japanese Atomic Bomb Survivors study [6], and the Early Asthma Risk Study (EARS) of In Utero and Early Life Exposures and Asthma [7], have implemented the counter-matching sampling design. They did so to exploit its gains in statistical efficiency over simple random sampling in first-stage analyses of the initial exposures of interest, by counter-matching on covariates correlated with these exposures. At a later point, however, the effects of new lifestyle, environmental and genetic risk factors may become desirable research objectives, and second-stage analyses of these datasets with respect to new covariates will be required. In the likely case that the second-stage covariates are independent of the counter-matching variable, the proper weighted analysis will lose efficiency compared to an analysis of data obtained under simple random sampling. Under mild conditions, the approach proposed in this work will enable these and future counter-matched studies to avoid the disadvantageous properties of counter-matched data when they are re-analyzed at a later point with respect to covariates independent of the counter-matching variable.

In this paper, we consider dichotomous covariates G and R and explore the estimation of the effect of G from data $m_0{:}m_1$ counter-matched on R. In particular, we show that if G and R are independent and act multiplicatively on the disease rates, then $\hat{\beta}_G$ estimated from the standard SRS($m_0+m_1$) likelihood (without regard to R) is a valid estimator of $\beta_G$, with asymptotic variance equal to that from an SRS($m_0+m_1$) sample of the same cohort population, and is at least as efficient as the appropriate CM($m_0$:$m_1$) analysis. We also found that, even though bias occurs if the independence assumption and/or the multiplicative disease rate model are violated, the actual levels of bias for moderate violations are small.

2 Methods

We rely on the nested case-control model for individually matched case-control studies in which controls are sampled from the risk sets at the failure times [3, 8]. While the counting process formulation would fully account for the failure time component of this problem, it is the sampling that is relevant here, so to keep the notation to a minimum we analyze the contribution from a single case-control set and drop reference to time [9]. Let $D$ be the case index and $\tilde{\mathcal{R}}$ be the randomly sampled counter-matched case-control set. In particular, with R known for all subjects in the risk set, consider CM($m_0$:$m_1$) sampling in which controls are simple random samples from the R-strata such that the case-control set $\tilde{\mathcal{R}}$ consists of $m_0$ and $m_1$ subjects with R=0 and R=1, respectively. With rate ratios associated with R and G given by $\exp(R\beta_R + G\beta_G)$, consider estimation of $\beta_G$ using the likelihood contribution

$$l(\beta) = \frac{\exp(G_D\beta)}{\sum_{i\in\tilde{\mathcal{R}}} \exp(G_i\beta)},$$

the “unweighted” likelihood contribution, as from an SRS($m$) sample with $m = m_0 + m_1$, that includes only G in the model.
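To make the unweighted contribution concrete, the following minimal Python sketch evaluates its logarithm for one case-control set, together with the score and information that appear later as $G - E(m,k)$ and $v(m,k)$. The function names are hypothetical and chosen here for illustration only.

```python
import math

def unweighted_loglik(beta, g_case, g_set):
    """log l(beta) for one set: g_case is G of the case, g_set lists G for all m members."""
    return g_case * beta - math.log(sum(math.exp(g * beta) for g in g_set))

def score_and_information(beta, g_case, g_set):
    """First derivative and negative second derivative of log l(beta)."""
    m = len(g_set)
    k = sum(g_set)                                     # number of exposed (G=1) members
    denom = (m - k) + k * math.exp(beta)
    expected_g = k * math.exp(beta) / denom            # E(m, k)
    info = (m - k) * k * math.exp(beta) / denom ** 2   # v(m, k)
    return g_case - expected_g, info

# Example: a set of size 3 with one carrier, who is the case.
score, info = score_and_information(math.log(2), g_case=1, g_set=[1, 0, 0])
```

Note that R does not enter these functions at all, which is exactly what "unweighted" means here: the set is treated as if it had been drawn by simple random sampling.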

The following theorem provides the essential part of the results in this paper.

Theorem 1

Let G and R be independent binary covariates that have multiplicative effects on the disease rates of the population. Assume that the data have been obtained through an implementation of an $m_0{:}m_1$ counter-matching on R. Then, an analysis of the association of G with the disease that treats the data as originating from $1{:}m-1$ simple random sampling and disregards R produces an unbiased estimate of the effect $\beta_G$. Moreover, asymptotically, the variance of $\hat{\beta}_G$ equals the variance of the corresponding estimator arising from an analysis of data obtained under SRS($m$).

Proof. By definition, the disease risks and the likelihood contributions from the sampled sets are given by $r_i = e^{G_i\beta_G + R_i\beta_R}$ and $l(\beta) = \exp(G_D\beta)/\sum_{i\in\tilde{\mathcal{R}}}\exp(G_i\beta)$, respectively. Let $\mathcal{R}$ be the population risk set from which $\tilde{\mathcal{R}}$ was sampled, and let $r\subset\mathcal{R}$ denote a set of size $m = |r| = |\tilde{\mathcal{R}}|$ with number of exposed subjects $k = \sum_{i\in r} I(G_i = 1)$. Sampling designs are expressed analytically through the sampling probabilities $\pi(r|i)$, which give the probability of selecting the set $r$ if subject $i$ were the case. Further, let $p_{i,r}$ be the probability that the $i$-th subject is the case given the risk set $r$. In the following formulae, the expectations are taken over the CM($m_0$:$m_1$) counter-matched sampling design that by definition was implemented to produce the data. Regarding the first statement in Theorem 1, a simple likelihood theory argument shows that we need to solve the following equation to derive an analytic expression for the bias:

$$E[B_{\tilde{\mathcal{R}}}(\beta)] = \sum_{r\subset\mathcal{R}} \sum_{i\in r} B_r\, p_{i,r}\, \pi(r|i) = 0, \tag{1}$$

where $B_r = G - E_r(\beta)$ is the score contribution, $G$ is the covariate value of the case, and $E_r(\beta) = E(m,k)$ is the expected value of the case's $G$ under the likelihood $l(\beta)$ and thus equals $ke^{\beta}/(m-k+ke^{\beta})$. Regarding the second statement in Theorem 1, we are interested in evaluating the asymptotic behavior of the following expression,

$$E[v_{\tilde{\mathcal{R}}}(\beta_G)] = \sum_{r\subset\mathcal{R}} \sum_{i\in r} v_r\, p_{i,r}\, \pi(r|i), \tag{2}$$

where $v_r = v(m,k)$ is the negative second derivative of the log-likelihood derived from $l(\beta)$ and is consequently given by $(m-k)ke^{\beta_G}/(m-k+ke^{\beta_G})^2$.

Therefore, the proofs of both the unbiasedness and variance statements in Theorem 1 are predicated on an evaluation of expressions of the type:

$$E[T_{\tilde{\mathcal{R}}}(\beta_G)] = \sum_{r\subset\mathcal{R}} \sum_{i\in r} T_r\, p_{i,r}\, \pi(r|i), \tag{3}$$

where $T_r = T(m,k)$ is an appropriately selected function of the total size and the number of exposed subjects in the set $r$. For simplicity of notation, we suppress the explicit dependence of $T_r$ on the first argument, since it is fixed by design throughout the subsequent sections. Let $\mathcal{R}_{jl} = \{i\in\mathcal{R} : R_i = j,\ G_i = l\}$, $j,l = 0,1$, be the four strata of subjects defined by their covariate values, with the first index referring to R and the second to G, and let $n_{jl} = |\mathcal{R}_{jl}|$. Further, we define $\pi_{rs} = \lim_{n\to\infty} n_{rs}/n$, $r,s\in\{0,1,.\}$, where we employ the classic contingency table notation $n_{j.} = n_{j0}+n_{j1}$ and $n_{.l} = n_{0l}+n_{1l}$. The following lemma provides insight into the behavior of expression (3) under the assumptions of Theorem 1.

Lemma 1

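Let $T_r = T(k)$ conform to the above definition and let $E[T_{\tilde{\mathcal{R}}}(\beta_G)] = \sum_{r\subset\mathcal{R}}\sum_{i\in r} T_r\, p_{i,r}\, \pi(r|i)$. Then, under the assumptions of Theorem 1, $V_T = \lim_{n\to\infty} E[T_{\tilde{\mathcal{R}}}(\beta_G)]$ is equivalent to a homogeneous polynomial $\sum_{k=1}^{m-1} \alpha_{m-k,k}\, \pi_{.0}^{m-k}\pi_{.1}^{k}$ with coefficients given by the following expression:

$$\alpha_{m-k,k} = \frac{T(k)\left[\binom{m-1}{k}\pi_{0.} + \binom{m-1}{k-1}\pi_{0.}e^{\beta_G} + \binom{m-1}{k}\pi_{1.}e^{\beta_R} + \binom{m-1}{k-1}\pi_{1.}e^{\beta_R+\beta_G}\right]}{(\pi_{0.}+\pi_{1.}e^{\beta_R})(\pi_{.0}+\pi_{.1}e^{\beta_G})} \tag{4}$$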

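Proof. Let $\mathcal{R}_{jl} = \{i\in\mathcal{R} : R_i = j,\ G_i = l\}$, where $j,l = 0,1$, be the four strata of subjects defined by their covariate values (first index R, second index G), and let $n_{jl} = |\mathcal{R}_{jl}|$. Then, substituting the appropriate sampling and case probabilities $\pi(r|i)$ and $p_{i,r}$ in eq. (3) yields

$$E[T_{\tilde{\mathcal{R}}}(\beta_G)] = \sum_{r\subset\mathcal{R}} \sum_{i\in r} T_r\, \frac{e^{G_i\beta_G+R_i\beta_R}}{n_{00}+n_{01}e^{\beta_G}+n_{10}e^{\beta_R}+n_{11}e^{\beta_G+\beta_R}} \cdot \frac{n_{R_i.}/m_{R_i}}{\binom{n_{0.}}{m_0}\binom{n_{1.}}{m_1}}. \tag{5}$$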

A change in the order of summation allows us to rewrite eq. (5) in the following way,

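$$E[T_{\tilde{\mathcal{R}}}(\beta_G)] = \frac{1}{\binom{n_{0.}}{m_0}\binom{n_{1.}}{m_1}} \cdot \frac{1}{n_{00}+n_{01}e^{\beta_G}+n_{10}e^{\beta_R}+n_{11}e^{\beta_G+\beta_R}} \times \sum_{i\in\mathcal{R}} e^{G_i\beta_G+R_i\beta_R}\, \frac{n_{R_i.}}{m_{R_i}} \sum_{r\subset\mathcal{R},\, r\ni i} T_r. \tag{6}$$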

After an explicit enumeration of all possible ways of obtaining samples r⊂R under the adopted CM(m0:m1) counter-matched design with respect to the strata Rjl, the expression (6) becomes:

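$$\begin{aligned}
E[T_{\tilde{\mathcal{R}}}(\beta_G)] ={}& \frac{1}{\binom{n_{0.}}{m_0}\binom{n_{1.}}{m_1}\left(n_{00}+n_{01}e^{\beta_G}+n_{10}e^{\beta_R}+n_{11}e^{\beta_G+\beta_R}\right)} \\
&\times \Bigg[ n_{00}\frac{n_{0.}}{m_0} \sum_{k_0=0}^{m_0-1}\sum_{k_1=0}^{m_1} C_{k_0}^{n_{01}}\, C_{m_0-1-k_0}^{n_{00}-1}\, C_{k_1}^{n_{11}}\, C_{m_1-k_1}^{n_{10}}\, T(k_0+k_1) \\
&\quad + n_{01}\frac{n_{0.}}{m_0} \sum_{k_0=0}^{m_0-1}\sum_{k_1=0}^{m_1} C_{k_0}^{n_{01}-1}\, C_{m_0-1-k_0}^{n_{00}}\, C_{k_1}^{n_{11}}\, C_{m_1-k_1}^{n_{10}}\, e^{\beta_G}\, T(k_0+k_1+1) \\
&\quad + n_{10}\frac{n_{1.}}{m_1} \sum_{k_0=0}^{m_0}\sum_{k_1=0}^{m_1-1} C_{k_0}^{n_{01}}\, C_{m_0-k_0}^{n_{00}}\, C_{k_1}^{n_{11}}\, C_{m_1-1-k_1}^{n_{10}-1}\, e^{\beta_R}\, T(k_0+k_1) \\
&\quad + n_{11}\frac{n_{1.}}{m_1} \sum_{k_0=0}^{m_0}\sum_{k_1=0}^{m_1-1} C_{k_0}^{n_{01}}\, C_{m_0-k_0}^{n_{00}}\, C_{k_1}^{n_{11}-1}\, C_{m_1-1-k_1}^{n_{10}}\, e^{\beta_R+\beta_G}\, T(k_0+k_1+1) \Bigg]
\end{aligned} \tag{7}$$

where for brevity we use $C_c^{n_{jl}}$ to denote the standard binomial coefficient $\binom{n_{jl}}{c}$, the number of sets of size $c$ that can be drawn from the subset of subjects $\mathcal{R}_{jl}$ with the desired pair of exposures.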

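We are interested in assessing the limit $V_T = \lim_{n\to\infty} E[T_{\tilde{\mathcal{R}}}(\beta_G)]$. To facilitate the derivation of the asymptotic behavior of $E[T_{\tilde{\mathcal{R}}}(\beta_G)]$, we multiply and divide eq. (7) by $n^{2m_0+2m_1-1}$ to obtain:

$$\begin{aligned}
V_T ={}& \frac{m_0!\, m_1!}{\pi_{0.}^{m_0}\pi_{1.}^{m_1}\left(\pi_{00}+\pi_{01}e^{\beta_G}+\pi_{10}e^{\beta_R}+\pi_{11}e^{\beta_R+\beta_G}\right)} \\
&\times \Bigg[ \sum_{k_0=0}^{m_0-1}\sum_{k_1=0}^{m_1} \frac{\pi_{00}\pi_{0.}\, \pi_{01}^{k_0}\pi_{00}^{m_0-1-k_0}\pi_{11}^{k_1}\pi_{10}^{m_1-k_1}}{m_0\, k_0!\,(m_0-1-k_0)!\,k_1!\,(m_1-k_1)!}\, T(k_0+k_1) \\
&\quad + \sum_{k_0=0}^{m_0-1}\sum_{k_1=0}^{m_1} \frac{\pi_{01}\pi_{0.}\, \pi_{01}^{k_0}\pi_{00}^{m_0-1-k_0}\pi_{11}^{k_1}\pi_{10}^{m_1-k_1}}{m_0\, k_0!\,(m_0-1-k_0)!\,k_1!\,(m_1-k_1)!}\, e^{\beta_G}\, T(k_0+k_1+1) \\
&\quad + \sum_{k_0=0}^{m_0}\sum_{k_1=0}^{m_1-1} \frac{\pi_{10}\pi_{1.}\, \pi_{01}^{k_0}\pi_{00}^{m_0-k_0}\pi_{11}^{k_1}\pi_{10}^{m_1-1-k_1}}{m_1\, k_0!\,(m_0-k_0)!\,k_1!\,(m_1-1-k_1)!}\, e^{\beta_R}\, T(k_0+k_1) \\
&\quad + \sum_{k_0=0}^{m_0}\sum_{k_1=0}^{m_1-1} \frac{\pi_{11}\pi_{1.}\, \pi_{01}^{k_0}\pi_{00}^{m_0-k_0}\pi_{11}^{k_1}\pi_{10}^{m_1-1-k_1}}{m_1\, k_0!\,(m_0-k_0)!\,k_1!\,(m_1-1-k_1)!}\, e^{\beta_R+\beta_G}\, T(k_0+k_1+1) \Bigg].
\end{aligned} \tag{8}$$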

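Further, under the assumption of independence between R and G, $\pi_{jl} = \pi_{j.}\pi_{.l}$ holds for all $j,l = 0,1$, so that the normalizing sum factors as $(\pi_{0.}+\pi_{1.}e^{\beta_R})(\pi_{.0}+\pi_{.1}e^{\beta_G})$, and after simplification (8) becomes

$$\begin{aligned}
V_T ={}& \frac{1}{(\pi_{0.}+\pi_{1.}e^{\beta_R})(\pi_{.0}+\pi_{.1}e^{\beta_G})} \\
&\times \Bigg[ \sum_{k_0=0}^{m_0-1}\sum_{k_1=0}^{m_1} \binom{m_0-1}{k_0}\binom{m_1}{k_1}\, \pi_{0.}\, \pi_{.0}^{m_0+m_1-k_0-k_1}\pi_{.1}^{k_0+k_1}\, T(k_0+k_1) \\
&\quad + \sum_{k_0=0}^{m_0-1}\sum_{k_1=0}^{m_1} \binom{m_0-1}{k_0}\binom{m_1}{k_1}\, \pi_{0.}\, \pi_{.0}^{m_0+m_1-k_0-k_1-1}\pi_{.1}^{k_0+k_1+1}\, e^{\beta_G}\, T(k_0+k_1+1) \\
&\quad + \sum_{k_0=0}^{m_0}\sum_{k_1=0}^{m_1-1} \binom{m_0}{k_0}\binom{m_1-1}{k_1}\, \pi_{1.}\, \pi_{.0}^{m_0+m_1-k_0-k_1}\pi_{.1}^{k_0+k_1}\, e^{\beta_R}\, T(k_0+k_1) \\
&\quad + \sum_{k_0=0}^{m_0}\sum_{k_1=0}^{m_1-1} \binom{m_0}{k_0}\binom{m_1-1}{k_1}\, \pi_{1.}\, \pi_{.0}^{m_0+m_1-k_0-k_1-1}\pi_{.1}^{k_0+k_1+1}\, e^{\beta_R+\beta_G}\, T(k_0+k_1+1) \Bigg].
\end{aligned} \tag{9}$$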

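Now, we view expression (9) as a homogeneous polynomial of the type $\sum_{k=1}^{m-1} \alpha_{m-k,k}\, \pi_{.0}^{m-k}\pi_{.1}^{k}$, where $m = m_0+m_1$. The coefficients are therefore given by

$$\begin{aligned}
\alpha_{m-k,k} ={}& \frac{1}{(\pi_{0.}+\pi_{1.}e^{\beta_R})(\pi_{.0}+\pi_{.1}e^{\beta_G})} \\
&\times \Bigg[ \sum_{s=0}^{k} \binom{m_0-1}{s}\binom{m_1}{k-s}\, \pi_{0.}\, T(k) + \sum_{s=0}^{k-1} \binom{m_0-1}{s}\binom{m_1}{k-1-s}\, \pi_{0.}\, e^{\beta_G}\, T(k) \\
&\quad + \sum_{s=0}^{k} \binom{m_0}{s}\binom{m_1-1}{k-s}\, \pi_{1.}\, e^{\beta_R}\, T(k) + \sum_{s=0}^{k-1} \binom{m_0}{s}\binom{m_1-1}{k-1-s}\, \pi_{1.}\, e^{\beta_R+\beta_G}\, T(k) \Bigg]
\end{aligned} \tag{10}$$

Lastly, using the combinatorial equality $\sum_{s=0}^{r} \binom{x}{s}\binom{y}{r-s} = \binom{x+y}{r}$ with $(r,x,y) = (k, m_0-1, m_1)$, $(r,x,y) = (k-1, m_0-1, m_1)$, $(r,x,y) = (k, m_0, m_1-1)$, and $(r,x,y) = (k-1, m_0, m_1-1)$, we conclude

$$\alpha_{m-k,k} = \frac{T(k)\left[\binom{m-1}{k}\pi_{0.} + \binom{m-1}{k-1}\pi_{0.}e^{\beta_G} + \binom{m-1}{k}\pi_{1.}e^{\beta_R} + \binom{m-1}{k-1}\pi_{1.}e^{\beta_R+\beta_G}\right]}{(\pi_{0.}+\pi_{1.}e^{\beta_R})(\pi_{.0}+\pi_{.1}e^{\beta_G})} \tag{11}$$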

This completes the proof of Lemma 1.
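As a numerical sanity check on the last step of the proof, the sketch below compares the bracketed sum in eq. (10) with the closed form in eq. (11) for a given design. It is a hedged illustration: the function names are hypothetical, the common factor $T(k)$ and the common denominator are omitted because they appear identically in both equations, and `pi_r0` stands for $\pi_{0.}$.

```python
from math import comb, exp, isclose, log

def bracket_eq10(m0, m1, k, pi_r0, beta_g, beta_r):
    """Bracket of eq. (10) before the combinatorial identity is applied."""
    pi_r1 = 1.0 - pi_r0
    t1 = sum(comb(m0 - 1, s) * comb(m1, k - s) for s in range(k + 1))
    t2 = sum(comb(m0 - 1, s) * comb(m1, k - 1 - s) for s in range(k))
    t3 = sum(comb(m0, s) * comb(m1 - 1, k - s) for s in range(k + 1))
    t4 = sum(comb(m0, s) * comb(m1 - 1, k - 1 - s) for s in range(k))
    return (t1 * pi_r0 + t2 * pi_r0 * exp(beta_g)
            + t3 * pi_r1 * exp(beta_r) + t4 * pi_r1 * exp(beta_r + beta_g))

def bracket_eq11(m0, m1, k, pi_r0, beta_g, beta_r):
    """Bracket of eq. (11), after summing over s."""
    m, pi_r1 = m0 + m1, 1.0 - pi_r0
    a, b = comb(m - 1, k), comb(m - 1, k - 1)
    return (a * pi_r0 + b * pi_r0 * exp(beta_g)
            + a * pi_r1 * exp(beta_r) + b * pi_r1 * exp(beta_r + beta_g))

# Check a CM(1:2) design at k = 1 with illustrative parameter values.
same = isclose(bracket_eq10(1, 2, 1, 0.9, log(2), log(2)),
               bracket_eq11(1, 2, 1, 0.9, log(2), log(2)))
```

The agreement does not depend on how $m$ is split into $m_0$ and $m_1$, which is the reason the counter-matched expectation collapses to an expression in $m$ alone.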

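In the notation for the remainder of the paper, we suppress the dependence of $v_r = v(m,k)$ and $E_r(\beta) = E(m,k)$ on $m = |r|$.

Next, we prove the unbiasedness statement of Theorem 1. To establish this part of the theorem, it suffices to show that $B(\beta_G) = 0$, where $B(\beta_G)$ denotes the limit of expression (1) evaluated at $\beta_G$. An almost direct application of Lemma 1 with $T_r = B_r = G - E_r(\beta)$ demonstrates that $B(\beta_G)$ is equivalent to a homogeneous polynomial with coefficients given by

$$\begin{aligned}
\alpha^{*}_{m-k,k} \propto{}& \binom{m-1}{k}\pi_{0.}\left(-\frac{ke^{\beta_G}}{m-k+ke^{\beta_G}}\right) + \binom{m-1}{k-1}\pi_{0.}e^{\beta_G}\left(1-\frac{ke^{\beta_G}}{m-k+ke^{\beta_G}}\right) \\
&+ \binom{m-1}{k}\pi_{1.}e^{\beta_R}\left(-\frac{ke^{\beta_G}}{m-k+ke^{\beta_G}}\right) + \binom{m-1}{k-1}\pi_{1.}e^{\beta_R+\beta_G}\left(1-\frac{ke^{\beta_G}}{m-k+ke^{\beta_G}}\right).
\end{aligned}$$

Since $k\binom{m-1}{k} = (m-k)\binom{m-1}{k-1}$, the first two terms cancel, as do the last two. Hence $\alpha^{*}_{m-k,k} = 0$ for all $k$ and therefore $B(\beta_G) = 0$.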
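The cancellation can also be checked numerically. The following sketch evaluates the four-term expression for $\alpha^{*}_{m-k,k}$ up to its positive proportionality constant; the function name `alpha_star` and the parameter values are assumptions made for illustration.

```python
from math import comb, exp, log

def alpha_star(m, k, pi_r1, beta_g, beta_r):
    """Unnormalized coefficient of pi_.0^(m-k) pi_.1^k in the limiting score."""
    pi_r0 = 1.0 - pi_r1
    e_mk = k * exp(beta_g) / (m - k + k * exp(beta_g))   # E(m, k)
    return (comb(m - 1, k) * pi_r0 * (0 - e_mk)                          # case has R=0, G=0
            + comb(m - 1, k - 1) * pi_r0 * exp(beta_g) * (1 - e_mk)      # case has R=0, G=1
            + comb(m - 1, k) * pi_r1 * exp(beta_r) * (0 - e_mk)          # case has R=1, G=0
            + comb(m - 1, k - 1) * pi_r1 * exp(beta_r + beta_g) * (1 - e_mk))  # R=1, G=1

# Largest absolute coefficient over set sizes 2..6 for illustrative parameters.
worst = max(abs(alpha_star(m, k, 0.05, log(2), log(2) / 2))
            for m in range(2, 7) for k in range(1, m))
```

Under these assumptions `worst` should be zero up to floating-point rounding, for any prevalence of R and any $\beta_R$.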

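Finally, we prove the variance statement of Theorem 1. An application of Lemma 1 with $T_r = v_r = v(k)$ shows that the limit of expression (2) can be written as $\sum_{k=1}^{m-1} \alpha_{m-k,k}\, \pi_{.0}^{m-k}\pi_{.1}^{k}$, where the coefficients are given by:

$$\alpha_{m-k,k} = \frac{v(k)\left[\binom{m-1}{k}\pi_{0.} + \binom{m-1}{k-1}\pi_{0.}e^{\beta_G} + \binom{m-1}{k}\pi_{1.}e^{\beta_R} + \binom{m-1}{k-1}\pi_{1.}e^{\beta_R+\beta_G}\right]}{(\pi_{0.}+\pi_{1.}e^{\beta_R})(\pi_{.0}+\pi_{.1}e^{\beta_G})} \tag{12}$$

After simplification, (12) reduces to

$$\alpha_{m-k,k} = \frac{e^{\beta_G}}{\pi_{.0}+\pi_{.1}e^{\beta_G}}\,(m-1)\binom{m-2}{k-1}\frac{1}{m-k+ke^{\beta_G}}. \tag{13}$$

This completes the proof of the variance statement, since all the coefficients of the polynomial associated with expression (13) match those of the same monomials in the asymptotic variance $V_v^{SRS} = \lim_{n\to\infty} E[v_{\tilde{\mathcal{R}}_{SRS}}(\beta_G)]$ derived under SRS [10, 11]:

$$V_v^{SRS} = \frac{e^{\beta_G}}{\pi_{.0}+\pi_{.1}e^{\beta_G}}\,(m-1)\sum_{k=1}^{m-1}\binom{m-2}{k-1}\frac{\pi_{.0}^{m-k}\pi_{.1}^{k}}{m-k+ke^{\beta_G}}.$$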
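The reduction from (12) to (13) and the resulting SRS expression can be evaluated directly. The sketch below is an illustration with hypothetical function names; `pi_r1` and `pi_g1` stand for $\pi_{1.}$ and $\pi_{.1}$, and the parameter values are arbitrary.

```python
from math import comb, exp, isclose, log

def v(m, k, beta_g):
    """Information contribution v(m, k) of a set with k exposed members."""
    return (m - k) * k * exp(beta_g) / (m - k + k * exp(beta_g)) ** 2

def coef_eq12(m, k, pi_r1, pi_g1, beta_g, beta_r):
    pi_r0, pi_g0 = 1.0 - pi_r1, 1.0 - pi_g1
    bracket = (comb(m - 1, k) * pi_r0 + comb(m - 1, k - 1) * pi_r0 * exp(beta_g)
               + comb(m - 1, k) * pi_r1 * exp(beta_r)
               + comb(m - 1, k - 1) * pi_r1 * exp(beta_r + beta_g))
    denom = (pi_r0 + pi_r1 * exp(beta_r)) * (pi_g0 + pi_g1 * exp(beta_g))
    return v(m, k, beta_g) * bracket / denom

def coef_eq13(m, k, pi_g1, beta_g):
    pi_g0 = 1.0 - pi_g1
    return (exp(beta_g) / (pi_g0 + pi_g1 * exp(beta_g))
            * (m - 1) * comb(m - 2, k - 1) / (m - k + k * exp(beta_g)))

def v_srs(m, pi_g1, beta_g):
    """Limiting per-set information under SRS(m), summing the monomials of eq. (13)."""
    pi_g0 = 1.0 - pi_g1
    return sum(coef_eq13(m, k, pi_g1, beta_g) * pi_g0 ** (m - k) * pi_g1 ** k
               for k in range(1, m))

agree = isclose(coef_eq12(3, 1, 0.05, 0.3, log(2), log(2)),
                coef_eq13(3, 1, 0.3, log(2)))
info_per_set = v_srs(3, 0.3, log(2))
```

Because the R-dependent factor cancels in (12), `coef_eq13` takes no argument involving R, which is the algebraic content of the variance statement.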

Consequently, we have shown that under the conditions of Theorem 1 the unweighted analysis of the data is a proper statistical tool for analysis of counter-matched data. More importantly, there is a substantial gain of statistical efficiency associated with the implementation of such unweighted analysis under these assumptions and that corroborates previous results regarding case-control data counter-matched on an exposure independent of the covariate of interest [3].

3 Results

To quantify the gains in efficiency attributable to the proposed unweighted analysis, we implemented a large-scale simulation study, varying the counter-matched design CM($m_0$:$m_1$), the prevalences E(G) and E(R), and the effect sizes $\beta_G$ and $\beta_R$ over an extensive list of possible values. Our results show that, for all counter-matching designs, the gain in efficiency is inversely related to the prevalence of the counter-matching variable R. For a rare exposure, E(R)=0.05, the efficiency gains exceed 80%. The efficiency gains are still substantial for E(R)≤0.2, but the advantages vanish for prevalences of R greater than 0.4. In particular, for the CM(1:2) design, Table 1 displays the efficiency gains for all scenarios defined by varying E(G) over 0.1 and 0.3, E(R) over 0.05, 0.1 and 0.4, $\beta_G$ over log(2)/2 and log(2), and $\beta_R$ over log(2)/2 and log(2). Variances were empirically estimated using 1,000 datasets per scenario; each dataset contained 200 counter-matched trios, and each trio was sampled from a risk set of size 500. These results highlight the gains in statistical efficiency of the proposed naive analysis over the weighted analysis in all settings.

Table 1:

Theorem 1 numerical example. ARE results for independent G and R showing the efficiency gains of the unweighted analysis of 1:2 counter-matched data over the proper, weighted analysis.

$\beta_G$	$\beta_R$	E(R)	E(G)	$\widehat{\mathrm{Var}}(\hat{\beta}_G^{CM(1:2)})$*	$\widehat{\mathrm{Var}}(\hat{\beta}^{SRS(3)})$*	ARE*
log(2)/2	log(2)/2	0.05	0.10	0.36	0.08	0.21
log(2)/2	log(2)/2	0.10	0.10	0.22	0.07	0.34
log(2)/2	log(2)/2	0.40	0.10	0.07	0.07	0.92
log(2)/2	log(2)/2	0.05	0.30	0.20	0.03	0.17
log(2)/2	log(2)/2	0.10	0.30	0.10	0.03	0.35
log(2)/2	log(2)/2	0.40	0.30	0.04	0.04	0.88
log(2)/2	log(2)	0.05	0.10	0.28	0.07	0.25
log(2)/2	log(2)	0.10	0.10	0.16	0.07	0.45
log(2)/2	log(2)	0.40	0.10	0.08	0.07	0.96
log(2)/2	log(2)	0.05	0.30	0.15	0.03	0.22
log(2)/2	log(2)	0.10	0.30	0.08	0.04	0.47
log(2)/2	log(2)	0.40	0.30	0.04	0.04	0.96
log(2)	log(2)/2	0.05	0.10	0.36	0.07	0.19
log(2)	log(2)/2	0.10	0.10	0.17	0.07	0.39
log(2)	log(2)/2	0.40	0.10	0.07	0.07	0.92
log(2)	log(2)/2	0.05	0.30	0.19	0.03	0.18
log(2)	log(2)/2	0.10	0.30	0.09	0.03	0.34
log(2)	log(2)/2	0.40	0.30	0.04	0.03	0.92
log(2)	log(2)	0.05	0.10	0.25	0.06	0.24
log(2)	log(2)	0.10	0.10	0.15	0.07	0.45
log(2)	log(2)	0.40	0.10	0.07	0.07	0.97
log(2)	log(2)	0.05	0.30	0.14	0.03	0.24
log(2)	log(2)	0.10	0.30	0.07	0.04	0.47
log(2)	log(2)	0.40	0.30	0.04	0.03	0.98

Note: * Empirically estimated via 1,000 datasets containing 200 counter-matched trios, with each trio sampled from a risk set of size 500.
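A scaled-down version of such a simulation can be sketched as follows. This is not the authors' code: the function names, the Newton-Raphson fitting routine with a clipped step, the default of 60 datasets, and the seed are all assumptions made here. The weighted analysis is assumed to use the usual counter-matching weights $n_{l.}/m_l$ for a subject in stratum $l$ of R and to include both G and R in the model.

```python
import numpy as np

def simulate_set(rng, n, p_g, p_r, beta_g, beta_r, m0=1, m1=2):
    """Draw one risk set of size n and one CM(m0:m1) set from it (case first)."""
    g = rng.binomial(1, p_g, n)              # G and R generated independently
    r = rng.binomial(1, p_r, n)
    rate = np.exp(beta_g * g + beta_r * r)   # multiplicative main effects model
    case = rng.choice(n, p=rate / rate.sum())
    need = {0: m0, 1: m1}
    need[int(r[case])] -= 1                  # the case occupies a slot in its stratum
    members = [case]
    for stratum, count in need.items():
        pool = np.flatnonzero((r == stratum) & (np.arange(n) != case))
        members.extend(rng.choice(pool, size=count, replace=False))
    members = np.array(members)
    n_r = np.array([n - r.sum(), r.sum()])   # stratum sizes in the risk set
    m_r = np.array([m0, m1])
    weights = n_r[r[members]] / m_r[r[members]]
    return g[members], r[members], np.log(weights)

def fit_conditional_logistic(z, log_w, n_iter=30):
    """Newton-Raphson for the conditional likelihood with log-weight offsets.

    z: (sets, m, p) covariates with the case in position 0; log_w: (sets, m).
    """
    beta = np.zeros(z.shape[2])
    for _ in range(n_iter):
        eta = z @ beta + log_w
        p = np.exp(eta - eta.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z_bar = np.einsum("sm,smp->sp", p, z)
        score = (z[:, 0, :] - z_bar).sum(axis=0)
        info = np.einsum("sm,smp,smq->pq", p, z, z) - z_bar.T @ z_bar
        # Step clipping is a simple safeguard added here, not part of the method.
        beta = beta + np.clip(np.linalg.solve(info, score), -1.0, 1.0)
    return beta

def simulate_variances(n_datasets=60, n_sets=200, n=500, p_g=0.3, p_r=0.05,
                       beta_g=np.log(2) / 2, beta_r=np.log(2) / 2, seed=1):
    """Empirical variances of the G estimate: (unweighted, weighted)."""
    rng = np.random.default_rng(seed)
    unweighted, weighted = [], []
    for _ in range(n_datasets):
        sets = [simulate_set(rng, n, p_g, p_r, beta_g, beta_r) for _ in range(n_sets)]
        g = np.array([s[0] for s in sets], dtype=float)
        r = np.array([s[1] for s in sets], dtype=float)
        log_w = np.array([s[2] for s in sets])
        # Unweighted: G only, no offsets. Weighted: G and R with log-weight offsets.
        unweighted.append(fit_conditional_logistic(g[:, :, None], np.zeros_like(g))[0])
        weighted.append(fit_conditional_logistic(np.stack([g, r], axis=2), log_w)[0])
    return np.var(unweighted, ddof=1), np.var(weighted, ddof=1)
```

Calling `simulate_variances()` returns the two empirical variances for one scenario; their ratio plays the role of the ARE column in Table 1, although with far fewer replicates than the 1,000 used there the values will be noisier.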

Corollary 1

The condition of independence between G and R is crucial to the correctness of the conclusions in Theorem 1. Deviations from this assumption induce bias in all $m_0{:}m_1$ counter-matched designs.

Proof. As mentioned before, we need to solve eq. (1) with respect to β. Without the simplification available under independence, it is easily seen that this is an expression of the form

$$\sum_{k=1}^{m-1} \phi_k \frac{x}{m-k+kx} = a, \tag{14}$$

where $x = e^{\beta}$. All denominators in the last expression are nonzero since $x > 0$, and therefore eq. (14) is equivalent to a polynomial equation of degree $m-1$. General polynomial equations of degree five or higher have no closed-form solution in radicals, so closed-form solutions are not available in general for $m \ge 6$. However, simple expressions for the bias can be obtained for the common scenarios of 1:1, 2:1, and 1:2 counter-matched designs ($m=2$, $m=3$ and $m=3$, respectively), and straightforward numerical methods handle the remaining cases. For instance, for the CM(1:1) case, if γ and δ denote the sensitivity $P(R=1\mid G=1)$ and specificity $P(R=0\mid G=0)$, we obtain:

$$x = \frac{\gamma\delta\, e^{\beta_R+\beta_G}\pi_{1.} + (1-\gamma)(1-\delta)\, e^{\beta_G}\pi_{0.}}{\gamma\delta\, \pi_{0.} + (1-\gamma)(1-\delta)\, e^{\beta_R}\pi_{1.}},$$

which can be rewritten in terms of the prevalence of G by substituting $\pi_{1.} = \gamma\pi_{.1} + (1-\delta)\pi_{.0}$. Under independence, $\gamma = \pi_{1.}$ and $\delta = \pi_{0.}$, and, in agreement with Theorem 1, eq. (14) produces the unbiased estimate $x = e^{\beta_G}$. We calculated the bias for a range of values of γ, δ and $\pi_{1.}$ and for several counter-matched designs, with effect sizes fixed at $\beta_G = \beta_R = \log 2$; the results are shown in Table 2. As stated in Corollary 1, deviations from independence introduce bias in all designs, but for moderate deviations the actual levels of bias are small.

Table 2:

Corollary 1 numerical example. Bias as a function of the correlation between R and G induced by sensitivity, specificity and their prevalences for various counter-matching designs.

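γ	δ	E(G)	E(R)	Bias($\hat{\beta}_G$), CM(1:1)	Bias($\hat{\beta}_G$), CM(1:2)	Bias($\hat{\beta}_G$), CM(2:1)	Bias($\hat{\beta}_G$), CM(1:3)	Bias($\hat{\beta}_G$), CM(2:2)	Bias($\hat{\beta}_G$), CM(3:1)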
0.05	0.85	0.1	0.14	0.02	0.01	0.01	0	0	0
0.05	0.9	0.1	0.095	0.02	0	0.01	0	0	0
0.05	0.95	0.1	0.05	0	0	0	0	0	0
0.1	0.85	0.1	0.145	0.01	0	0	0	0	0
0.1	0.9	0.1	0.1	0	0	0	0	0	0
0.1	0.95	0.1	0.055	−0.04	0	−0.02	0	0	−0.02
0.15	0.85	0.1	0.15	0	0	0	0	0	0
0.15	0.9	0.1	0.105	−0.02	0	−0.01	0	0	0
0.15	0.95	0.1	0.06	−0.07	−0.01	−0.04	0	0	−0.03
0.15	0.75	0.2	0.23	0.01	0.01	0	0	0	−0.01
0.15	0.8	0.2	0.19	0.01	0	0	0	0	0
0.15	0.85	0.2	0.15	0	0	0	0	0	0
0.2	0.75	0.2	0.24	0.01	0	0	0	0	0
0.2	0.8	0.2	0.2	0	0	0	0	0	0
0.2	0.85	0.2	0.16	−0.01	0	0	0	0	0
0.25	0.75	0.2	0.25	0	0	0	0	0	0
0.25	0.8	0.2	0.21	−0.01	0	0	0	0	0
0.25	0.85	0.2	0.17	−0.03	−0.01	0	0	0	0

Note: Here $\beta_G = \log(2)$ and $\beta_R = \log(2)$.
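The CM(1:1) closed form given in the proof of Corollary 1 is easy to evaluate. The sketch below only codes that displayed expression, with γ and δ read as $P(R=1\mid G=1)$ and $P(R=0\mid G=0)$; the function name is hypothetical, and the sketch does not attempt to reproduce the entries of Table 2.

```python
import math

def cm11_limiting_beta(gamma, delta, pi_r1, beta_g, beta_r):
    """log(x) from the CM(1:1) expression for the limiting unweighted estimate.

    gamma : P(R=1 | G=1), delta : P(R=0 | G=0), pi_r1 : prevalence of R.
    """
    pi_r0 = 1.0 - pi_r1
    num = (gamma * delta * math.exp(beta_r + beta_g) * pi_r1
           + (1 - gamma) * (1 - delta) * math.exp(beta_g) * pi_r0)
    den = (gamma * delta * pi_r0
           + (1 - gamma) * (1 - delta) * math.exp(beta_r) * pi_r1)
    return math.log(num / den)

# Under independence gamma = pi_1. and delta = pi_0., so the limit equals beta_G.
limit_indep = cm11_limiting_beta(0.1, 0.9, 0.1, math.log(2), math.log(2))
```

The difference between the returned value and $\beta_G$ is the asymptotic bias implied by the expression for the chosen γ, δ and prevalence.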

The conclusions for correlated covariates shown in Corollary 1 are presented for completeness of exposition and reflect scenarios that are not present in the WECARE study that inspired this work. For example, the case-control strata were initially counter-matched on a surrogate measure for radiation exposure. The counter-matching was used since the variable of interest was the true radiation exposure which was expected to be highly correlated with its surrogate measure and that, as previously mentioned, is exactly the setting in which the counter-matching outperforms simple random sampling. Later, a DNA repair gene became the variable of interest and there is no reason to believe that the assumption of independence could be violated. In fact, any covariate but the true radiation exposure is likely to be independent of the surrogate measure used as a counter-matching variable. Therefore, in similar settings, the full efficiency advantages of our proposed method are readily available.

We also assessed the performance of the proposed method when the underlying disease model is misspecified through omission of an R×G interaction term. Values of the bias corresponding to a range of values of E(R), E(G) and $\beta_{R\times G}$ and to several counter-matching designs, with effect sizes fixed at $\beta_G = \beta_R = \log 2$, are shown in Table 3. For all designs, even moderately large effect sizes of the omitted interaction term introduced only small levels of bias.

Table 3:

Theorem 1 disease model misspecification numerical example. Bias as a function of model misspecification via omission of a R×G interaction term.

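E(G)	E(R)	$\beta_{R\times G}$	Bias($\hat{\beta}_G$), CM(1:1)	Bias($\hat{\beta}_G$), CM(1:2)	Bias($\hat{\beta}_G$), CM(2:1)	Bias($\hat{\beta}_G$), CM(1:3)	Bias($\hat{\beta}_G$), CM(2:2)	Bias($\hat{\beta}_G$), CM(3:1)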
0.1	0.1	−2.77	−0.01	0	−0.01	0	0	−0.01
0.1	0.1	−0.69	−0.01	0	−0.01	0	0	−0.01
0.1	0.1	0.69	0.01	0	0.01	0	0	0.01
0.1	0.1	1.39	0.03	0	0.04	0	0	0.04
0.1	0.1	2.77	0.1	0.01	0.13	0	0.02	0.14
0.1	0.2	−2.77	−0.02	−0.01	−0.02	0	−0.01	−0.02
0.1	0.2	−1.39	−0.01	0	−0.02	0	0	−0.02
0.1	0.2	−0.69	−0.01	0	−0.01	0	0	−0.01
0.1	0.2	0.69	0.02	0.01	0.02	0	0	0.02
0.1	0.2	1.39	0.05	0.01	0.05	0	0.01	0.05
0.1	0.2	2.77	0.14	0.04	0.17	0.01	0.04	0.16
0.2	0.1	−2.77	−0.02	0	−0.02	0	0	−0.02
0.2	0.1	−1.39	−0.01	0	−0.02	0	0	−0.02
0.2	0.1	−0.69	−0.01	0	−0.01	0	0	−0.01
0.2	0.1	0.69	0.02	0	0.02	0	0	0.02
0.2	0.1	1.39	0.04	0.01	0.05	0	0.01	0.06
0.2	0.1	2.77	0.13	0.02	0.16	0	0.02	0.17
0.2	0.2	−2.77	−0.03	−0.01	−0.04	0	−0.01	−0.03
0.2	0.2	−1.39	−0.02	−0.01	−0.03	0	−0.01	−0.03
0.2	0.2	−0.69	−0.02	0	−0.02	0	0	−0.02
0.2	0.2	0.69	0.03	0.01	0.03	0	0.01	0.03
0.2	0.2	1.39	0.07	0.02	0.08	0	0.02	0.07
0.2	0.2	2.77	0.17	0.05	0.19	0.01	0.04	0.18

Note: Here $\beta_G = \log(2)$ and $\beta_R = \log(2)$.

Lastly, simultaneous violations of both conditions, that is, dependence between R and G together with omission of an R×G interaction term from the disease rate model, yielded results similar to those presented in Tables 2 and 3 and are therefore not shown here.

4 Analysis of deleterious BRCA1/2 mutations and contralateral breast cancer

To illustrate the methods, we consider estimation of the rate ratio for carriers of deleterious BRCA1/2 mutations using data from the WECARE study (Gene Susceptibility to Radiation Exposure for Second Breast Cancer Risk). Since radiation exposure is associated with UBC, causative genes and radiation exposure prior to UBC could be correlated in UBC patients. However, radiation therapy (RT), the R variable in this analysis, occurs after UBC and so is unlikely to be correlated with G. Another possibility is that characteristics of the UBC, such as histology, stage, and size, could be influenced by genotype and also be factors in treatment decisions, thus leading to correlation. However, the WECARE UBC cohort is fairly uniform in stage of disease, since UBC was defined as a first primary invasive breast cancer that had not spread beyond the regional lymph nodes at diagnosis, and we found no evidence of correlation between RT and histology or size of the UBC. Thus, we feel that independence is a reasonable assumption. The weighted analysis, including R and G (BRCA1/2 mutation) in the model, yielded the G estimates $\hat{\beta} = 1.3$, s.e.($\hat{\beta}$) = 0.19 (RR = 3.8, 95% CI = 2.6–5.4), while the unweighted analysis including just G yielded $\hat{\beta} = 1.2$, s.e.($\hat{\beta}$) = 0.17 (RR = 3.3, 95% CI = 2.4–4.6), representing about a 20% reduction in the variance of $\hat{\beta}$ from using the unweighted analysis. In practice, once a G main effect is detected in an unweighted analysis, analysis of the R×G interaction would be done using the counter-matched weighted likelihood.
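The summary quantities quoted above follow from the estimates and standard errors by standard Wald arithmetic, sketched below. The helper names are hypothetical, and because the inputs are the rounded values printed in the text, the recomputed rate ratios and interval limits can differ slightly from the reported ones.

```python
import math

def rate_ratio_ci(beta_hat, se, z=1.96):
    """Rate ratio and Wald 95% confidence limits from a log rate ratio and its s.e."""
    return (math.exp(beta_hat),
            math.exp(beta_hat - z * se),
            math.exp(beta_hat + z * se))

def variance_reduction(se_new, se_reference):
    """Proportional reduction in variance when moving from se_reference to se_new."""
    return 1.0 - (se_new / se_reference) ** 2

weighted = rate_ratio_ci(1.3, 0.19)       # weighted analysis, rounded inputs
unweighted = rate_ratio_ci(1.2, 0.17)     # unweighted analysis, rounded inputs
reduction = variance_reduction(0.17, 0.19)
```

With these inputs `reduction` is close to 0.20, matching the variance reduction of about 20% stated in the text.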

5 Discussion

We mentioned four large and impactful studies that have adopted the informative counter-matching sampling design because of the gains in statistical efficiency over simple random sampling that it provides. As their names suggest, these studies were initially designed to assess the effect of a particular exposure of interest. However, interest in new potential lifestyle, environmental and genetic risk factors necessitates second-stage analyses of these datasets with respect to new covariates. In the likely case that the second-stage covariates are independent of the counter-matching variable, the proper weighted analysis will lose efficiency compared to simple random sampling. To remove the disadvantageous properties of counter-matched data when they are re-used in analyses of covariates independent of the counter-matching variable, we proposed a naive unweighted analysis of the effect of a dichotomous variable under the condition of a multiplicative disease rate model. We showed that this unweighted analysis produces an unbiased estimate of the effect size and, importantly, that the variance of the estimator is smaller than that of the proper weighted analysis estimator. The gains occur for all counter-matching designs and are inversely related to the prevalence of the counter-matching variable: they exceed 80% for a prevalence of 0.05 and vanish for prevalences above 0.4. We expect the proposed method to enable several current and future studies to “recover” statistical efficiency in second-stage analyses. Future work needs to explore possible extensions of our result to settings where the counter-matching variable is categorical with more than two levels. A second important potential application of the unweighted analysis of counter-matched case-control data is at the design stage of a study. In this setting, researchers have to decide (among many other things) whether or not to counter-match on some variable(s), especially if there is the potential to consider a broad range of exposures down the road. Thus, if a future comparison of an unweighted analysis of counter-matched case-control data with a standard analysis of data arising from a nested case-control study shows comparable power for exposures that are independent of the (proposed) counter-matching variable, then the message would be that one “wins” on all fronts: counter-matching coupled with an appropriate weighted analysis gives efficiency gains for the primary exposure of interest, and there is no efficiency loss for other exposures if an unweighted analysis is performed.

Funding statement: This work was supported by NCI grant CA42949 and NIEHS grant 5P30 ES07048. The WECARE study is supported by NCI grants CA97397, CA83178, and CA98438. We thank the WECARE Study Group for making the BRCA1/2 and radiation data available.

References

1. Langholz B, Borgan O. Counter-matching: a stratified nested case-control sampling method. Biometrika 1995;82:69–79. doi:10.1093/biomet/82.1.69

2. Langholz B, Clayton D. Sampling strategies in nested case-control studies. Envir Health Perspect 1994;102:47–51. doi:10.1289/ehp.94102s847

3. Langholz B, Goldstein L. Risk set sampling in epidemiologic cohort studies. Statist Sci 1996;11:35–53. doi:10.1214/ss/1032209663

4. Bernstein J, Langholz B, Haile R. Study design: evaluating gene-environment interactions in the etiology of breast cancer – the WECARE study. Breast Cancer Res 2004;6:199–214. doi:10.1186/bcr771

5. Steenland K, Deddens JA. Increased precision using counter-matching in nested case-control studies. Epidemiology 1997;8:238–42. doi:10.1097/00001648-199705000-00002

6. Cologne JB, Sharp GB, Neriishi K, Verkasalo PK, Land CE, Nakachi K. Improving the efficiency of nested case-control studies of interaction by selecting controls using counter matching on exposure. Int J Epidemiol 2004;33:485–92. doi:10.1093/ije/dyh097

7. Langholz B, Goldstein L. Conditional logistic analysis of case-control studies with complex sampling. Biostatistics 2001;2:63–84. doi:10.1093/biostatistics/2.1.63

8. Oakes D. Survival times: aspects of partial likelihood. Int Statist Rev 1981;49:235–64. doi:10.2307/1402606

9. Langholz B. Use of cohort information in the design and analysis of case-control studies. Scand J Stat 2007;34:120–36. doi:10.1111/j.1467-9469.2006.00548.x

10. Breslow N, Lubin J, Marek P, Langholz B. Multiplicative models and cohort analysis. J Amer Statist Assoc 1983;78:1–12. doi:10.1080/01621459.1983.10477915

11. Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the Cox regression model. Ann Statist 1992;20:1669–2195. doi:10.1214/aos/1176348895

Published Online: 2015-8-26

Published in Print: 2015-11-1
