Abstract
Informative sampling based on counter-matching risk set subjects on an exposure correlated with a variable of interest has been shown to be an efficient alternative to simple random sampling; however, the opposite is true when the two covariates are uncorrelated. Thus, the counter-matching design yields substantial gains in statistical efficiency over simple random sampling at a first stage of analyses, focused by design on variables correlated with the counter-matching variable, but loses efficiency at a second stage of analyses aimed at variables that are independent of the counter-matching variable and were not conceived as part of the initial study. To recover efficiency in such second-stage analyses, we consider a naive analysis of the effect of a dichotomous covariate on the disease rates in the population that ignores the underlying counter-matching sampling design. We derive analytical expressions for the bias and variance and show that, when the counter-matching variable and the new dichotomous variable of interest are uncorrelated and a multiplicative main effects model holds, such an analysis is advantageous over the standard “weighted” approach, especially when the counter-matching variable is rare; in such scenarios the efficiency gains exceed 80%. Moreover, we consider all possible conceptual violations of the required assumptions and show that moderate departures from these requirements lead to negligible levels of bias; numerical values for the bias under common scenarios are provided. The method is illustrated via an analysis of BRCA1/2 deleterious mutations in the WECARE study of second breast cancer, which was counter-matched on radiation treatment.
1 Introduction
The counter-matching design is an exposure-stratified nested case-control study method that incorporates exposure information, available on all cohort subjects, into the case-control sample. Every counter-matching design, say CM(
Previous studies have shown that, in general, counter-matching is statistically more efficient than simple random sampling of controls for estimating the effects of factors well correlated with the counter-matching variable, but less efficient for variables that are not well correlated with it [1–3]. For example, in the WECARE study of radiation, genes, and asynchronous contralateral breast cancer (CBC), unilateral breast cancer (UBC) controls were 1:2 counter-matched, CM(1:2), on radiation treatment status to each CBC case. The counter-matched design was shown to be more efficient than simple random sampling of two controls, SRS(3), for estimation of radiation main effects and gene by radiation interactions [4], but with some loss in efficiency for the main effect of gene. Let R and G be indicator variables for radiation treatment and gene carrier status, with R being the counter-matching variable and G assessed only on the sample. Then, in standard proportional hazards models, counter-matching is more efficient than random sampling for assessing the main effect of R and
Several other important studies besides WECARE, such as the Crystalline Silica Exposure and Silicosis in Gold Miners [5], the Radiation, Hormones, and Breast Cancer of Japanese Atomic Bomb Survivors [6], and the Early Asthma Risk Study (EARS) of In Utero and Early Life Exposures and Asthma [7], have implemented the counter-matching sampling design. They did so to exploit its gains in statistical efficiency over simple random sampling at a first stage of analyses of the initial exposures of interest, by counter-matching on covariates correlated with those exposures. However, at a later point the effects of new lifestyle, environmental and genetic risk factors may become research objectives, and second-stage analyses of these datasets with respect to the new covariates will be required. In the likely case that the second-stage covariates are independent of the counter-matching variable, the proper weighted analysis will lose efficiency relative to an analysis of data obtained under simple random sampling. Under mild conditions, the approach proposed in this work will enable these and future counter-matched studies to avoid this loss of efficiency when the data are re-analyzed with respect to covariates independent of the counter-matching variable.
In this paper, we consider dichotomous covariates G and R and explore the estimation of the effect of G from data
2 Methods
We rely on the nested case-control model for individually matched case-control studies in which controls are sampled from the risk sets at the failure times [3, 8]. While the counting process formulation would fully account for the failure time component of this problem, it is the sampling that is relevant here; to keep the notation to a minimum, we analyze the contribution from a single case-control set and drop reference to time [9]. Let
the “unweighted” likelihood contribution as from a SRS(m), that includes only G in the model and
The following theorem provides the essential part of the results in this paper.
Theorem 1. Let G and R be independent binary covariates that have multiplicative effects on the disease rates of the population. Assume that the data have been obtained through an implementation of a
Proof. By definition, the disease risks and likelihood contributions from strata are given by
where
where
Therefore, the proofs of both the unbiasedness and variance statements in Theorem 1 are predicated on an evaluation of expressions of the type:
where
Let
A change in the order of summation allows us to rewrite eq. (5) in the following way,
After an explicit enumeration of all possible ways of obtaining samples
where for brevity we use
We are interested in assessing the following limit
Further, under the assumption of independence between R and G,
Now, we view the expression (9) as a homogeneous polynomial of the type
Lastly, using the combinatorial equality
This completes the proof of Lemma 1.
In the notation for the remainder of the paper, we suppress the dependence of
Next, we prove the unbiasedness statement of Theorem 1. To establish this part of the theorem, it suffices to show that
It is easily seen that
Finally, we prove the variance statement of Theorem 1. An application of Lemma 1 with
However, after trivial simplifications (12) reduces to
This completes the proof of the variance statement since all the coefficients of the polynomial associated with expression (13) match those in front of the same monomials in the asymptotic variance
Consequently, we have shown that under the conditions of Theorem 1 the unweighted analysis of the data is a proper statistical tool for analysis of counter-matched data. More importantly, there is a substantial gain of statistical efficiency associated with the implementation of such unweighted analysis under these assumptions and that corroborates previous results regarding case-control data counter-matched on an exposure independent of the covariate of interest [3].
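The contrast between the two analyses can be made concrete with a minimal sketch of the likelihood contribution from a single sampled set. The sketch assumes the usual counter-matching weights n_l/m_l (number of risk set members in stratum l divided by the number sampled from that stratum) and a multiplicative model with log rate ratios `beta_g` and `beta_r`. The function names, the example trio and the risk set composition are hypothetical and are not taken from the paper.

```python
import math

def cm_weights(sample, n_by_stratum):
    """Counter-matching weights n_l / m_l for each sampled subject.

    sample: list of (g, r) pairs, case first.
    n_by_stratum: number of risk set members in each R stratum (assumed known).
    """
    m_by_stratum = {}
    for _, r in sample:
        m_by_stratum[r] = m_by_stratum.get(r, 0) + 1
    return [n_by_stratum[r] / m_by_stratum[r] for _, r in sample]

def weighted_contribution(sample, n_by_stratum, beta_g, beta_r):
    """Weighted contribution: model includes both G and R, with sampling weights."""
    weights = cm_weights(sample, n_by_stratum)
    terms = [w * math.exp(beta_g * g + beta_r * r)
             for (g, r), w in zip(sample, weights)]
    return terms[0] / sum(terms)

def unweighted_contribution(sample, beta_g):
    """Unweighted contribution: treats the set as a simple random sample, G only."""
    terms = [math.exp(beta_g * g) for g, _ in sample]
    return terms[0] / sum(terms)

# Hypothetical trio: exposed carrier case, two unexposed non-carrier controls,
# drawn from a risk set with 50 exposed and 450 unexposed subjects.
sample = [(1, 1), (0, 0), (0, 0)]
n_by_stratum = {1: 50, 0: 450}
b = math.log(2)
print(weighted_contribution(sample, n_by_stratum, b, b))
print(unweighted_contribution(sample, b))
```

For this trio the weights are 50 for the exposed case and 225 for each unexposed control, so the weighted contribution is 200/650, while the unweighted contribution is 2/4. Summing the logarithms of such contributions over all sampled sets gives the log-likelihood that would be maximized in beta_g.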
3 Results
In an effort to quantify the gains in efficiency attributable to the advantageous properties of the proposed unweighted analysis, we implemented a large scale simulation study by varying counter-matched designs CM(
Theorem 1 numerical example. ARE results for independent G and R showing the efficiency gains of the unweighted analysis of 1:2 counter-matched data over the proper, weighted analysis.
log(2)/2 | log(2)/2 | 0.05 | 0.10 | 0.36 | 0.08 | 0.21 |
log(2)/2 | log(2)/2 | 0.10 | 0.10 | 0.22 | 0.07 | 0.34 |
log(2)/2 | log(2)/2 | 0.40 | 0.10 | 0.07 | 0.07 | 0.92 |
log(2)/2 | log(2)/2 | 0.05 | 0.30 | 0.20 | 0.03 | 0.17 |
log(2)/2 | log(2)/2 | 0.10 | 0.30 | 0.10 | 0.03 | 0.35 |
log(2)/2 | log(2)/2 | 0.40 | 0.30 | 0.04 | 0.04 | 0.88 |
log(2)/2 | log(2) | 0.05 | 0.10 | 0.28 | 0.07 | 0.25 |
log(2)/2 | log(2) | 0.10 | 0.10 | 0.16 | 0.07 | 0.45 |
log(2)/2 | log(2) | 0.40 | 0.10 | 0.08 | 0.07 | 0.96 |
log(2)/2 | log(2) | 0.05 | 0.30 | 0.15 | 0.03 | 0.22 |
log(2)/2 | log(2) | 0.10 | 0.30 | 0.08 | 0.04 | 0.47 |
log(2)/2 | log(2) | 0.40 | 0.30 | 0.04 | 0.04 | 0.96 |
log(2) | log(2)/2 | 0.05 | 0.10 | 0.36 | 0.07 | 0.19 |
log(2) | log(2)/2 | 0.10 | 0.10 | 0.17 | 0.07 | 0.39 |
log(2) | log(2)/2 | 0.40 | 0.10 | 0.07 | 0.07 | 0.92 |
log(2) | log(2)/2 | 0.05 | 0.30 | 0.19 | 0.03 | 0.18 |
log(2) | log(2)/2 | 0.10 | 0.30 | 0.09 | 0.03 | 0.34 |
log(2) | log(2)/2 | 0.40 | 0.30 | 0.04 | 0.03 | 0.92 |
log(2) | log(2) | 0.05 | 0.10 | 0.25 | 0.06 | 0.24 |
log(2) | log(2) | 0.10 | 0.10 | 0.15 | 0.07 | 0.45 |
log(2) | log(2) | 0.40 | 0.10 | 0.07 | 0.07 | 0.97 |
log(2) | log(2) | 0.05 | 0.30 | 0.14 | 0.03 | 0.24 |
log(2) | log(2) | 0.10 | 0.30 | 0.07 | 0.04 | 0.47 |
log(2) | log(2) | 0.40 | 0.30 | 0.04 | 0.03 | 0.98 |
Note: *Empirically estimated via 1000 datasets containing 200 counter-matched trios with each trio sampled from a risk set of size 500.
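The sampling step behind such a simulation can be sketched as follows. This is an illustration only, not the authors' simulation code: the allocation `{0: 2, 1: 1}` (two unexposed and one exposed member per trio, counting the case) is an assumed reading of a three-member counter-matched design, and the function name and risk set composition are hypothetical. The risk set size of 500 follows the table note.

```python
import random

def sample_counter_matched(risk_set, case_idx, m_by_stratum, rng):
    """Draw one counter-matched set from a risk set.

    risk_set: list of R values (0 or 1), one per subject at risk.
    case_idx: index of the case within risk_set.
    m_by_stratum: total number of set members wanted from each R stratum,
                  counting the case (assumed allocation).
    Returns the indices of the sampled set, case first.
    """
    case_r = risk_set[case_idx]
    chosen = [case_idx]
    for stratum, m in m_by_stratum.items():
        # Controls eligible in this stratum: everyone except the case.
        pool = [i for i, r in enumerate(risk_set)
                if r == stratum and i != case_idx]
        # The case already fills one slot in its own stratum.
        need = m - (1 if stratum == case_r else 0)
        chosen.extend(rng.sample(pool, need))
    return chosen

rng = random.Random(0)
risk_set = [1] * 50 + [0] * 450          # hypothetical: 10% exposed
trio = sample_counter_matched(risk_set, 0, {0: 2, 1: 1}, rng)
print([risk_set[i] for i in trio])       # R values of case and controls
```

Whatever the exposure status of the case, the sampled set has the same composition with respect to R, which is what distinguishes counter-matching from simple random sampling of controls.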
The condition of independence between G and R is crucial to the correctness of the conclusions in Theorem 1. Deviations from this assumption induce bias in all
Proof. As mentioned before, we need to solve eq. (1) with respect to
where
which can be rewritten in terms of the prevalence of G by substituting
Corollary 1 numerical example. Bias as a function of the correlation between R and G induced by sensitivity, specificity and their prevalences for various counter-matching designs.
0.05 | 0.85 | 0.1 | 0.14 | 0.02 | 0.01 | 0.01 | 0 | 0 | 0 |
0.05 | 0.9 | 0.1 | 0.095 | 0.02 | 0 | 0.01 | 0 | 0 | 0 |
0.05 | 0.95 | 0.1 | 0.05 | 0 | 0 | 0 | 0 | 0 | 0 |
0.1 | 0.85 | 0.1 | 0.145 | 0.01 | 0 | 0 | 0 | 0 | 0 |
0.1 | 0.9 | 0.1 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
0.1 | 0.95 | 0.1 | 0.055 | −0.04 | 0 | −0.02 | 0 | 0 | −0.02 |
0.15 | 0.85 | 0.1 | 0.15 | 0 | 0 | 0 | 0 | 0 | 0 |
0.15 | 0.9 | 0.1 | 0.105 | −0.02 | 0 | −0.01 | 0 | 0 | 0 |
0.15 | 0.95 | 0.1 | 0.06 | −0.07 | −0.01 | −0.04 | 0 | 0 | −0.03 |
0.15 | 0.75 | 0.2 | 0.23 | 0.01 | 0.01 | 0 | 0 | 0 | −0.01 |
0.15 | 0.8 | 0.2 | 0.19 | 0.01 | 0 | 0 | 0 | 0 | 0 |
0.15 | 0.85 | 0.2 | 0.15 | 0 | 0 | 0 | 0 | 0 | 0 |
0.2 | 0.75 | 0.2 | 0.24 | 0.01 | 0 | 0 | 0 | 0 | 0 |
0.2 | 0.8 | 0.2 | 0.2 | 0 | 0 | 0 | 0 | 0 | 0 |
0.2 | 0.85 | 0.2 | 0.16 | −0.01 | 0 | 0 | 0 | 0 | 0 |
0.25 | 0.75 | 0.2 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 |
0.25 | 0.8 | 0.2 | 0.21 | −0.01 | 0 | 0 | 0 | 0 | 0 |
0.25 | 0.85 | 0.2 | 0.17 | −0.03 | −0.01 | 0 | 0 | 0 | 0 |
Note: *Here
The conclusions for correlated covariates shown in Corollary 1 are presented for completeness of exposition and reflect scenarios that are not present in the WECARE study that inspired this work. For example, the case-control strata were initially counter-matched on a surrogate measure for radiation exposure. The counter-matching was used since the variable of interest was the true radiation exposure which was expected to be highly correlated with its surrogate measure and that, as previously mentioned, is exactly the setting in which the counter-matching outperforms simple random sampling. Later, a DNA repair gene became the variable of interest and there is no reason to believe that the assumption of independence could be violated. In fact, any covariate but the true radiation exposure is likely to be independent of the surrogate measure used as a counter-matching variable. Therefore, in similar settings, the full efficiency advantages of our proposed method are readily available.
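The way sensitivity and specificity induce correlation between R and G in the Corollary 1 numerical example can be illustrated with a short sketch. It assumes that sensitivity means P(G = 1 | R = 1) and specificity means P(G = 0 | R = 0); this reading is an assumption, although it reproduces the prevalence of G in the tabulated rows (for example, 0.05 × 0.1 + 0.15 × 0.9 = 0.14). The function names are hypothetical.

```python
import math

def prevalence_g(sens, spec, p_r):
    """P(G = 1) implied by sensitivity, specificity and the prevalence of R."""
    return sens * p_r + (1.0 - spec) * (1.0 - p_r)

def phi_correlation(sens, spec, p_r):
    """Pearson (phi) correlation between the binary variables G and R."""
    p_g = prevalence_g(sens, spec, p_r)
    p_joint = sens * p_r                     # P(G = 1, R = 1)
    cov = p_joint - p_g * p_r
    return cov / math.sqrt(p_g * (1 - p_g) * p_r * (1 - p_r))

print(prevalence_g(0.05, 0.85, 0.1))
print(phi_correlation(0.05, 0.85, 0.1))      # negative: G rarer among exposed
print(phi_correlation(0.05, 0.95, 0.1))      # independence: sens = 1 - spec
```

Under this reading, G and R are independent exactly when the sensitivity equals one minus the specificity, which matches the rows of the table in which the bias is zero for every design.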
We also assessed the performance of the proposed method under model misspecification of the underlying disease model via omission of a
Theorem 1 disease model misspecification numerical example. Bias as a function of model misspecification via omission of a
0.1 | 0.1 | −2.77 | −0.01 | 0 | −0.01 | 0 | 0 | −0.01 |
0.1 | 0.1 | −0.69 | −0.01 | 0 | −0.01 | 0 | 0 | −0.01 |
0.1 | 0.1 | 0.69 | 0.01 | 0 | 0.01 | 0 | 0 | 0.01 |
0.1 | 0.1 | 1.39 | 0.03 | 0 | 0.04 | 0 | 0 | 0.04 |
0.1 | 0.1 | 2.77 | 0.1 | 0.01 | 0.13 | 0 | 0.02 | 0.14 |
0.1 | 0.2 | −2.77 | −0.02 | −0.01 | −0.02 | 0 | −0.01 | −0.02 |
0.1 | 0.2 | −1.39 | −0.01 | 0 | −0.02 | 0 | 0 | −0.02 |
0.1 | 0.2 | −0.69 | −0.01 | 0 | −0.01 | 0 | 0 | −0.01 |
0.1 | 0.2 | 0.69 | 0.02 | 0.01 | 0.02 | 0 | 0 | 0.02 |
0.1 | 0.2 | 1.39 | 0.05 | 0.01 | 0.05 | 0 | 0.01 | 0.05 |
0.1 | 0.2 | 2.77 | 0.14 | 0.04 | 0.17 | 0.01 | 0.04 | 0.16 |
0.2 | 0.1 | −2.77 | −0.02 | 0 | −0.02 | 0 | 0 | −0.02 |
0.2 | 0.1 | −1.39 | −0.01 | 0 | −0.02 | 0 | 0 | −0.02 |
0.2 | 0.1 | −0.69 | −0.01 | 0 | −0.01 | 0 | 0 | −0.01 |
0.2 | 0.1 | 0.69 | 0.02 | 0 | 0.02 | 0 | 0 | 0.02 |
0.2 | 0.1 | 1.39 | 0.04 | 0.01 | 0.05 | 0 | 0.01 | 0.06 |
0.2 | 0.1 | 2.77 | 0.13 | 0.02 | 0.16 | 0 | 0.02 | 0.17 |
0.2 | 0.2 | −2.77 | −0.03 | −0.01 | −0.04 | 0 | −0.01 | −0.03 |
0.2 | 0.2 | −1.39 | −0.02 | −0.01 | −0.03 | 0 | −0.01 | −0.03 |
0.2 | 0.2 | −0.69 | −0.02 | 0 | −0.02 | 0 | 0 | −0.02 |
0.2 | 0.2 | 0.69 | 0.03 | 0.01 | 0.03 | 0 | 0.01 | 0.03 |
0.2 | 0.2 | 1.39 | 0.07 | 0.02 | 0.08 | 0 | 0.02 | 0.07 |
0.2 | 0.2 | 2.77 | 0.17 | 0.05 | 0.19 | 0.01 | 0.04 | 0.18 |
Note: *Here
Lastly, simultaneous violations of both the conditions of independence of R and G and misspecification of the underlying disease rates model via omission of a
4 Analysis of deleterious BRCA1/2 mutations and contralateral breast cancer
To illustrate the methods, we consider estimation of the rate ratio for carriers of deleterious BRCA1/2 mutations from the Gene Susceptibility to Radiation Exposure for Second Breast Cancer Risk: the WECARE study data. Since radiation exposure is associated with UBC, causative genes and radiation exposure prior to UBC could be correlated in UBC patients. However, radiation therapy (RT), the R variable in this analysis, occurs after UBC and so is unlikely to be correlated with G. Another possibility is that characteristics of the UBC, such as histology, stage, and size, could be influenced by genotype and also be factors in treatment decisions, thus leading to correlation. However, the WECARE UBC cohort is fairly uniform in stage of disease (UBC was defined as a first primary invasive breast cancer that did not spread beyond the regional lymph nodes at diagnosis), and we found no evidence of correlation between RT and histology or size of the UBC. Thus, we feel that independence is a reasonable assumption. The weighted analysis, including R and G (BRCA1/2 mutation) in the model, yielded G estimates
5 Discussion
We mentioned four large and impactful studies that have adopted the informative counter-matching sampling design because of the gains in statistical efficiency over simple random sampling that it provides. As their names suggest, these studies were initially designed to assess the effect of a particular exposure of interest. However, interest in new potential lifestyle, environmental and genetic risk factors necessitates second-stage analyses of these datasets with respect to new covariates. In the likely case that the second-stage covariates are independent of the counter-matching variable, the proper weighted analysis will lose efficiency compared to simple random sampling. To avoid this loss when counter-matched data are re-used in analyses of covariates independent of the counter-matching variable, we proposed a naive unweighted analysis of the effect of a dichotomous variable under the condition of a multiplicative disease rate model. We showed that this unweighted analysis produces an unbiased estimate of the effect size and, importantly, that the variance of the estimator is smaller than that of the proper weighted analysis estimator. The gains occur for all counter-matching designs and are inversely related to the prevalence of the counter-matching variable: they exceed 80% for prevalences of 0.05 and vanish for prevalences above 0.4. We expect the proposed method to enable several current and future studies to “recover” statistical efficiency in second-stage analyses. Future work should explore possible extensions of our result to settings where the counter-matching variable is categorical with more than two levels. A second, important potential application of the unweighted analysis of counter-matched case-control data is at the design stage of a study.
In this setting, researchers have to decide (among many other things) whether or not to counter-match on some variable(s), especially if there is the potential to consider a broad range of exposures down the road. Thus, if a future comparison of an unweighted analysis of counter-matched case-control data with a standard analysis of data arising from a nested case-control study shows a comparable power for exposures that are independent of the (proposed) counter-matching variable, then the message would be that one “wins” on all fronts: counter-matching coupled with an appropriate weighted analysis gives efficiency gains for the primary exposure of interest and no efficiency loss for other exposures if an unweighted analysis is performed.
Funding statement: This work was supported by NCI grant CA42949 and NIEHS grant 5P30 ES07048. The WECARE study is supported by NCI grants CA97397, CA83178, and CA98438. We thank the WECARE Study Group for making the BRCA1/2 and radiation data available.
References
1. Langholz B, Borgan O. Counter-matching: a stratified nested case-control sampling method. Biometrika 1995;82:69–79. doi:10.1093/biomet/82.1.69
2. Langholz B, Clayton D. Sampling strategies in nested case-control studies. Envir Health Perspect 1994;102:47–51. doi:10.1289/ehp.94102s847
3. Langholz B, Goldstein L. Risk set sampling in epidemiologic cohort studies. Statist Sci 1996;11:35–53. doi:10.1214/ss/1032209663
4. Bernstein J, Langholz B, Haile R. Study design: evaluating gene-environment interactions in the etiology of breast cancer – the WECARE study. Breast Cancer Res 2004;6:199–214. doi:10.1186/bcr771
5. Steenland K, Deddens JA. Increased precision using counter-matching in nested case-control studies. Epidemiology 1997;8:238–42. doi:10.1097/00001648-199705000-00002
6. Cologne JB, Sharp GB, Neriishi K, Verkasalo PK, Land CE, Nakachi K. Improving the efficiency of nested case-control studies of interaction by selecting controls using counter matching on exposure. Int J Epidemiol 2004;33:485–92. doi:10.1093/ije/dyh097
7. Langholz B, Goldstein L. Conditional logistic analysis of case-control studies with complex sampling. Biostatistics 2001;2:63–84. doi:10.1093/biostatistics/2.1.63
8. Oakes D. Survival times: aspects of partial likelihood. Int Statist Rev 1981;49:235–64. doi:10.2307/1402606
9. Langholz B. Use of cohort information in the design and analysis of case-control studies. Scand J Stat 2007;34:120–36. doi:10.1111/j.1467-9469.2006.00548.x
10. Breslow N, Lubin J, Marek P, Langholz B. Multiplicative models and cohort analysis. J Amer Statist Assoc 1983;78:1–12. doi:10.1080/01621459.1983.10477915
11. Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the Cox regression model. Ann Statist 1992;20:1669–2195. doi:10.1214/aos/1176348895
© 2015 Walter de Gruyter GmbH, Berlin/Boston