Article Open Access

Selection bias and multiple inclusion criteria in observational studies

  • Stina Zetterstrom and Ingeborg Waernbaum
Published/Copyright: December 21, 2022

Abstract

Objectives

Spurious associations between an exposure and outcome not describing the causal estimand of interest can be the result of selection of the study population. Recently, sensitivity parameters and bounds have been proposed for selection bias, along the lines of sensitivity analysis previously proposed for bias due to unmeasured confounding. The basis for the bounds is that the researcher specifies values for sensitivity parameters describing associations under additional identifying assumptions. The sensitivity parameters describe aspects of the joint distribution of the outcome, the selection and a vector of unmeasured variables, for each treatment group respectively. In practice, selection of a study population is often made on the basis of several selection criteria, thereby affecting the proposed bounds.

Methods

We extend the previously proposed bounds to give additional guidance for practitioners to construct i) the sensitivity parameters for multiple selection variables and ii) an alternative assumption-free bound, producing only logically feasible values. As a motivating example we derive the bounds for causal estimands in a study of perinatal risk factors for childhood onset Type 1 Diabetes Mellitus where selection of the study population was made by multiple inclusion criteria. To give further guidance for practitioners, we provide a data learner in R where both the sensitivity parameters and the assumption-free bounds are implemented.

Results

The assumption-free bounds can be both smaller and larger than the previously proposed bounds and can serve as an indicator of settings when the former bounds do not produce feasible values. The motivating example shows that the assumption-free bounds may not be appropriate when the outcome or treatment is rare.

Conclusions

Bounds can provide guidance in a sensitivity analysis to assess the magnitude of selection bias. Additional knowledge is used to produce values for sensitivity parameters under multiple selection criteria. The computation of values for the sensitivity parameters is complicated by the multiple inclusion/exclusion criteria, and a data learner in R is provided to facilitate their construction. For comparison and assessment of the feasibility of the bound, an assumption-free bound is provided using solely the underlying assumptions in the framework of potential outcomes.

Introduction

In observational studies, there are several sources of potential biases when estimating a causal effect of a treatment on an outcome of interest. Researchers have addressed two long-familiar types of biases: bias due to unmeasured confounding and selection bias. Implications of violation of the assumption of no unmeasured confounding have received significant attention in the scientific community, and for confounding bias there are different strategies to conduct sensitivity analysis. Examples are using negative controls (Lipsitch, Tchetgen, and Cohen 2010), calculation of the E-value (VanderWeele and Ding 2017), Bayesian sensitivity analysis (McCandless, Gustafson, and Levy 2007) and a class of methods resulting in the construction of bounds, see for example Manski (1990) for an early contribution. Other examples of bounds are Flanders and Khoury (1990), Lee (2011), and Ding and VanderWeele (2016), who evaluate bounds for the bias for different parameter values in the data generating process. In other examples, the bounds depend only on the observed data (MacLehose et al. 2005; Robins 1989; Sjölander 2020). The assumption-free bounds proposed by Sjölander (2020) included a comparison to the bounds derived by Ding and VanderWeele (2016), and described a sharp region for the latter. In this work we explore the bound strategy for the case of selection bias.

From the research question and the specification of the study population, subjects are included or excluded in the study based on different selection criteria, see Goetghebeur et al. (2020) for a discussion on the role of the study population when targeting a causal estimand. Commonly, the selected subjects can be viewed as a subpopulation of the total population. The selection criteria can be formed from a single, or more often, from multiple variables. If the criteria imply that the causal estimand is not identifiable from the data, the inclusion/exclusion results in selection bias. A review of selection bias was presented by Hernán, Hernández-Díaz, and Robins (2004) with real-world examples and corresponding causal graphs. Strategies to prevent selection bias in the design stage, and to correct for it in the analysis, were also discussed. Recently, Smith (2020) presented a review of selection bias where, in addition to real-world examples and design and analysis strategies, sensitivity analyses similar to those presented for unmeasured confounding are discussed.

Selection bias can have different sources depending on the target population of the study. If the study aims at generalizing its results to the total population, selection bias arises whenever the selection criteria alter the causal estimand (Hernán 2017). On the other hand, if the interest lies in the studied subpopulation, selection bias arises if the selection is made on variables that depend, directly or indirectly, on the treatment and outcome in a dependence structure involving colliders.

Two dependence structures, the M-structure and the butterfly structure (Greenland, Pearl, and Robins 1999), are specific types of model structures where selection bias can arise. Liu et al. (2012) and Ding and Miratrix (2015) studied selection bias for these two model structures. Through simulation studies, they observed that selection bias is often less severe than the bias that arises from unmeasured confounding. However, there are settings in which the selection bias can be substantial. Similar results were presented for simulated cohort studies (Pizzi et al. 2011; Whitcomb and McArdle 2016). Whitcomb and McArdle (2016) demonstrated that selection bias should be considered if the selection is affected by an interaction effect between treatment and an unmeasured risk factor for the outcome. The same conclusion was reached by Mayeda et al. (2016) for survival bias. These, and similar studies, motivate the need for further sensitivity analyses for selection bias.

Several bounds for selection bias have been presented. The bounds are constructed based on the strengths of the dependencies in the data generating process. For instance, Greenland (2003) specified a maximum of the M-bias for the odds ratio under the assumption that the causal odds ratios (i.e. the odds ratios between connected nodes in the M-structure) are equal. This assumption limits the applicability of the derived bound. Huang and Lee (2015) derived a joint bound for the confounding odds ratio (the ratio between the observed odds ratio and the standardized odds ratio) for multiple sources of biases, including selection bias. Flanders and Ye (2019) derived a bound for the selection bias for the ratio between the observed risk ratio and the standardized risk ratio in the M-structure and other similar structures. Huang and Lee (2015) and Flanders and Ye (2019) give examples of when their bounds are valid, but neither provides sufficient conditions. For a binary outcome and treatment, Smith and VanderWeele (2019) proposed bounds for the selection bias for the relative risk and the risk difference in the total population and in the selected subpopulation. The bounds are valid under conditional independence assumptions involving unmeasured variables, the selection variable and the outcome. Furthermore, and similar to proposals for sensitivity analyses for unmeasured confounding, the bounds rely on values of sensitivity parameters proposed by the researcher. These values should ideally be based on scientific judgment and subject-matter knowledge, although they involve unmeasured variables that are not available in the data at hand. A combination of bounds for unmeasured confounding, measurement error and selection bias was proposed by Smith, Mathur, and VanderWeele (2021) together with an implementation in the R-package EValue.

In this paper we build on the results of Smith and VanderWeele (2019), hereafter referred to as SV, by applying the bounds to a study of perinatal risk factors of type 1 diabetes mellitus (Waernbaum, Dahlquist, and Lind 2019). A study population was selected from an incidence register through inclusion criteria defined by several variables, a common setting facing the practicing data analyst. The estimands of interest are the causal relative risk in both the total and the selected population. In contrast to the requirement of providing sensitivity parameters, which can be a challenging task for the researcher, we additionally construct an assumption-free bound, along the lines of Sjölander (2020). The assumption-free bounds only incorporate restrictions that are given by the observed data. We find that the assumption-free bounds can be smaller or larger than the bounds proposed by SV. To illustrate our results and to give a tool for practitioners, we provide a numerical learner in R with a prototypical data generating process producing bounds for selection bias.

The paper is outlined as follows. First, we present the model, notation and introduce selection bias for the causal estimands under study. Second, the bias bounds derived by SV are described and extended for the multiple selection case using an illustrative example. Next, the assumption-free bounds are derived. Numerical examples of the bounds are presented along with the numerical learner and data example. Finally, the results of the paper are discussed.

Causal framework and selection bias

In this work, the selection bias is evaluated for a set of causal estimands. These causal estimands are defined using potential outcomes (Rubin 1974). We assume that each subject has a binary treatment, $T \in \{0, 1\}$, that is 1 if a subject is treated and 0 otherwise, and binary potential outcomes, $Y(t) \in \{0, 1\}$, for $t = 0, 1$. For each subject, we observe a vector of pre-treatment covariates, $X$, and binary selection variables, $S_k \in \{0, 1\}$, $k = 1, \ldots, K$, from which we define a selection indicator function:

$I_S = \begin{cases} 1 & \text{if } \prod_{k=1}^{K} S_k = 1, \\ 0 & \text{otherwise.} \end{cases}$

The selection indicator captures the study population construction from multiple selection variables. A subject is only included in the study if all selection variables are equal to 1, and excluded from the study if at least one selection variable is equal to 0. The construction of the selection indicator variable for the special cases $K = 1$ and $K = 2$ is illustrated in the flow charts in Figure 1.
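The product rule for the selection indicator can be sketched in a few lines of Python. This is only an illustration: the function name `selection_indicator` and the example subjects are hypothetical and are not part of the R learner accompanying the paper.

```python
from math import prod


def selection_indicator(selection_values):
    """Return I_S: 1 if every binary selection variable S_k equals 1, else 0."""
    if any(s not in (0, 1) for s in selection_values):
        raise ValueError("selection variables must be binary (0 or 1)")
    # The product of binary variables is 1 only when all of them are 1.
    return 1 if prod(selection_values) == 1 else 0


# Hypothetical subjects with K = 2 selection variables (S_1, S_2).
subjects = [(1, 1), (1, 0), (0, 1), (0, 0)]
indicators = [selection_indicator(s) for s in subjects]
print(indicators)  # only the first subject satisfies both criteria
```

Only a subject meeting every inclusion criterion receives an indicator of 1, which mirrors the double-selection flow chart in Figure 1B.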

Figure 1: Flow chart illustrating prototypical single (A) and multiple (B) selections for a study population.

Throughout the paper we assume consistency, i.e. the observed outcome, Y, is the potential outcome under the received treatment, Y=Y(T), and we refer to the value of Y=1 as “success”. We further assume a vector of unobserved covariates, denoted U. In the following we assume conditional exchangeability in the total population:

Assumption 1

$Y(t) \perp\!\!\!\perp T \mid X$, $t = 0, 1$.

This means that the potential outcomes in the total population are independent of the treatment, conditional on the observed covariates. Following the previous papers, the covariates are suppressed and throughout we assume that derivations are performed within strata of covariates X.

Selection bias is used to describe the extent to which an estimated parameter differs from the true estimand in the study population, due to selection on one or several observed variables. We focus on two common causal estimands for a binary treatment and binary outcome, the causal relative risk and the causal risk difference in both the total population,

(1) $\beta_R = \dfrac{P(Y(1) = 1)}{P(Y(0) = 1)}$

(2) $\beta_D = P(Y(1) = 1) - P(Y(0) = 1)$,

and in the selected subpopulation,

(3) $\beta_R^S = \dfrac{P(Y(1) = 1 \mid I_S = 1)}{P(Y(0) = 1 \mid I_S = 1)}$

(4) $\beta_D^S = P(Y(1) = 1 \mid I_S = 1) - P(Y(0) = 1 \mid I_S = 1)$.

Using the observed outcome, $Y$, under selection, $I_S = 1$, an estimator would target the estimands $\beta_R^{\text{obs}}$ and $\beta_D^{\text{obs}}$ defined as

(5) $\beta_R^{\text{obs}} = \dfrac{P(Y = 1 \mid T = 1, I_S = 1)}{P(Y = 1 \mid T = 0, I_S = 1)}$,

(6) $\beta_D^{\text{obs}} = P(Y = 1 \mid T = 1, I_S = 1) - P(Y = 1 \mid T = 0, I_S = 1)$.

We refer to these as the observed estimands due to the observed outcomes, even though they are unknown population quantities. Note that $\beta_R^{\text{obs}}$ is the same for both $\beta_R$ and $\beta_R^S$, and that $\beta_D^{\text{obs}}$ is the same for both $\beta_D$ and $\beta_D^S$. Similar to SV and Sjölander (2020), we ignore the sampling variability and treat the observed means as the corresponding population estimands in the presentation of the bounds, since they are the asymptotic counterparts of the sample means, see Appendix A.
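As a minimal sketch, the sample analogues of the observed estimands (5) and (6) can be computed from individual records restricted to the selected subjects. The function name `observed_estimands`, the record layout `(t, y, i_s)` and the counts below are hypothetical and chosen only for illustration.

```python
def observed_estimands(records):
    """Sample analogues of the observed relative risk and risk difference.

    records: iterable of (t, y, i_s) tuples with binary entries.
    Only selected subjects (i_s == 1) contribute, as in equations (5) and (6).
    """
    risk = {}
    for t in (0, 1):
        outcomes = [y for (tt, y, i_s) in records if i_s == 1 and tt == t]
        if not outcomes:
            raise ValueError(f"no selected subjects with T={t}")
        risk[t] = sum(outcomes) / len(outcomes)
    if risk[0] == 0:
        raise ValueError("relative risk undefined: no successes among selected controls")
    return risk[1] / risk[0], risk[1] - risk[0]


# Hypothetical data: (treatment, outcome, selection indicator).
records = (
    [(1, 1, 1)] * 4 + [(1, 0, 1)] * 6    # selected treated: 4 of 10 successes
    + [(0, 1, 1)] * 2 + [(0, 0, 1)] * 8  # selected controls: 2 of 10 successes
    + [(1, 1, 0)] * 3 + [(0, 0, 0)] * 5  # unselected subjects, ignored
)
beta_r_obs, beta_d_obs = observed_estimands(records)
print(beta_r_obs, beta_d_obs)
```

The unselected records do not enter either quantity, which is why the same observed estimand is compared against both the total-population and the subpopulation causal estimand.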

The selection, $I_S = 1$, can bias the estimation in two ways. First, the causal estimands in the total population and the subpopulation might not be the same, for example $\beta_R \neq \beta_R^S$, meaning that even if the subpopulation parameter can be consistently estimated, generalizations from the subpopulation to the total population cannot be made. Second, (conditional) exchangeability may be violated in the subpopulation, i.e. the causal estimands (3) and (4) in the subpopulation are not identified from the observed data:

$P(Y = 1 \mid T = t, I_S = 1) = P(Y(t) = 1 \mid T = t, I_S = 1) \neq P(Y(t) = 1 \mid I_S = 1)$,

where a typical violation of the last equality is due to conditioning on colliders. The selection bias for each causal estimand is defined in Table 1. For the causal relative risks, $\beta_R$ and $\beta_R^S$, the selection bias is defined as a ratio, and for the causal risk differences, $\beta_D$ and $\beta_D^S$, the selection bias is defined as a difference.

Table 1:

Definition and selection bias for $\beta_R$, $\beta_D$, $\beta_R^S$ and $\beta_D^S$.

| Estimand | Causal | Observed | Bias |
|---|---|---|---|
| Rel. risk, tot. pop. | $\beta_R = \dfrac{P(Y(1)=1)}{P(Y(0)=1)}$ | $\beta_R^{\text{obs}} = \dfrac{P(Y=1 \mid T=1, I_S=1)}{P(Y=1 \mid T=0, I_S=1)}$ | $\text{bias}(\beta_R) = \beta_R^{\text{obs}} / \beta_R$ |
| Risk diff., tot. pop. | $\beta_D = P(Y(1)=1) - P(Y(0)=1)$ | $\beta_D^{\text{obs}} = P(Y=1 \mid T=1, I_S=1) - P(Y=1 \mid T=0, I_S=1)$ | $\text{bias}(\beta_D) = \beta_D^{\text{obs}} - \beta_D$ |
| Rel. risk, subpop. | $\beta_R^S = \dfrac{P(Y(1)=1 \mid I_S=1)}{P(Y(0)=1 \mid I_S=1)}$ | $\beta_R^{\text{obs}} = \dfrac{P(Y=1 \mid T=1, I_S=1)}{P(Y=1 \mid T=0, I_S=1)}$ | $\text{bias}(\beta_R^S) = \beta_R^{\text{obs}} / \beta_R^S$ |
| Risk diff., subpop. | $\beta_D^S = P(Y(1)=1 \mid I_S=1) - P(Y(0)=1 \mid I_S=1)$ | $\beta_D^{\text{obs}} = P(Y=1 \mid T=1, I_S=1) - P(Y=1 \mid T=0, I_S=1)$ | $\text{bias}(\beta_D^S) = \beta_D^{\text{obs}} - \beta_D^S$ |

Since the causal estimand is unknown, the magnitude of the selection bias is unknown. SV define upper bounds for the selection bias defined in Table 1 under an assumption of a positive bias, i.e.

(7) $\theta^{\text{obs}} > \theta$,

where $\theta \in (\beta_R, \beta_D, \beta_R^S, \beta_D^S)$ and $\theta^{\text{obs}} \in (\beta_R^{\text{obs}}, \beta_D^{\text{obs}})$. Note that (7) is a purely technical assumption, since if it does not hold, the coding for the treatment can be reversed. To calculate the bounds for the estimands in the total population and in the selected subpopulation, two different assumptions are needed.

Assumption 2

(Total population estimands $\beta_R$ and $\beta_D$)

For some unmeasured variable(s) $U$: $Y \perp\!\!\!\perp I_S \mid (T = t, U = u)$, for $t = 0, 1$.

Assumption 3

(Subpopulation estimands $\beta_R^S$ and $\beta_D^S$)

For some unmeasured variable(s) $U$: $Y(t) \perp\!\!\!\perp T \mid (I_S = 1, U = u)$, for $t = 0, 1$.

In SV, Assumptions 2 and 3 are stated for an unspecified dimension of $U$, although the bounds are described only for a single $U$. Below, we expand and further explore the bounds for the selection indicator $I_S$ and a vector $U$. The assumptions are illustrated for an extension of the M-structure in the directed acyclic graph (DAG) in Figure 2. The results of SV are derived without consideration of $V$. Here, we similarly do not explicitly include $V$ in our assumptions.

Figure 2: Representation of an extension of the M-structure using a DAG for the variables ($I_S$, $T$, $U$, $V$, $Y$).

Single and multi-selection bounds

The bounds proposed by SV describe selection bias for a single selection variable, S. In this section, we first describe the SV bounds and secondly investigate how the bounds change when additional selections are made. We use the terms single-selection and multi-selection to describe these cases.

Bounds from Smith and VanderWeele (2019)

Hereafter, we denote the SV bounds by $\mathcal{B}(\cdot)$ for the corresponding estimand; for example, $\mathcal{B}(\beta_R)$ is the bound for $\beta_R$. Below, we outline the bounds and the parameters they are constructed from.

Total population

The bound for the relative risk in the total population depends on the independence of $Y$ and $I_S$ conditional on $U$, for each treatment group. It is defined using four bias bound parameters ($RR_{UY \mid T=t}$ and $RR_{SU \mid T=t}$, $t = 0, 1$), tabulated in Table 2.

Table 2:

Definitions and interpretations for the bias bound parameters used in the bounds for $\beta_R$ and $\beta_D$.

| Bias bound parameter | Definition | Interpretation |
|---|---|---|
| $RR_{UY \mid T=1}$ | $\dfrac{\max_u P(Y=1 \mid T=1, U=u)}{\min_u P(Y=1 \mid T=1, U=u)}$ | Maximum relative risk for $Y=1$ comparing two values of $U$ within stratum $T=1$. |
| $RR_{UY \mid T=0}$ | $\dfrac{\max_u P(Y=1 \mid T=0, U=u)}{\min_u P(Y=1 \mid T=0, U=u)}$ | Maximum relative risk for $Y=1$ comparing two values of $U$ within stratum $T=0$. |
| $RR_{SU \mid T=1}$ | $\max_u \dfrac{P(U=u \mid T=1, I_S=1)}{P(U=u \mid T=1, I_S=0)}$ | Maximum selection ratio with respect to $U$ for the treated. |
| $RR_{SU \mid T=0}$ | $\max_u \dfrac{P(U=u \mid T=0, I_S=0)}{P(U=u \mid T=0, I_S=1)}$ | Maximum non-selection ratio with respect to $U$ for the controls. |

The bound is the product of two parts, the selection bias with respect to the treated group and the selection bias with respect to the control group. It is defined as

(8) $\mathcal{B}(\beta_R) = BF_1 \cdot BF_0$,

where

$BF_1 = \dfrac{RR_{UY \mid T=1} \cdot RR_{SU \mid T=1}}{RR_{UY \mid T=1} + RR_{SU \mid T=1} - 1}$,

$BF_0 = \dfrac{RR_{UY \mid T=0} \cdot RR_{SU \mid T=0}}{RR_{UY \mid T=0} + RR_{SU \mid T=0} - 1}$.

For the risk difference, the bound relies on the observed probability of success, in the selected part of the population, for each treatment group, as well as the four bias bound parameters in Table 2. It is given by

(9) $\mathcal{B}(\beta_D) = BF_1 - P(Y = 1 \mid T = 1, I_S = 1) / BF_1 + P(Y = 1 \mid T = 0, I_S = 1) \cdot BF_0$,

where $BF_1$ and $BF_0$ are defined as above.
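The total-population bounds (8) and (9) are simple enough to evaluate directly. The sketch below is an illustration only; the function names are hypothetical and this is not the authors' R learner or the EValue package. The inputs are the single-selection planning values that SV use for the Zika virus example discussed in Example 1.

```python
def bounding_factor(rr_uy, rr_su):
    """BF = RR_UY * RR_SU / (RR_UY + RR_SU - 1) for one treatment group."""
    # Both parameters are maximal ratios and therefore cannot be below 1.
    if rr_uy < 1 or rr_su < 1:
        raise ValueError("bias bound parameters must be at least 1")
    return rr_uy * rr_su / (rr_uy + rr_su - 1)


def sv_bound_rr_total(rr_uy_t1, rr_su_t1, rr_uy_t0, rr_su_t0):
    """Equation (8): product of the treated and control bounding factors."""
    return bounding_factor(rr_uy_t1, rr_su_t1) * bounding_factor(rr_uy_t0, rr_su_t0)


def sv_bound_rd_total(p_y_t1, p_y_t0, rr_uy_t1, rr_su_t1, rr_uy_t0, rr_su_t0):
    """Equation (9), with p_y_t = P(Y=1 | T=t, I_S=1)."""
    bf1 = bounding_factor(rr_uy_t1, rr_su_t1)
    bf0 = bounding_factor(rr_uy_t0, rr_su_t0)
    return bf1 - p_y_t1 / bf1 + p_y_t0 * bf0


# Planning values quoted in the text for the single-selection case.
bound_rr = sv_bound_rr_total(2.0, 1.7, 2.0, 1.5)
bound_rd = sv_bound_rd_total(0.065, 0.001, 2.0, 1.7, 2.0, 1.5)
print(round(bound_rr, 2), round(bound_rd, 2))  # 1.51 1.21
```

Rounded to two decimals, these agree with the values 1.51 and 1.21 reported for the single-selection case.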

Subpopulation

For the selected subpopulation, the bound for the relative risk is constructed from the dependencies between $U$ and $Y$, and $U$ and $T$ in the selected subpopulation. The bound depends on the parameters $RR_{UY \mid I_S=1}$ and $RR_{TU \mid I_S=1}$, tabulated in Table 3.

Table 3:

Definitions and interpretations for the bias bound parameters used in the bounds for $\beta_R^S$ and $\beta_D^S$.

| Bias bound parameter | Definition | Interpretation |
|---|---|---|
| $RR_{UY \mid I_S=1}$ | $\max_t \dfrac{\max_u P(Y=1 \mid T=t, I_S=1, U=u)}{\min_u P(Y=1 \mid T=t, I_S=1, U=u)}$ | Maximum relative risk for $Y=1$ comparing two values of $U$ in either stratum $T=1$ or $T=0$. |
| $RR_{TU \mid I_S=1}$ | $\max_u \dfrac{P(U=u \mid T=1, I_S=1)}{P(U=u \mid T=0, I_S=1)}$ | Maximum treatment ratio with respect to $U$ for the selected subpopulation. |

The bound is defined as

(10) $\mathcal{B}(\beta_R^S) = BF_U = \dfrac{RR_{UY \mid I_S=1} \cdot RR_{TU \mid I_S=1}}{RR_{UY \mid I_S=1} + RR_{TU \mid I_S=1} - 1}$.

The bound for the risk difference in the subpopulation is the maximum of two expressions, both including $RR_{UY \mid I_S=1}$ and $RR_{TU \mid I_S=1}$. One includes the observed probability of success in the subpopulation for the controls, and the other for the treated. Its definition is

(11) $\mathcal{B}(\beta_D^S) = \max\left\{ P(Y = 1 \mid T = 0, I_S = 1) \, (BF_U - 1), \; P(Y = 1 \mid T = 1, I_S = 1) \, (1 - 1/BF_U) \right\}$.
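The subpopulation bounds (10) and (11) can be evaluated in the same way. The sketch below uses hypothetical function names and is not the authors' implementation. The inputs are the single-selection planning values used for the subpopulation in Example 1.

```python
def bounding_factor_u(rr_uy_sel, rr_tu_sel):
    """Equation (10): BF_U from RR_UY|I_S=1 and RR_TU|I_S=1."""
    return rr_uy_sel * rr_tu_sel / (rr_uy_sel + rr_tu_sel - 1)


def sv_bound_rd_sub(p_y_t1, p_y_t0, rr_uy_sel, rr_tu_sel):
    """Equation (11), with p_y_t = P(Y=1 | T=t, I_S=1)."""
    bf_u = bounding_factor_u(rr_uy_sel, rr_tu_sel)
    return max(p_y_t0 * (bf_u - 1), p_y_t1 * (1 - 1 / bf_u))


# Planning values quoted in the text for the single-selection case.
bound_rr_sub = bounding_factor_u(2.0, 2.55)
bound_rd_sub = sv_bound_rd_sub(0.065, 0.001, 2.0, 2.55)
print(round(bound_rr_sub, 2), round(bound_rd_sub, 2))  # 1.44 0.02
```

With these inputs the treated term dominates the maximum in (11), because the observed probability of success is much larger among the treated than among the controls.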

Multi-selection bounds

The bounds derived by SV are investigated for one selection variable. However, in practice, the inclusion criteria are often based on several variables. In this section, we discuss the SV bounds in the case of multi-selection. To distinguish between the bias bound parameters in the single and multiple selection scenarios, we write $(K)$ after each quantity to indicate the number of selections that are made, e.g. $RR_{SU \mid T=1}(1)$ is $RR_{SU \mid T=1}$ in the single-selection case, i.e. conditioning on only $S_1$ (see Figure 1A), and $RR_{SU \mid T=1}(2)$ is $RR_{SU \mid T=1}$ in the double-selection case, i.e. conditioning on both $S_1$ and $S_2$ (see Figure 1B).

To investigate the effects of multi-selection, we take the derivative of the SV bounds with respect to $I_S$. The differentiated bounds are given in Table 9, Appendix B. As an additional selection is made, the bound can either increase or decrease, depending on the influence of the additional selection on the bias bound parameters and the probabilities of success, and on the influence of the potential change in $U$. For instance, if the effect of $U$ on the bias bound parameters is constant and both $RR_{SU \mid T=1}(2) \geq RR_{SU \mid T=1}(1)$ and $RR_{SU \mid T=0}(2) \geq RR_{SU \mid T=0}(1)$, the SV bound for the relative risk in the total population increases as the additional selection is made. If the effect of $U$ on the bias bound parameters is constant and $RR_{SU \mid T=1}(2) \leq RR_{SU \mid T=1}(1)$ and $RR_{SU \mid T=0}(2) \leq RR_{SU \mid T=0}(1)$, the SV bound for the relative risk in the total population decreases with the additional selection. However, if $RR_{SU \mid T=1}(2) \geq RR_{SU \mid T=1}(1)$ and $RR_{SU \mid T=0}(2) \leq RR_{SU \mid T=0}(1)$, or vice versa, the SV bound for the relative risk in the total population either increases or decreases, depending on how much the two bias bound parameters change. The circumstances for these three cases (the bound increases, the bound decreases, or the bound either increases or decreases) for all four estimands, under the assumption of a constant effect of $U$, are summarized in Table 4. Throughout we use $K = 2$ and $K = 1$ as examples, although the reasoning holds for any $K + 1$ and $K$.

Table 4:

Criteria for changes in the SV bound with an additional selection, under the assumption that the effect of $U$ is constant.

| Estimand | $\mathcal{B}(\cdot)$ increases with an additional selection | $\mathcal{B}(\cdot)$ decreases with an additional selection | $\mathcal{B}(\cdot)$ increases or decreases with an additional selection |
|---|---|---|---|
| $\beta_R$ | $RR_{SU \mid T=t}(2)$, $t = 0, 1$, increase or stay constant. | $RR_{SU \mid T=t}(2)$, $t = 0, 1$, decrease or stay constant. | $RR_{SU \mid T=t}(2)$ increase, $RR_{SU \mid T=1-t}(2)$ decrease, $t = 0, 1$. |
| $\beta_D$ | $RR_{SU \mid T=t}(2)$, $t = 0, 1$, $P(Y=1 \mid T=0, I_S=1)(2)$ increase or stay constant, $P(Y=1 \mid T=1, I_S=1)(2)$ decrease or stay constant. | $RR_{SU \mid T=t}(2)$, $t = 0, 1$, $P(Y=1 \mid T=0, I_S=1)(2)$ decrease or stay constant, $P(Y=1 \mid T=1, I_S=1)(2)$ increase or stay constant. | $RR_{SU \mid T=t}(2)$ increase, $RR_{SU \mid T=1-t}(2)$ decrease, $P(Y=1 \mid T=t, I_S=1)(2)$ increase, $P(Y=1 \mid T=1-t, I_S=1)(2)$ decrease, $t = 0, 1$. |
| $\beta_R^S$ | $RR_{TU \mid I_S=1}(2)$, $RR_{UY \mid I_S=1}(2)$ increase or stay constant. | $RR_{TU \mid I_S=1}(2)$, $RR_{UY \mid I_S=1}(2)$ decrease or stay constant. | $RR_{TU \mid I_S=1}(2)$, $RR_{UY \mid I_S=1}(2)$ do not both in/decrease. |
| $\beta_D^S$ | $RR_{TU \mid I_S=1}(2)$, $RR_{UY \mid I_S=1}(2)$, $P(Y=1 \mid T=t, I_S=1)(2)$ increase or stay constant, $t = 0$ or $t = 1$. | $RR_{TU \mid I_S=1}(2)$, $RR_{UY \mid I_S=1}(2)$, $P(Y=1 \mid T=t, I_S=1)(2)$ decrease or stay constant, $t = 0$ or $t = 1$. | $RR_{TU \mid I_S=1}(2)$, $RR_{UY \mid I_S=1}(2)$, $P(Y=1 \mid T=t, I_S=1)(2)$ do not all in/decrease, $t = 0$ or $t = 1$. |

From Table 4, we see that several influences can contribute to changes in the bounds in different directions and, as a consequence, it is not straightforward to anticipate how the bound will change with additional inclusion criteria. SV illustrate their bounds with an example of the effect of Zika virus on microcephaly. In Example 1, we extend their example to illustrate changes in the bounds due to a second selection.

Example 1

Zika virus and microcephaly

In northeast Brazil, after an outbreak of Zika virus ($T$), it was found that there was an increase in microcephaly cases ($Y$). A case-control study estimated the adjusted odds ratio to be 73.1 (de Araújo et al. 2018). However, in the study only live births and stillbirths were recorded ($S_1 = 1$), and pregnancies that ended with a miscarriage or an abortion were not included ($S_1 = 0$). If the probability of a pregnancy termination is affected by both being infected by Zika virus and microcephaly, the estimated odds ratio can be biased. Socioeconomic conditions, such as lack of medical care, mother's education and marital status, have been shown to affect the probability of having microcephaly (Silva et al. 2018) and can be thought of as a vector of unmeasured variables, $U$. For the single-selection case, SV use the planning values for the bias bound parameters $RR_{UY \mid T=1} = RR_{UY \mid T=0} = 2$, $RR_{SU \mid T=1}(1) = 1.7$ and $RR_{SU \mid T=0}(1) = 1.5$, which results in the bound $\mathcal{B}(\beta_R(1)) = 1.51$.

Since only births in public hospitals were used in the study (de Araújo et al. 2018), we use this as a second selection variable. Furthermore, we assume that the same unmeasured variables, $U$, fulfill the assumptions after the second selection, and that the effect of $U$ on the bias bound parameters is constant. We define the difference $\Delta = \mathcal{B}(\beta(2)) - \mathcal{B}(\beta(1))$. In Figure 3A we see that $\mathcal{B}(\beta_R)$ increases if both $RR_{SU \mid T=1}$ and $RR_{SU \mid T=0}$ increase. Furthermore, we note that the bound can either increase, decrease or stay constant if $RR_{SU \mid T=t}$ increases and $RR_{SU \mid T=1-t}$ decreases. Note that in order for the bound to stay constant when $RR_{SU \mid T=t}$ approaches 1, $RR_{SU \mid T=1-t}$ must grow at an increasing rate. This can be seen from the definition of selection bias together with the bound. If $RR_{SU \mid T=t}$ approaches 1, then $BF_t(2)$ approaches 1, and thus $BF_{1-t}(2)$ determines $\mathcal{B}(\beta_R)$. Also, if $RR_{SU \mid T=t}$ approaches 1, then the potential selection bias in treatment group $t$ approaches 1. In this case, the bound will decrease unless the potential selection bias in the other treatment group increases substantially.

For the risk difference in the total population, the planning values of the probabilities are chosen as $P(Y = 1 \mid T = 1, S_1 = 1) = 0.065$ and $P(Y = 1 \mid T = 0, S_1 = 1) = 0.001$ in the single-selection case. This results in the bound $\mathcal{B}(\beta_D) = 1.21$. With an additional selection variable, we set them to $P(Y = 1 \mid T = 1, S_1 = 1, S_2 = 1) = 0.08$ and $P(Y = 1 \mid T = 0, S_1 = 1, S_2 = 1) = 0.002$. In Figure 3B we see that in this case it is mostly $RR_{SU \mid T=1}(2)$ that influences the difference in the bounds. From the definition of $\mathcal{B}(\beta_D)$, we note that this is because the observed probability of success is very small in both treatment groups.

For the relative risk in the subpopulation, the values of the bias bound parameters in the case with a single selection variable are $RR_{TU \mid I_S=1} = 2.55$ and $RR_{UY \mid I_S=1} = 2$. These values are chosen to be coherent with the bias bound parameters for the total population, and give the bound $\mathcal{B}(\beta_R^S) = 1.44$. In Figure 4A we see a pattern similar to that for the relative risk in the total population: when both bias bound parameters increase, the bound increases; when both decrease, the bound decreases; and when one increases and the other decreases, the bound can either increase, decrease or stay constant.

For the risk difference in the subpopulation, the single-selection bound is $\mathcal{B}(\beta_D^S) = 0.02$. The difference between the bounds after the first and second selection is small for all values of the bias bound parameters, see Figure 4B. The bound is proportional to the observed probabilities of success, and thus the bound is small when $P(Y = 1 \mid T = t, I_S = 1)$, $t = 0, 1$, are small. Since the probability of microcephaly is low, the bound, and by extension the difference between the bounds, is small. □
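The sign of the difference between the double-selection and single-selection bounds for the total-population relative risk can be explored numerically. In the sketch below, the single-selection planning values are those of Example 1, whereas the four pairs of second-selection values for $RR_{SU \mid T=1}(2)$ and $RR_{SU \mid T=0}(2)$ are hypothetical and chosen only to illustrate the cases described above. The function names are likewise assumptions.

```python
def bounding_factor(rr_uy, rr_su):
    """BF = RR_UY * RR_SU / (RR_UY + RR_SU - 1) for one treatment group."""
    return rr_uy * rr_su / (rr_uy + rr_su - 1)


def sv_bound_rr_total(rr_uy_t1, rr_su_t1, rr_uy_t0, rr_su_t0):
    """SV bound for the relative risk in the total population."""
    return bounding_factor(rr_uy_t1, rr_su_t1) * bounding_factor(rr_uy_t0, rr_su_t0)


RR_UY = 2.0  # held constant across selections, as assumed in Example 1
single = sv_bound_rr_total(RR_UY, 1.7, RR_UY, 1.5)

# Hypothetical second-selection values (RR_SU|T=1(2), RR_SU|T=0(2)).
scenarios = {
    "both increase": (2.0, 1.8),
    "both decrease": (1.4, 1.3),
    "mixed, bound falls": (2.0, 1.2),
    "mixed, bound rises": (2.5, 1.4),
}

deltas = {}
for label, (rr_su_t1_2, rr_su_t0_2) in scenarios.items():
    double = sv_bound_rr_total(RR_UY, rr_su_t1_2, RR_UY, rr_su_t0_2)
    deltas[label] = double - single  # difference between double and single selection
    print(f"{label}: delta = {deltas[label]:+.3f}")
```

The two mixed scenarios show that when one parameter increases and the other decreases, the direction of the change depends on the relative size of the two movements.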

Figure 3: Difference between the bounds ($\Delta = \mathcal{B}(\beta(2)) - \mathcal{B}(\beta(1))$) in the total population when a second selection is made in the Zika virus/microcephaly example, $RR_{UY \mid T=1} = RR_{UY \mid T=0} = 2$, for (A) the relative risk and (B) the risk difference.

Figure 4: Difference between the bounds ($\Delta = \mathcal{B}(\beta(2)) - \mathcal{B}(\beta(1))$) in the subpopulation when a second selection is made in the Zika virus/microcephaly example, for (A) the relative risk and (B) the risk difference.

Assumption-free bounds for selection bias

Similar to the bounds for unmeasured confounding discussed in Sjölander (2020), we found that the SV bounds can give values that are outside the possible range of the bias. For the risk difference estimands, this is evident when the bounds give values that are larger than 2. For the relative risk estimands, the bounds that are too large are not as easily detected, but they exist. The definition of the estimands, the definition of the bias and the observed distribution of the data imply restrictions on the bias bounds. In other words, we can calculate a minimum value of the causal estimand from the observed data, and thus find another bias bound. This procedure bounds the bias from above. However, if the bias is negative, the treatment can be recoded. In order for the SV bounds to be meaningful, they should give values that are inside a region of feasible values, i.e. values that are smaller than the restriction from the minimal causal estimand. In the following, we denote the assumption-free bounds by $\tilde{\mathcal{B}}(\cdot)$ for the corresponding estimand.

The definition of the bias, for the relative risk in the total population, is $\text{bias}(\beta_R) = \beta_R^{\text{obs}} / \beta_R$. Since $\beta_R^{\text{obs}}$ is known from data, a maximal value of $\text{bias}(\beta_R)$ occurs when $\beta_R$ is minimized. We find the minimal estimand

$\beta_R^{\min} = \dfrac{P(Y(1) = 1)_{\min}}{P(Y(0) = 1)_{\max}} = \dfrac{P(Y = 1 \mid T = 1, I_S = 1) \, P(T = 1 \mid I_S = 1) \, P(I_S = 1)}{\min\left[ P(T = 1 \mid I_S = 1) P(I_S = 1) + 2 P(I_S = 0) + P(Y = 1 \mid T = 0, I_S = 1) P(T = 0 \mid I_S = 1) P(I_S = 1), \; 1 \right]}$.

From the minimal estimand, we can find an assumption-free bound,

$\tilde{\mathcal{B}}(\beta_R) = \dfrac{\beta_R^{\text{obs}}}{\beta_R^{\min}} = \dfrac{\min\left[ P(T = 1 \mid I_S = 1) P(I_S = 1) + 2 P(I_S = 0) + P(Y = 1 \mid T = 0, I_S = 1) P(T = 0 \mid I_S = 1) P(I_S = 1), \; 1 \right]}{P(Y = 1 \mid T = 0, I_S = 1) \, P(T = 1 \mid I_S = 1) \, P(I_S = 1)}$.

These probabilities can be estimated directly from the data without additional assumptions, and thus the bound is assumption free.
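The step from the minimal estimand to the assumption-free bound is a cancellation of the observed success probability among the treated. As a worked sketch, write $p_t = P(Y = 1 \mid T = t, I_S = 1)$ as shorthand (this abbreviation is introduced here for readability and is not the paper's notation):

```latex
\tilde{\mathcal{B}}(\beta_R)
  = \frac{\beta_R^{\text{obs}}}{\beta_R^{\min}}
  = \frac{p_1}{p_0} \cdot
    \frac{P(Y(0) = 1)_{\max}}{p_1 \, P(T = 1 \mid I_S = 1) \, P(I_S = 1)}
  = \frac{P(Y(0) = 1)_{\max}}{p_0 \, P(T = 1 \mid I_S = 1) \, P(I_S = 1)}
```

The bound for the relative risk therefore does not depend on $p_1$. It becomes very large when $p_0$, the treated fraction or the selected fraction is small.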

The assumption-free bounds for the other three estimands are found in similar ways as for the relative risk in the total population. The expressions for the assumption-free bounds are presented in Table 5. In these expressions, it is clear that the assumption-free bounds for the risk difference estimands cannot take on values larger than 2. The expressions for the minimal estimands, $\beta_D^{\min}$, $\beta_R^{S,\min}$ and $\beta_D^{S,\min}$, are found in Appendix C.

Table 5: The assumption-free bounds for $\beta_R$, $\beta_D$, $\beta_{RS}$ and $\beta_{DS}$.

| Estimand | Assumption-free bound |
|---|---|
| $\beta_R$ | $\tilde{B}(\beta_R) = P(Y(0)=1)_{\max} \,/\, \left[P(Y=1 \mid T=0, I_S=1)\,P(T=1 \mid I_S=1)\,P(I_S=1)\right]$ |
| $\beta_D$ | $\tilde{B}(\beta_D) = P(Y(0)=1)_{\max} + P(Y=1 \mid T=1, I_S=1)\left[1 - P(T=1 \mid I_S=1)\,P(I_S=1)\right] - P(Y=1 \mid T=0, I_S=1)$ |
| $\beta_{RS}$ | $\tilde{B}(\beta_{RS}) = P(Y(0)=1 \mid I_S=1)_{\max} \,/\, \left[P(Y=1 \mid T=0, I_S=1)\,P(T=1 \mid I_S=1)\right]$ |
| $\beta_{DS}$ | $\tilde{B}(\beta_{DS}) = P(Y(0)=1 \mid I_S=1)_{\max} + P(Y=1 \mid T=1, I_S=1)\left[1 - P(T=1 \mid I_S=1)\right] - P(Y=1 \mid T=0, I_S=1)$ |

Note: $P(Y(0)=1)_{\max} = \min\left[P(T=1 \mid I_S=1)\,P(I_S=1) + 2P(I_S=0) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1)\,P(I_S=1),\ 1\right]$

and $P(Y(0)=1 \mid I_S=1)_{\max} = \min\left[P(T=1 \mid I_S=1) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1),\ 1\right]$.
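The four expressions in Table 5 only need four observed probabilities, so they are straightforward to evaluate. The following Python sketch is not the authors' R implementation; the class name `Observed`, the function `af_bounds` and the dictionary keys are hypothetical names chosen for illustration. The inputs are the probabilities used for the Zika virus illustration in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observed:
    """Observed probabilities in the selected sample (illustrative names)."""
    p_y1_t1: float  # P(Y=1 | T=1, I_S=1)
    p_y1_t0: float  # P(Y=1 | T=0, I_S=1)
    p_t1: float     # P(T=1 | I_S=1)
    p_sel: float    # P(I_S=1)


def af_bounds(o: Observed) -> dict:
    """Evaluate the assumption-free bounds of Table 5 for the four estimands."""
    p_t0 = 1.0 - o.p_t1
    # Largest possible P(Y(0)=1) in the total population, capped at 1.
    y0_max_total = min(
        o.p_t1 * o.p_sel + 2.0 * (1.0 - o.p_sel) + o.p_y1_t0 * p_t0 * o.p_sel,
        1.0,
    )
    # Largest possible P(Y(0)=1 | I_S=1) in the selected subpopulation.
    y0_max_sub = min(o.p_t1 + o.p_y1_t0 * p_t0, 1.0)
    return {
        "RR_total": y0_max_total / (o.p_y1_t0 * o.p_t1 * o.p_sel),
        "RD_total": y0_max_total + o.p_y1_t1 * (1.0 - o.p_t1 * o.p_sel) - o.p_y1_t0,
        "RR_sub": y0_max_sub / (o.p_y1_t0 * o.p_t1),
        "RD_sub": y0_max_sub + o.p_y1_t1 * (1.0 - o.p_t1) - o.p_y1_t0,
    }


zika = af_bounds(Observed(p_y1_t1=0.065, p_y1_t0=0.001, p_t1=0.1, p_sel=0.8))
for name, value in zika.items():
    print(f"{name}: {value:.2f}")
```

After rounding, these inputs give 6009 and 0.54 for the total population and 1009 and 0.16 for the subpopulation, the values reported for the Zika example.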

The assumption-free bounds are based on the most pessimistic value the causal estimand can take, i.e. the maximum bias that can occur due to selection. Therefore, any values of the SV bounds larger than the assumption-free bounds are infeasible. It is worth noting that the assumption-free bounds can be either smaller or larger than the SV bounds. When the SV bound is the smaller of the two, a less pessimistic bound is obtained by taking knowledge about an unknown variable U into account, i.e. under Assumptions 2 or 3.

Example 2

Zika virus and microcephaly cont.

Here we illustrate the assumption-free bound with the Zika virus and microcephaly example used previously in the text. As before, we set the probabilities $P(Y=1 \mid T=1, I_S=1) = 0.065$ and $P(Y=1 \mid T=0, I_S=1) = 0.001$. Furthermore, let $P(T=1 \mid I_S=1) = 0.1$ and $P(I_S=1) = 0.8$. The assumption-free bounds for the relative risk and risk difference in the total population are then $\tilde{B}(\beta_R) = 6009$ and $\tilde{B}(\beta_D) = 0.54$, respectively. For the subpopulation, the assumption-free bounds are $\tilde{B}(\beta_{RS}) = 1009$ and $\tilde{B}(\beta_{DS}) = 0.16$.

Comparing these values to $B(\cdot)$ in the single-selection case, we note that $\tilde{B}(\cdot)$ is larger for the relative risk estimands and for the risk difference in the subpopulation. Thus, the bounds in Example 1 are feasible values for the selection bias, and taking the extra knowledge about U into account creates a tighter bound. However, $\tilde{B}(\beta_D)$ is smaller than $B(\beta_D)$. Thus, $B(\beta_D)$ in Example 1 is not a feasible value for the bias, and using the extra knowledge about U does not increase the precision about the selection bias. □
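The feasibility comparison amounts to checking whether an SV bound exceeds the corresponding assumption-free bound. The Python sketch below applies that rule to the bounds reported for the first selection in Table 11 (Appendix D); the function name `sv_is_feasible` and the dictionary layout are hypothetical and only serve the illustration.

```python
# (SV bound, assumption-free bound) after the first selection, copied from Table 11.
reported = {
    "RR, total population": (1.51, 6009.0),
    "RD, total population": (1.21, 0.54),
    "RR, subpopulation": (1.44, 1009.0),
    "RD, subpopulation": (0.02, 0.16),
}


def sv_is_feasible(sv_bound: float, af_bound: float) -> bool:
    """An SV bound above the assumption-free bound cannot be attained by the bias."""
    return sv_bound <= af_bound


feasible = {name: sv_is_feasible(sv, af) for name, (sv, af) in reported.items()}
for name, ok in feasible.items():
    print(f"{name}: SV bound is {'feasible' if ok else 'infeasible'}")
```

Only the risk difference in the total population is flagged as infeasible, in line with the discussion in the example.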

Applications

In this section we illustrate the SV and assumption-free bounds using data from a case-control study on risk factors for type 1 diabetes (Waernbaum, Dahlquist, and Lind 2019). To provide further guidance for practicing statisticians, we developed a numerical learner that can be used to calculate the SV and assumption-free bounds for multiple inclusion criteria. The R code and accompanying material for the numerical learner are available in the GitHub repository: https://github.com/stizet/SelectionBias.

Data example: effect of pre-term birth for type 1 diabetes

In our motivating example, we study perinatal risk factors for type 1 diabetes mellitus using the Swedish Childhood Diabetes Register (SCDR). The register has a case-control design, and cases and controls are linked to the Swedish Medical Birth Register (MBR) and the National Patient Register (NPR). It includes 14,949 children with type 1 diabetes onset at ages 0–14 years matched with 55,712 controls. In the original paper, effects of several risk factors for type 1 diabetes are investigated. Here, we use the binary variable very preterm delivery (gestational length < 224 days) as the treatment variable. The study population is restricted to include Nordic mothers ($S_1$), singleton births ($S_2$) and non-diabetic mothers ($S_3$), see Figure 5. Thus, these three variables are the selection variables. A causal model for the selection variables and related variables is displayed in Figure 6. In the model, we assume that the first selection variable, mother’s nationality, is exogenously given but affects a genetic factor. We assume that the second selection, singleton births, could be caused by IVF and parity. For the third selection, mother’s diabetes status, socioeconomic status (SES) and genes are hypothesized causes (Hoekstra et al. 2008; Waldhoer et al. 2008). The complete structure, including treatment and outcome, is shown in Figure 6.

Figure 5: 
Flow chart illustrating the selections for the type 1 diabetes example.

Figure 6: 
DAG displaying a causal model for the type 1 diabetes example following the notation in Assumptions 2 and 3.

The bias bound parameters and the required probabilities for the SV bound and the assumption-free bounds are given in Table 6. The bias bound parameters for the SV bounds are suggested on the basis of previous research in the respective areas. For the assumption-free bound, all probabilities are known from the data except $P(Y=1 \mid T=t, I_S=1)$, since we have a case-control study. The incidence of type 1 diabetes in children is investigated in Berhan et al. (2011), and based on this, we use $P(Y=1 \mid T=0, I_S=1) = 0.00025$ for all three selections.

Table 6: The necessary parameters and probabilities for the SV bound and the assumption-free (AF) bound for the type 1 diabetes example for the total population and subpopulation.

| Bias bound parameters | S1: nationality | S2: singleton births | S3: non diabetic mothers (a) |
|---|---|---|---|
| $RR_{UY \mid T=1}$ | 1.2 | 1.5 | 1.5 |
| $RR_{UY \mid T=0}$ | 1.2 | 1.5 | 1.5 |
| $RR_{SU \mid T=1}$ | 1.8 | 2.0 | 2.4 |
| $RR_{SU \mid T=0}$ | 1.8 | 2.2 | 2.2 |
| $RR_{UY \mid I_S=1}$ | | 1.3 | 1.6 |
| $RR_{TU \mid I_S=1}$ | | 2.0 | 2.3 |
| $P(I_S=1)$ | 0.913 | 0.892 | 0.885 |
| $P(T=1 \mid I_S=1)$ | 0.006 | 0.005 | 0.995 |
| $P(Y=1 \mid T=0, I_S=1)$ | 0.00025 | 0.00025 | 0.00013 |

(a) The treatment was reversed after the third selection.

Due to the case-control design, we retrieve odds ratios from conditional logistic regression models. However, since type 1 diabetes is a rare outcome, we assume that the odds ratios approximate the relative risk, $\beta_R^{\mathrm{obs}}$. The odds ratios and the SV and assumption-free bounds for both the total population and the subpopulation are shown in Table 7.
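The rare-outcome approximation can be checked numerically. In the Python sketch below, the baseline risk 0.00025 is the incidence used in this example, while the doubled risk in the exposed group and the "common outcome" pair are hypothetical values chosen only to contrast the two situations.

```python
def relative_risk(p_exposed: float, p_unexposed: float) -> float:
    """Ratio of outcome probabilities."""
    return p_exposed / p_unexposed


def odds_ratio(p_exposed: float, p_unexposed: float) -> float:
    """Ratio of outcome odds."""
    return (p_exposed / (1.0 - p_exposed)) / (p_unexposed / (1.0 - p_unexposed))


# Rare outcome: baseline risk from the example, exposed risk hypothetical.
rare = (0.0005, 0.00025)
# Common outcome: hypothetical risks with the same relative risk of 2.
common = (0.5, 0.25)

rr_rare, or_rare = relative_risk(*rare), odds_ratio(*rare)
rr_common, or_common = relative_risk(*common), odds_ratio(*common)
print(f"rare:   RR = {rr_rare:.4f}, OR = {or_rare:.4f}")
print(f"common: RR = {rr_common:.4f}, OR = {or_common:.4f}")
```

With the rare outcome the odds ratio and the relative risk are practically indistinguishable, whereas for the common outcome the odds ratio is 3 although the relative risk is 2.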

Table 7: The causal odds ratios and the two bounds for the type 1 diabetes example for the total population and subpopulation.

| | S1: nationality | S2: singleton births | S3: non diabetic mothers (a) |
|---|---|---|---|
| Odds ratio, $\beta_R^{\mathrm{obs}}$ | 0.59 | 0.59 | 1.89 |
| SV in tot. pop., $B(\beta_R)$ | 1.17 | 1.47 | 1.52 |
| AF in tot. pop., $\tilde{B}(\beta_R)$ | 122,494 | 206,648 | 8,569 |
| SV in subpop., $B(\beta_{RS})$ | | 1.13 | 1.27 |
| AF in subpop., $\tilde{B}(\beta_{RS})$ | | 4,208 | 7,547 |

(a) The treatment was reversed after the third selection. In the original coding $\beta_R^{\mathrm{obs}} = 0.53$.
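The SV bound formulas are not restated in this section. The Python sketch below assumes the bounding-factor form of Smith and VanderWeele (2019), BF = RR_UY · RR_XU / (RR_UY + RR_XU − 1), with the total-population bound taken as the product of the two treatment-specific factors and the subpopulation bound as a single factor. Under that assumption, the SV rows of Table 7 follow from the parameters in Table 6. All function and variable names are hypothetical.

```python
def bounding_factor(rr_uy: float, rr_xu: float) -> float:
    """Bounding factor RR_UY * RR_XU / (RR_UY + RR_XU - 1), assumed form."""
    return rr_uy * rr_xu / (rr_uy + rr_xu - 1.0)


def sv_rr_total(rr_uy_t1, rr_su_t1, rr_uy_t0, rr_su_t0) -> float:
    """Assumed SV bound for the relative risk in the total population."""
    return bounding_factor(rr_uy_t1, rr_su_t1) * bounding_factor(rr_uy_t0, rr_su_t0)


def sv_rr_sub(rr_uy_sel, rr_tu_sel) -> float:
    """Assumed SV bound for the relative risk in the selected subpopulation."""
    return bounding_factor(rr_uy_sel, rr_tu_sel)


# Table 6 columns: (RR_UY|T=1, RR_SU|T=1, RR_UY|T=0, RR_SU|T=0).
params_total = {
    "S1": (1.2, 1.8, 1.2, 1.8),
    "S2": (1.5, 2.0, 1.5, 2.2),
    "S3": (1.5, 2.4, 1.5, 2.2),
}
# Table 6 columns: (RR_UY|I_S=1, RR_TU|I_S=1).
params_sub = {"S2": (1.3, 2.0), "S3": (1.6, 2.3)}

sv_total = {k: sv_rr_total(*v) for k, v in params_total.items()}
sv_sub = {k: sv_rr_sub(*v) for k, v in params_sub.items()}
print({k: round(v, 2) for k, v in sv_total.items()})
print({k: round(v, 2) for k, v in sv_sub.items()})
```

Rounded to two decimals, the results agree with the SV rows reported in Table 7 (1.17, 1.47, 1.52 and 1.13, 1.27).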

For the SV bounds to be valid, the bias must be positive, i.e. the observed relative risk must be greater than the causal relative risk.

After the first selection, Nordic mothers, the probability of diabetes increases in both treatment groups due to the well-known genetic excess risk in the Nordic population (Patterson et al. 2009). If the probabilities change by a constant factor, the causal relative risk does not change. We do however take a conservative stand and construct the bound as if the bias is positive. Using the bias bound parameters in Table 6, the SV bound for the total population is 1.17. This means that the causal relative risk is at most shifted to 0.59/1.17=0.50. The assumption-free bound after the first selection is equal to 122,494, which is too large to be informative. Since the assumption-free bound is a ratio of probabilities, it gets very large if the probabilities in its denominator are very small, as they are in this case. In our causal model, nationality is not a collider, and thus there is no bias in the subpopulation after the first selection.

After the second selection, singleton births, we again make a conservative assumption and construct the bias bounds as if the bias is positive. The SV bound for the total population has now increased to 1.47, and thus the causal relative risk is at most shifted to 0.59/1.47=0.40. Once more, the assumption-free bound is not informative. In the subpopulation, the SV bound is instead 1.13, and thus the causal relative risk is at most shifted to 0.59/1.13=0.52. The assumption-free bound for the subpopulation is still too large to be informative.

After the third selection, non diabetic mothers, the observed estimand is 0.53. However, we believe that the observed estimand underestimates the causal estimand. This is because mothers with type 1 diabetes are removed, and therefore the remaining group has a different socioeconomic status, which has a stronger treatment effect. Hence, we reverse the treatment coding and assume a positive bias. Subsequently, the observed estimand is 1/0.53=1.89 and the SV bound is 1.52, which at most alters the causal relative risk to 1.89/1.52=1.24. In the original treatment coding, this is a relative risk of 1/1.24=0.80, i.e. the relative risk is at least 0.80. The assumption-free bound is substantially lower due to the change in coding, but still not useful. For the subpopulation, the SV bound is 1.27, which at most alters the causal estimand to 1.89/1.27=1.49. In the original treatment coding, this is a relative risk of 1/1.49=0.67, i.e. the relative risk is at least 0.67. Again, the assumption-free bound is much larger.

In this application, the assumption-free bound is large due to the small probabilities of preterm birth and type 1 diabetes, but in Appendix D, we include an example where the assumption-free bounds are more informative.
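The arithmetic in this walk-through follows one rule: divide the observed estimate by the bound, and for the reversed coding invert the result to return to the original coding. A minimal Python sketch, using only the numbers quoted above (the function and dictionary names are hypothetical):

```python
def shifted_rr(observed_rr: float, bound: float) -> float:
    """Causal relative risk if a positive bias attains the bound exactly."""
    return observed_rr / bound


# (observed estimate, SV bound) pairs quoted in the text.
total = {"S1": (0.59, 1.17), "S2": (0.59, 1.47), "S3": (1.89, 1.52)}
sub = {"S2": (0.59, 1.13), "S3": (1.89, 1.27)}

shifted_total = {k: shifted_rr(*v) for k, v in total.items()}
shifted_sub = {k: shifted_rr(*v) for k, v in sub.items()}

# S3 uses the reversed treatment coding; invert to get back to the original coding.
s3_total_original = 1.0 / shifted_total["S3"]
s3_sub_original = 1.0 / shifted_sub["S3"]

print({k: round(v, 2) for k, v in shifted_total.items()})
print({k: round(v, 2) for k, v in shifted_sub.items()})
print(round(s3_total_original, 2), round(s3_sub_original, 2))
```

The inversion is applied to the unrounded ratio, which is why the total-population value rounds to 0.80.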

Numerical learner

For the four estimands, $\beta_R$, $\beta_D$, $\beta_{RS}$ and $\beta_{DS}$, we have developed a numerical learner as a tool to calculate the SV and assumption-free bounds. It is constructed for binary variables, and dependencies between variables are generated using probit or logit models, as chosen by the user. Furthermore, it is developed for the generalized M-structure in Figure 2 with any number of selection variables. The numerical learner does not use the bias bound parameters directly, since they can be difficult to provide as the number of selection variables increases, but instead uses probit/logit models specified by the user. From these models, the bias bound parameters and necessary probabilities are calculated, and the SV and assumption-free bounds are derived. The assumption-free bounds can also be calculated for an observed data set. In that case, no model assumptions, such as the generalized M-structure, are necessary.

In the code, the user inputs the parameters for the probit/logit models and the estimand of interest. The numerical learner first checks that the bias is positive and thereafter calculates the SV and assumption-free bound. The code can be repeatedly executed with different model parameter values to evaluate a possible range of bounds.

We demonstrate the numerical learner with examples in the following. First, the SV and assumption-free bounds for the relative risk and the risk difference in the total population are illustrated. Second, we illustrate the bounds for the relative risk and the risk difference in the selected subpopulation. The parameter values in the probit models used in the illustrations are given in Table 8.

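Table 8: Example designs for the numerical learner, $\alpha_V^T, \alpha_U^T \in (0.2, 1.0)$ and $\alpha_V^S, \alpha_U^S \in (0.7, 1.5)$.

| Models | Class | Linear predictor | Parameter values |
|---|---|---|---|
| Design tot. pop. | | | |
| $P(V=1)$ | | | 0.5 |
| $P(U=1)$ | | | 0.5 |
| $P(T=1 \mid V)$ | Probit | V | (−0.2, 0.5) |
| $P(Y=1 \mid T, U)$ | Probit | T, U | (−0.6, 0.5, 1.0) |
| $P(I_S=1 \mid V, U)$ | Probit | V, U | (0.3, $\alpha_V^T$, $\alpha_U^T$) |
| Design subpop. | | | |
| $P(V=1)$ | | | 0.5 |
| $P(U=1)$ | | | 0.5 |
| $P(T=1 \mid V)$ | Probit | V | (−0.2, 1.5) |
| $P(Y=1 \mid T, U)$ | Probit | T, U | (−0.6, 0.3, 1.9) |
| $P(I_S=1 \mid V, U)$ | Probit | V, U | (1.0, $\alpha_V^S$, $\alpha_U^S$) |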
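To make the total-population design concrete, the Python sketch below evaluates it exactly by enumerating the binary V and U. It is not the authors' R code. It assumes that each parameter vector in Table 8 lists the intercept followed by the coefficients of the linear predictor, that V affects only T and the selection, and that U affects only Y and the selection (an M-structure with a single selection variable). The function names `phi` and `design_total` are hypothetical.

```python
import math
from itertools import product


def phi(x: float) -> float:
    """Standard normal CDF, used as the probit link."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def design_total(alpha_v: float, alpha_u: float) -> dict:
    """Exact bias and assumption-free bound for the relative risk (total population)."""
    p_sel = p_t1_sel = p_y1_t1_sel = p_y1_t0_sel = 0.0
    p_y_do1 = p_y_do0 = 0.0
    for v, u in product((0, 1), repeat=2):
        w = 0.5 * 0.5                               # P(V=v) P(U=u)
        t1 = phi(-0.2 + 0.5 * v)                    # P(T=1 | V=v)
        s1 = phi(0.3 + alpha_v * v + alpha_u * u)   # P(I_S=1 | V=v, U=u)
        y_t1 = phi(-0.6 + 0.5 + 1.0 * u)            # P(Y=1 | T=1, U=u)
        y_t0 = phi(-0.6 + 1.0 * u)                  # P(Y=1 | T=0, U=u)
        p_sel += w * s1                             # P(I_S=1)
        p_t1_sel += w * s1 * t1                     # P(T=1, I_S=1)
        p_y1_t1_sel += w * s1 * t1 * y_t1           # P(Y=1, T=1, I_S=1)
        p_y1_t0_sel += w * s1 * (1.0 - t1) * y_t0   # P(Y=1, T=0, I_S=1)
        p_y_do1 += w * y_t1                         # P(Y(1)=1)
        p_y_do0 += w * y_t0                         # P(Y(0)=1)
    p_t0_sel = p_sel - p_t1_sel
    p_y1_given_t0 = p_y1_t0_sel / p_t0_sel          # P(Y=1 | T=0, I_S=1)
    rr_obs = (p_y1_t1_sel / p_t1_sel) / p_y1_given_t0
    rr_true = p_y_do1 / p_y_do0
    y0_max = min(p_t1_sel + 2.0 * (1.0 - p_sel) + p_y1_t0_sel, 1.0)
    af_bound = y0_max / (p_y1_given_t0 * p_t1_sel)
    return {"p_sel": p_sel, "bias": rr_obs / rr_true, "af_bound": af_bound}


weak = design_total(alpha_v=0.2, alpha_u=0.2)
strong_u = design_total(alpha_v=0.2, alpha_u=1.0)
for label, res in (("alpha_U = 0.2", weak), ("alpha_U = 1.0", strong_u)):
    print(label, {k: round(x, 3) for k, x in res.items()})
```

Repeating the call over a grid of selection parameters gives the kind of comparison that the learner is described as producing. Under the stated assumptions, the true bias never exceeds the assumption-free bound, whatever the parameter values.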

Total population

In Figure 7A, we present $B(\beta_R)$ and $\tilde{B}(\beta_R)$ for different values of the parameters controlling the selection probability, $\alpha_U^T$ and $\alpha_V^T$, representing weak and moderate selection. Similarly, in Figure 7B, we present $B(\beta_D)$ and $\tilde{B}(\beta_D)$ for different values of $\alpha_U^T$ and $\alpha_V^T$. For the relative risk, SV’s bound is smaller than the assumption-free bound for all combinations of $\alpha_V^T$ and $\alpha_U^T$. However, for the risk difference the assumption-free bound is smaller than SV’s bound for all combinations of $\alpha_V^T$ and $\alpha_U^T$. This implies that the SV bound gives feasible values for the bias for the relative risk, but it does not give feasible values for the bias for the risk difference. Furthermore, we note that $B(\cdot)$ increases as $\alpha_U$ increases, for both estimands. This is due to the influence of U on the bias bound parameters; as the influence increases, the bias bound parameters increase. However, $\tilde{B}(\cdot)$ decreases as $\alpha_U$ increases, for both estimands. The reason is that the overall probability of being included in the sample increases and thus the bound decreases.

Figure 7: 
Illustration of the assumption-free (AF) and SV bounds in the numerical learner for the total population with the design in Table 8 for (A) the relative risk and (B) the risk difference.

Subpopulation

In Figure 8A, we present $B(\beta_{RS})$ and $\tilde{B}(\beta_{RS})$ for different values of the parameters affecting the selection probability, $\alpha_U^S$ and $\alpha_V^S$. Similarly, in Figure 8B, we present $B(\beta_{DS})$ and $\tilde{B}(\beta_{DS})$ for different values of $\alpha_U^S$ and $\alpha_V^S$. For both estimands, SV’s bound is smaller than the assumption-free bound for all combinations of $\alpha_V^S$ and $\alpha_U^S$. Furthermore, we note that $B(\cdot)$ increases as $\alpha_U$ increases, for both estimands. This is due to the influence of U on the bias bound parameters; as the influence increases, the bias bound parameters increase. As opposed to the assumption-free bound in the total population, $\tilde{B}(\cdot)$ increases as $\alpha_U$ increases, for both estimands. The reason for the difference between the two populations is that the overall probability of being included in the sample is not a component of the assumption-free bounds for the subpopulation.

Figure 8: 
Illustration of the AF and SV bounds in the numerical learner for the selected subpopulation with the design in Table 8 for (A) the relative risk and (B) the risk difference.

Discussion

In the causal inference literature, selection bias is considered for two cases: (i) estimands describing the total population and (ii) estimands describing the selected subpopulation. In the first case, bias can arise whenever the selections alter the estimand, which can occur after selection on any variable. However, if the target estimand describes the selected subpopulation, selection bias arises when the estimand is not identified from the observed data, which occurs when the selections involve colliders.

For the purpose of performing sensitivity analyses for selection bias, we investigate an extension of the bounds derived by Smith and VanderWeele (2019). In line with an assumption-free bound for unmeasured confounding (Sjölander 2020), we propose an assumption-free bound for selection bias that can be calculated using the data at hand. The proposed assumption-free bounds can be either larger or smaller than SV’s bounds, and thereby indicate whether SV’s bounds are feasible. In other words, the assumption-free bounds can be more informative than SV’s bounds. The bounds are applied when investigating the effect of preterm birth on the risk of type 1 diabetes using data from a population-based incidence register for type 1 diabetes mellitus (Waernbaum, Dahlquist, and Lind 2019). In the example, we find that if the probability of the outcome is small, and/or the probability of treatment is small, the assumption-free bounds for the relative risk parameters are not informative. We provide the reader with a numerical learner, and demonstrate the magnitude of the bounds for different values of model parameters that control the dependencies in the data generating process. The numerical learner can be used by practitioners to perform sensitivity analysis by assessing the potential influence of selection bias on the conclusions of their studies.

The bounds discussed and presented here can provide guidance in a sensitivity analysis, but there are some limitations. First, the bounds are constructed within the strata of the confounders, i.e. there is no bound for the selection bias of the marginal treatment effects. Second, only bounds for binary outcome variables are considered here whereas in practice discrete and continuous outcomes are possible. The selection variables in the numerical learner are binary, although in practice selection often may come from e.g. a cut-off point of a continuous variable. To treat the current limitations, extensions in the numerical learner are readily implemented. Finally, the bounds presented only consider selection bias, although in practice other biases may also be present. Recently, a bound when there are several sources of bias present was presented by Smith, Mathur, and VanderWeele (2021), and further comparisons with it may be of interest.


Corresponding author: Stina Zetterstrom, Department of Statistics, Uppsala University, Box 513, Uppsala, 751 20, Sweden, E-mail:

Funding source: Vetenskapsrådet

Award Identifier / Grant number: 2016-00703

Acknowledgments

The authors thank Ronnie Pingel for valuable comments.

  1. Research funding: Swedish Research Council, grant No 2016-00703.

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Competing interests: Authors state no conflict of interest.

  4. Informed consent: Not applicable.

  5. Ethical approval: The project was approved by The Regional Ethics Committee in Umeå, 2015-44-32M.

Appendix A: Sampling variability and bias

Ignoring sampling variability can be interpreted as assessing the sample means as an approximation of the corresponding asymptotic means. For example, consider $\beta_R^{\mathrm{obs}} = P(Y=1 \mid T=1, I_S=1) / P(Y=1 \mid T=0, I_S=1)$. Here, the convergence of the sample mean to the population expectation follows directly from a WLLN for an iid sample $(T_i, X_i, Y_i, I_{S,i})$,

$$
\frac{1}{n} \sum_{i:\, T=t,\, I_S=1} Y_i \xrightarrow{p} P(Y=1 \mid T=t, I_S=1),
$$

for t=0, 1, yielding the numerator and denominator in $\beta_R^{\mathrm{obs}}$. Defining

$$
\hat{\beta}_R^{\mathrm{obs}} = \frac{1/n \sum_{i:\, T=1,\, I_S=1} Y_i}{1/n \sum_{i:\, T=0,\, I_S=1} Y_i}
$$

and

$$
\hat{\beta}_D^{\mathrm{obs}} = 1/n \sum_{i:\, T=1,\, I_S=1} Y_i - 1/n \sum_{i:\, T=0,\, I_S=1} Y_i,
$$

we have by standard convergence results, e.g. using Slutsky’s theorem, that

$$
\hat{\beta}^{\mathrm{obs}} \xrightarrow{p} \beta^{\mathrm{obs}}.
$$
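The convergence can be illustrated with a small simulation. The Python sketch below uses hypothetical probabilities (selection 0.8, treatment 0.5, outcome 0.3 among the treated and 0.1 among the untreated), and it reads n in the sample means as the number of selected units in treatment arm t, an interpretation assumed for this sketch. The function name `simulate_rr_obs` is hypothetical.

```python
import random


def simulate_rr_obs(n: int, seed: int = 1) -> float:
    """Sample analogue of the observed relative risk among selected units."""
    rng = random.Random(seed)
    events = {0: 0, 1: 0}   # number of outcomes per treatment arm
    counts = {0: 0, 1: 0}   # number of selected units per treatment arm
    for _ in range(n):
        t = 1 if rng.random() < 0.5 else 0
        selected = rng.random() < 0.8
        y = 1 if rng.random() < (0.3 if t == 1 else 0.1) else 0
        if selected:
            counts[t] += 1
            events[t] += y
    return (events[1] / counts[1]) / (events[0] / counts[0])


estimate = simulate_rr_obs(200_000)
print(f"estimated relative risk: {estimate:.3f} (population value 3.0)")
```

With 200,000 simulated units the estimate lies close to the population ratio 0.3/0.1 = 3; how close depends on the seed and the sample size.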

Appendix B: Derivatives of the SV bounds with respect to the selection variables

Table 9: The partial derivatives with respect to $I_S$ for the SV bounds of $\beta_R$, $\beta_D$, $\beta_{RS}$ and $\beta_{DS}$ (a), under the assumption that the effect of U is constant. A prime denotes the partial derivative with respect to s, e.g. $RR'_{SU \mid T=1} = \partial RR_{SU \mid T=1} / \partial s$.

$$
\frac{\partial}{\partial s} B(\beta_R) = \frac{RR'_{SU \mid T=1}\, RR_{UY \mid T=1}\,(RR_{UY \mid T=1} - 1)}{(RR_{UY \mid T=1} + RR_{SU \mid T=1} - 1)^2}\, BF_0 + \frac{RR'_{SU \mid T=0}\, RR_{UY \mid T=0}\,(RR_{UY \mid T=0} - 1)}{(RR_{UY \mid T=0} + RR_{SU \mid T=0} - 1)^2}\, BF_1
$$

$$
\begin{aligned}
\frac{\partial}{\partial s} B(\beta_D) ={}& \frac{RR'_{SU \mid T=1}\, RR_{UY \mid T=1}\,(RR_{UY \mid T=1} - 1)}{(RR_{UY \mid T=1} + RR_{SU \mid T=1} - 1)^2} + \frac{RR'_{SU \mid T=1}\,(RR_{UY \mid T=1} - 1)}{RR_{UY \mid T=1}\, RR_{SU \mid T=1}^2}\, P(Y=1 \mid T=1, I_S=1) \\
&+ \frac{RR'_{SU \mid T=0}\, RR_{UY \mid T=0}\,(RR_{UY \mid T=0} - 1)}{(RR_{UY \mid T=0} + RR_{SU \mid T=0} - 1)^2}\, P(Y=1 \mid T=0, I_S=1) \\
&+ P'(Y=1 \mid T=0, I_S=1)\, BF_0 - P'(Y=1 \mid T=1, I_S=1) / BF_1
\end{aligned}
$$

$$
\frac{\partial}{\partial s} B(\beta_{RS}) = \frac{RR'_{UY \mid I_S=1}\, RR_{TU \mid I_S=1}\,(RR_{TU \mid I_S=1} - 1)}{(RR_{UY \mid I_S=1} + RR_{TU \mid I_S=1} - 1)^2} + \frac{RR'_{TU \mid I_S=1}\, RR_{UY \mid I_S=1}\,(RR_{UY \mid I_S=1} - 1)}{(RR_{UY \mid I_S=1} + RR_{TU \mid I_S=1} - 1)^2}
$$

$$
\begin{aligned}
\frac{\partial}{\partial s} B(\beta_{DS}) ={}& \frac{\partial}{\partial s}\, P(Y=1 \mid T=0, I_S=1)\,(BF_U - 1) \\
={}& P(Y=1 \mid T=0, I_S=1) \left[ \frac{RR'_{UY \mid I_S=1}\, RR_{TU \mid I_S=1}\,(RR_{TU \mid I_S=1} - 1)}{(RR_{UY \mid I_S=1} + RR_{TU \mid I_S=1} - 1)^2} + \frac{RR'_{TU \mid I_S=1}\, RR_{UY \mid I_S=1}\,(RR_{UY \mid I_S=1} - 1)}{(RR_{UY \mid I_S=1} + RR_{TU \mid I_S=1} - 1)^2} \right] \\
&+ P'(Y=1 \mid T=0, I_S=1)\, \frac{(RR_{UY \mid I_S=1} - 1)(RR_{TU \mid I_S=1} - 1)}{RR_{UY \mid I_S=1} + RR_{TU \mid I_S=1} - 1}
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial}{\partial s} B(\beta_{DS}) ={}& \frac{\partial}{\partial s}\, P(Y=1 \mid T=1, I_S=1)\,(1 - 1/BF_U) \\
={}& P(Y=1 \mid T=1, I_S=1) \left[ \frac{RR'_{UY \mid I_S=1}\,(RR_{TU \mid I_S=1} - 1)}{RR_{UY \mid I_S=1}^2\, RR_{TU \mid I_S=1}} + \frac{RR'_{TU \mid I_S=1}\,(RR_{UY \mid I_S=1} - 1)}{RR_{UY \mid I_S=1}\, RR_{TU \mid I_S=1}^2} \right] \\
&+ P'(Y=1 \mid T=1, I_S=1)\, \frac{(RR_{UY \mid I_S=1} - 1)(RR_{TU \mid I_S=1} - 1)}{RR_{UY \mid I_S=1}\, RR_{TU \mid I_S=1}}
\end{aligned}
$$

(a) The first derivative of $B(\beta_{DS})$ is the derivative if the first part of the bound is the maximum, and vice versa.
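The building block of these derivatives is the derivative of a bounding factor with respect to one of its arguments. The Python sketch below assumes the bounding-factor form BF = a·r / (a + r − 1), with a standing for an RR_UY parameter and r for an RR_SU or RR_TU parameter, and checks the closed form ∂BF/∂r = a(a − 1)/(a + r − 1)² against a central finite difference. The names are hypothetical and the values a = 2.0, r = 1.7 are taken from the Zika example.

```python
def bounding_factor(a: float, r: float) -> float:
    """Bounding factor a * r / (a + r - 1), assumed form."""
    return a * r / (a + r - 1.0)


def d_bf_d_r(a: float, r: float) -> float:
    """Closed-form partial derivative of the bounding factor with respect to r."""
    return a * (a - 1.0) / (a + r - 1.0) ** 2


def central_difference(f, x: float, h: float = 1e-6) -> float:
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


a, r = 2.0, 1.7
numeric = central_difference(lambda x: bounding_factor(a, x), r)
analytic = d_bf_d_r(a, r)
print(f"numeric = {numeric:.6f}, analytic = {analytic:.6f}")
```

By the chain rule, multiplying this partial derivative by the derivative of r with respect to s gives the first-order change of the bounding factor when only r depends on the selection.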

Appendix C: Assumption-free bounds

The assumption-free bounds are derived for both the total population and the subpopulation.

C.1 The total population

$\beta_R = P(Y(1)=1) / P(Y(0)=1)$ is minimized when P(Y(1)=1) is minimized and P(Y(0)=1) is maximized.

$$
\begin{aligned}
P(Y(1)=1) &= P(Y(1)=1 \mid T=1)\,P(T=1) + P(Y(1)=1 \mid T=0)\,P(T=0) \\
&\geq P(Y(1)=1 \mid T=1)\,P(T=1) = P(Y=1 \mid T=1)\,P(T=1) \\
&= P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)\,P(I_S=1) + P(Y=1 \mid T=1, I_S=0)\,P(T=1 \mid I_S=0)\,P(I_S=0) \\
&\geq P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)\,P(I_S=1).
\end{aligned}
$$

The smallest value for P(Y(1)=1) is $P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)\,P(I_S=1)$.

$$
\begin{aligned}
P(Y(0)=1) &= P(Y(0)=1 \mid T=1)\,P(T=1) + P(Y(0)=1 \mid T=0)\,P(T=0) \\
&\leq P(T=1) + P(Y(0)=1 \mid T=0)\,P(T=0) = P(T=1) + P(Y=1 \mid T=0)\,P(T=0) \\
&= P(T=1 \mid I_S=1)\,P(I_S=1) + P(T=1 \mid I_S=0)\,P(I_S=0) \\
&\quad + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1)\,P(I_S=1) + P(Y=1 \mid T=0, I_S=0)\,P(T=0 \mid I_S=0)\,P(I_S=0) \\
&\leq P(T=1 \mid I_S=1)\,P(I_S=1) + 2P(I_S=0) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1)\,P(I_S=1).
\end{aligned}
$$

The maximal value for P(Y(0)=1) is $P(T=1 \mid I_S=1)\,P(I_S=1) + 2P(I_S=0) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1)\,P(I_S=1)$, or 1, if the sum of probabilities exceeds 1. The smallest value for $\beta_R$ is therefore

$$
\beta_R \geq \beta_R^{\min} = \frac{P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)\,P(I_S=1)}{P(T=1 \mid I_S=1)\,P(I_S=1) + 2P(I_S=0) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1)\,P(I_S=1)}.
$$

Alternatively, $\beta_R \geq \beta_R^{\min} = P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)\,P(I_S=1)$ if the denominator is greater than 1.

For the risk difference in the total population, we have

$$
\beta_D^{\min} = P(Y(1)=1)_{\min} - P(Y(0)=1)_{\max} = P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)\,P(I_S=1) - \left[P(T=1 \mid I_S=1)\,P(I_S=1) + 2P(I_S=0) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1)\,P(I_S=1)\right].
$$

Alternatively, $\beta_D^{\min} = P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)\,P(I_S=1) - 1$ if the sum in brackets exceeds 1.
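The two inequality chains hold for any joint distribution of the potential outcomes, the treatment and the selection indicator, provided the observed outcome equals the potential outcome under the received treatment. The Python sketch below draws random joint distributions over (Y(0), Y(1), T, I_S) and checks the lower bound for P(Y(1)=1) and the upper bound for P(Y(0)=1). The helper names are hypothetical, and the check illustrates the derivation rather than proving it.

```python
import random
from itertools import product


def random_joint(rng: random.Random) -> dict:
    """Random joint distribution over binary (Y(0), Y(1), T, I_S)."""
    cells = list(product((0, 1), repeat=4))
    weights = [rng.random() for _ in cells]
    total = sum(weights)
    return {cell: w / total for cell, w in zip(cells, weights)}


def bounds_hold(joint: dict) -> bool:
    """Check P(Y(1)=1) >= lower bound and P(Y(0)=1) <= upper bound."""
    def prob(pred):
        return sum(p for cell, p in joint.items() if pred(*cell))

    # Counterfactual margins (not observable in practice).
    p_y1 = prob(lambda y0, y1, t, s: y1 == 1)
    p_y0 = prob(lambda y0, y1, t, s: y0 == 1)
    # Observable joint probabilities, using consistency Y = Y(T).
    p_sel = prob(lambda y0, y1, t, s: s == 1)
    p_t1_sel = prob(lambda y0, y1, t, s: t == 1 and s == 1)
    p_y_t1_sel = prob(lambda y0, y1, t, s: y1 == 1 and t == 1 and s == 1)
    p_y_t0_sel = prob(lambda y0, y1, t, s: y0 == 1 and t == 0 and s == 1)

    lower_y1 = p_y_t1_sel
    upper_y0 = min(p_t1_sel + 2.0 * (1.0 - p_sel) + p_y_t0_sel, 1.0)
    tol = 1e-12  # guard against floating-point rounding
    return p_y1 >= lower_y1 - tol and p_y0 <= upper_y0 + tol


rng = random.Random(2022)
all_ok = all(bounds_hold(random_joint(rng)) for _ in range(500))
print("bounds hold in all draws:", all_ok)
```

Here the product of conditional probabilities in the derivation is written as the equivalent joint probability, e.g. P(Y=1 | T=1, I_S=1) P(T=1 | I_S=1) P(I_S=1) = P(Y=1, T=1, I_S=1).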

C.2 The subpopulation

To derive an assumption-free bound for the relative risk in the selected population, it is $\beta_{RS} = P(Y(1)=1 \mid I_S=1) / P(Y(0)=1 \mid I_S=1)$ that is minimized. This occurs when $P(Y(1)=1 \mid I_S=1)$ is minimized and $P(Y(0)=1 \mid I_S=1)$ is maximized.

$$
\begin{aligned}
P(Y(1)=1 \mid I_S=1) &= P(Y(1)=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1) + P(Y(1)=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1) \\
&\geq P(Y(1)=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1) = P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1).
\end{aligned}
$$

$$
\begin{aligned}
P(Y(0)=1 \mid I_S=1) &= P(Y(0)=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1) + P(Y(0)=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1) \\
&\leq P(T=1 \mid I_S=1) + P(Y(0)=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1) \\
&= P(T=1 \mid I_S=1) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1),
\end{aligned}
$$

alternatively 1 if $P(T=1 \mid I_S=1) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1) > 1$.

The smallest value for $\beta_{RS}$ is therefore

$$
\beta_{RS} \geq \beta_{RS}^{\min} = \frac{P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)}{P(T=1 \mid I_S=1) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1)},
$$

or, alternatively, $\beta_{RS}^{\min} = P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1)$ if $P(Y(0)=1 \mid I_S=1)_{\max}$ is greater than 1.

For the risk difference,

$$
\beta_{DS}^{\min} = P(Y(1)=1 \mid I_S=1)_{\min} - P(Y(0)=1 \mid I_S=1)_{\max} = P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1) - \left[P(T=1 \mid I_S=1) + P(Y=1 \mid T=0, I_S=1)\,P(T=0 \mid I_S=1)\right],
$$

or $\beta_{DS}^{\min} = P(Y=1 \mid T=1, I_S=1)\,P(T=1 \mid I_S=1) - 1$ if $P(Y(0)=1 \mid I_S=1)_{\max}$ is greater than 1.

Appendix D: Comparison of SV and assumption-free bounds for the Zika virus example

Here, we compare the SV and assumption-free bounds for the extended Zika example with multiple selections, inspired by de Araújo et al. (2018). In SV, the bias bound parameters for the first selection, live and still births vs. terminated pregnancies, are assumed to be $RR_{UY \mid T=1} = RR_{UY \mid T=0} = 2.0$, $RR_{SU \mid T=1}(1) = 1.7$ and $RR_{SU \mid T=0}(1) = 1.5$, see Table 10. The SV bound for the relative risk in the total population after one selection is then $B(\beta_R(1)) = 1.51$, see Table 11. In coherence with these bias bound parameters we choose $RR_{TU \mid I_S=1}(1) = 2.55$ and $RR_{UY \mid I_S=1}(1) = 2$ such that $B(\beta_{RS}(1)) = 1.44$ after the first selection. To calculate the assumption-free bound, we further assume the values in Table 10, which results in $\tilde{B}(\beta_R(1)) = 6009$ for the total population and $\tilde{B}(\beta_{RS}(1)) = 1009$ in the subpopulation. Thus, the assumption-free bounds are uninformative after one selection.

However, a second selection, public vs. private hospitals, is made as well. Considering that, we assume that the bias bound parameters instead take the values $RR_{SU \mid T=1}(2) = 1.9$ and $RR_{SU \mid T=0}(2) = 1.3$, so that some category of socioeconomic status (SES) was up to 1.9 times more likely for women without an induced abortion giving birth at a public hospital among the Zika exposed, and that some SES category was up to 1.3 times more likely for women either with an induced abortion or giving birth at a private hospital among the unexposed. The SV bound for the relative risk after two selections is $B(\beta_R(2)) = 1.48$. Thus, assuming these values, the possible maximum of the selection bias decreases as the second selection is made. For the subpopulation, the SV bound is $B(\beta_{RS}(2)) = 1.42$, which is a small decrease compared to one selection. We further assume that the observed probabilities change when the second selection is made, see Table 10. The assumption-free bound for the total population after two selections is $\tilde{B}(\beta_R(2)) = 6720$ and in the subpopulation $\tilde{B}(\beta_{RS}(2)) = 1006$. This means that the assumption-free bound increases for the total population and decreases for the subpopulation, although both are still too large to be informative.

Table 10: The necessary parameters and probabilities for the SV bound and the AF bound for the Zika virus and microcephaly example for the total population and subpopulation.

| Bias bound parameters | S1: birth | S2: hospital |
|---|---|---|
| $RR_{UY \mid T=1}$ | 2.0 | 2.0 |
| $RR_{UY \mid T=0}$ | 2.0 | 2.0 |
| $RR_{SU \mid T=1}$ | 1.7 | 1.9 |
| $RR_{SU \mid T=0}$ | 1.5 | 1.3 |
| $RR_{UY \mid I_S=1}$ | 2.0 | 2.0 |
| $RR_{TU \mid I_S=1}$ | 2.55 | 2.47 |
| $P(I_S=1)$ | 0.8 | 0.7 |
| $P(T=1 \mid I_S=1)$ | 0.1 | 0.15 |
| $P(Y=1 \mid T=0, I_S=1)$ | 0.001 | 0.001 |
| $P(Y=1 \mid T=1, I_S=1)$ | 0.065 | 0.07 |

Table 11: The causal odds ratios and risk differences and the two bounds for the Zika virus and microcephaly example for the total population and subpopulation.

| | S1: birth | S2: hospital |
|---|---|---|
| Odds ratio, $\beta_R^{\mathrm{obs}}$ | 73.1 | 73.1 |
| SV in tot. pop., $B(\beta_R)$ | 1.51 | 1.48 |
| AF in tot. pop., $\tilde{B}(\beta_R)$ | 6,009 | 6,720 |
| SV in subpop., $B(\beta_{RS})$ | 1.44 | 1.42 |
| AF in subpop., $\tilde{B}(\beta_{RS})$ | 1,009 | 1,006 |
| Risk difference, $\beta_D^{\mathrm{obs}}$ | 0.064 | 0.069 |
| SV in tot. pop., $B(\beta_D)$ | 1.21 | 1.26 |
| AF in tot. pop., $\tilde{B}(\beta_D)$ | 0.54 | 0.77 |
| SV in subpop., $B(\beta_{DS})$ | 0.02 | 0.0004 |
| AF in subpop., $\tilde{B}(\beta_{DS})$ | 0.16 | 0.21 |

Since the original study is a case-control study, the risk difference cannot be estimated without further assumptions. We do however include calculations based on the assumed probabilities in Table 10. After one selection, the SV bound is $B(\beta_D(1)) = 1.21$ for the total population and $B(\beta_{DS}(1)) = 0.02$ for the subpopulation. The assumption-free bounds are $\tilde{B}(\beta_D(1)) = 0.54$ and $\tilde{B}(\beta_{DS}(1)) = 0.16$. Thus, the knowledge of SES does not increase the precision about the selection bias for the total population, but does so for the subpopulation. When both selections are considered, the SV bounds change to $B(\beta_D(2)) = 1.26$ and $B(\beta_{DS}(2)) = 0.0004$, i.e. the potential selection bias increases in the total population but decreases in the subpopulation. The assumption-free bounds after two selections increase to $\tilde{B}(\beta_D(2)) = 0.77$ and $\tilde{B}(\beta_{DS}(2)) = 0.21$. Thus, the assumption-free bound increases with the second selection but is still more informative than the SV bound for the total population, whereas the SV bound is still tighter for the subpopulation.
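The SV bound for the risk difference in the total population is not restated in this appendix. The Python sketch below assumes the form given by Smith and VanderWeele (2019), B(β_D) = BF_1 − P(Y=1 | T=1, I_S=1)/BF_1 + P(Y=1 | T=0, I_S=1)·BF_0, with bounding factors BF_t = RR_UY|T=t · RR_SU|T=t / (RR_UY|T=t + RR_SU|T=t − 1). Under that assumption, the two total-population SV values for the risk difference in Table 11 follow from the inputs in Table 10. The function names are hypothetical.

```python
def bounding_factor(rr_uy: float, rr_su: float) -> float:
    """Bounding factor RR_UY * RR_SU / (RR_UY + RR_SU - 1), assumed form."""
    return rr_uy * rr_su / (rr_uy + rr_su - 1.0)


def sv_rd_total(p_y1_t1, p_y1_t0, rr_uy_t1, rr_su_t1, rr_uy_t0, rr_su_t0) -> float:
    """Assumed SV bound for the risk difference in the total population."""
    bf1 = bounding_factor(rr_uy_t1, rr_su_t1)
    bf0 = bounding_factor(rr_uy_t0, rr_su_t0)
    return bf1 - p_y1_t1 / bf1 + p_y1_t0 * bf0


# Inputs from Table 10 for one and two selections.
sv_rd_one = sv_rd_total(0.065, 0.001, 2.0, 1.7, 2.0, 1.5)
sv_rd_two = sv_rd_total(0.07, 0.001, 2.0, 1.9, 2.0, 1.3)
print(round(sv_rd_one, 2), round(sv_rd_two, 2))
```

Both values exceed the corresponding assumption-free bounds of 0.54 and 0.77, which is the sense in which the SV bound is infeasible for the risk difference in the total population.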

References

Berhan, Y., I. Waernbaum, T. Lind, A. Möllsten, G. Dahlquist, and S. C. D. S. Group. 2011. “Thirty Years of Prospective Nationwide Incidence of Childhood Type 1 Diabetes: The Accelerating Increase by Time Tends to Level off in sweden.” Diabetes 60 (2): 577–81, https://doi.org/10.2337/db10-0813.Search in Google Scholar PubMed PubMed Central

de Araújo, T. V. B., R. A. D. A. Ximenes, D. D. B. Miranda-Filho, W. V. Souza, U. R. Montarroyos, A. P. L. de Melo, S. Valongueiro, M. D. F. P. M. de Albuquerque, C. Braga, S. P. B. Filho, M. T. Cordeiro, E. Vazquez, D. D. C. S. Cruz, C. M. P. Henriques, L. C. A. Bezerra, P. M. D. S. Castanha, R. Dhalia, E. T. A. Marques-Júnior, C. M. T. Martelli, L. C. Rodriques, C. Dhalia, M. Santos, F. Cortes, W. Kleber de Oliviera, G. Evelim Coelho, J. J. Cortez-Escalante, C. F. Campelo de Albuquerque de Melo, P. Ramon-Pardo, S. Aldighieri, J. Mendez-Rico, M. Espinal, L. Torres, A. Nassri Hazin, A. Van der Linden, M. Coentro, G. Santiago Dimech, R. Siqueira de Assunaco, P. Ismael de Carvalho, and V. Felix Oliveira. 2018. “Association between Microcephaly, Zika Virus Infection, and Other Risk Factors in Brazil: Final Report of a Case-Control Study.” The Lancet Infectious Diseases 18 (3): 328–36, https://doi.org/10.1016/s1473-3099(17)30727-2.Search in Google Scholar PubMed

Ding, P., and L. W. Miratrix. 2015. “To Adjust or Not to Adjust? Sensitivity Analysis of M-Bias and Butterfly-Bias.” Journal of Causal Inference 3 (1): 41–57. https://doi.org/10.1515/jci-2013-0021.Search in Google Scholar

Ding, P., and T. J. VanderWeele. 2016. “Sensitivity Analysis without Assumptions.” Epidemiology 27 (3): 368. https://doi.org/10.1097/ede.0000000000000457.Search in Google Scholar PubMed PubMed Central

Flanders, W. D., and M. J. Khoury. 1990. “Indirect Assessment of Confounding: Graphic Description and Limits on Effect of Adjusting for Covariates.” Epidemiology 1 (3): 239–46. https://doi.org/10.1097/00001648-199005000-00010.Search in Google Scholar PubMed

Flanders, W. D., and D. Ye. 2019. “Limits for the Magnitude of M-Bias and Certain Other Types of Structural Selection Bias.” Epidemiology 30 (4): 501–8. https://doi.org/10.1097/ede.0000000000001031.Search in Google Scholar PubMed

Goetghebeur, E., S. le Cessie, B. De Stavola, E. Moodie, and I. Waernbaum. 2020. “Formulating Causal Questions and Principled Statistical Answers.” Statistics in Medicine 39 (30): 4922–48. https://doi.org/10.1002/sim.8741.Search in Google Scholar PubMed PubMed Central

Greenland, S. 2003. “Quantifying Biases in Causal Models: Classical Confounding vs. Collider-Stratification Bias.” Epidemiology 14 (3): 300–6. https://doi.org/10.1097/01.ede.0000042804.12056.6c.Search in Google Scholar

Greenland, S., J. Pearl, and J. Robins. 1999. “Causal Diagrams for Epidemiologic Research.” Epidemiology 10: 37–48. https://doi.org/10.1097/00001648-199901000-00008.Search in Google Scholar

Hernán, M. A. 2017. “Invited Commentary: Selection Bias without Colliders.” American Journal of Epidemiology 185 (11): 1048–50. https://doi.org/10.1093/aje/kwx077.Search in Google Scholar PubMed PubMed Central

Hernán, M. A., S. Hernández-Díaz, and J. M. Robins. 2004. “A Structural Approach to Selection Bias.” Epidemiology 15 (5): 615–25. https://doi.org/10.1097/01.ede.0000135174.63482.43.

Hoekstra, C., Z. Z. Zhao, C. B. Lambalk, G. Willemsen, N. G. Martin, D. I. Boomsma, and G. W. Montgomery. 2008. “Dizygotic Twinning.” Human Reproduction Update 14 (1): 37–47. https://doi.org/10.1093/humupd/dmm036.

Huang, T. H., and W. C. Lee. 2015. “Bounding Formulas for Selection Bias.” American Journal of Epidemiology 182 (10): 868–72. https://doi.org/10.1093/aje/kwv130.

Lee, W. C. 2011. “Bounding the Bias of Unmeasured Factors with Confounding and Effect-Modifying Potentials.” Statistics in Medicine 30 (9): 1007–17. https://doi.org/10.1002/sim.4151.

Lipsitch, M., E. T. Tchetgen, and T. Cohen. 2010. “Negative Controls: A Tool for Detecting Confounding and Bias in Observational Studies.” Epidemiology 21 (3): 383. https://doi.org/10.1097/ede.0b013e3181d61eeb.

Liu, W., M. A. Brookhart, S. Schneeweiss, X. Mi, and S. Setoguchi. 2012. “Implications of M Bias in Epidemiologic Studies: A Simulation Study.” American Journal of Epidemiology 176 (10): 938–48. https://doi.org/10.1093/aje/kws165.

MacLehose, R. F., S. Kaufman, J. S. Kaufman, and C. Poole. 2005. “Bounding Causal Effects under Uncontrolled Confounding Using Counterfactuals.” Epidemiology 16 (4): 548–55. https://doi.org/10.1097/01.ede.0000166500.23446.53.

Manski, C. F. 1990. “Nonparametric Bounds on Treatment Effects.” The American Economic Review 80 (2): 319–23.

Mayeda, E. R., E. J. Tchetgen Tchetgen, M. C. Power, J. Weuve, H. Jacqmin-Gadda, J. R. Marden, E. Vittinghoff, N. Keiding, and M. M. Glymour. 2016. “A Simulation Platform for Quantifying Survival Bias: An Application to Research on Determinants of Cognitive Decline.” American Journal of Epidemiology 184 (5): 378–87. https://doi.org/10.1093/aje/kwv451.

McCandless, L. C., P. Gustafson, and A. Levy. 2007. “Bayesian Sensitivity Analysis for Unmeasured Confounding in Observational Studies.” Statistics in Medicine 26 (11): 2331–47. https://doi.org/10.1002/sim.2711.

Patterson, C. C., G. G. Dahlquist, E. Gyürüs, A. Green, G. Soltész, and E. S. Group. 2009. “Incidence Trends for Childhood Type 1 Diabetes in Europe during 1989–2003 and Predicted New Cases 2005–20: A Multicentre Prospective Registration Study.” The Lancet 373 (9680): 2027–33. https://doi.org/10.1016/s0140-6736(09)60568-7.

Pizzi, C., B. De Stavola, F. Merletti, R. Bellocco, I. dos Santos Silva, N. Pearce, and L. Richiardi. 2011. “Sample Selection and Validity of Exposure–Disease Association Estimates in Cohort Studies.” Journal of Epidemiology & Community Health 65 (5): 407–11. https://doi.org/10.1136/jech.2009.107185.

Robins, J. M. 1989. “The Analysis of Randomized and Non-randomized AIDS Treatment Trials Using a New Approach to Causal Inference in Longitudinal Studies.” In Health Service Research Methodology: A Focus on AIDS, edited by L. Sechrest, H. Freeman, and A. Mulley, 113–59. US Public Health Service, National Center for Health Services Research.

Rubin, D. B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701. https://doi.org/10.1037/h0037350.

Silva, A. A., M. A. Barbieri, M. T. Alves, C. A. Carvalho, R. F. Batista, M. R. Ribeiro, F. Lamy-Filho, Z. C. Lamy, V. C. Cardoso, R. C. Cavalli, V. M. Simões, and H. Bettiol. 2018. “Prevalence and Risk Factors for Microcephaly at Birth in Brazil in 2010.” Pediatrics 141 (2): e20170589. https://doi.org/10.1542/peds.2017-0589.

Sjölander, A. 2020. “A Note on a Sensitivity Analysis for Unmeasured Confounding, and the Related E-Value.” Journal of Causal Inference 8 (1): 229–48. https://doi.org/10.1515/jci-2020-0012.

Smith, L. H. 2020. “Selection Mechanisms and Their Consequences: Understanding and Addressing Selection Bias.” Current Epidemiology Reports 7: 179–89. https://doi.org/10.1007/s40471-020-00241-6.

Smith, L. H., and T. J. VanderWeele. 2019. “Bounding Bias Due to Selection.” Epidemiology 30 (4): 509–16. https://doi.org/10.1097/ede.0000000000001032.

Smith, L. H., M. B. Mathur, and T. J. VanderWeele. 2021. “Multiple-bias Sensitivity Analysis Using Bounds.” Epidemiology 32 (5): 625–34. https://doi.org/10.1097/ede.0000000000001380.

VanderWeele, T. J., and P. Ding. 2017. “Sensitivity Analysis in Observational Research: Introducing the E-Value.” Annals of Internal Medicine 167 (4): 268–74. https://doi.org/10.7326/m16-2607.

Waernbaum, I., G. Dahlquist, and T. Lind. 2019. “Perinatal Risk Factors for Type 1 Diabetes Revisited: A Population-Based Register Study.” Diabetologia 62 (7): 1173–84. https://doi.org/10.1007/s00125-019-4874-5.

Waldhoer, T., B. Rami, E. Schober, and A. D. I. S. Group. 2008. “Perinatal Risk Factors for Early Childhood Onset Type 1 Diabetes in Austria–A Population-Based Study (1989–2005).” Pediatric Diabetes 9 (Part I): 178–81. https://doi.org/10.1111/j.1399-5448.2008.00378.x.

Whitcomb, B. W., and P. F. McArdle. 2016. “Collider-stratification Bias Due to Censoring in Prospective Cohort Studies.” Epidemiology 27 (2): e4–5. https://doi.org/10.1097/ede.0000000000000432.

Received: 2022-02-17
Accepted: 2022-11-23
Published Online: 2022-12-21

© 2022 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 11.9.2025 from https://www.degruyterbrill.com/document/doi/10.1515/em-2022-0108/html