An Instrumental Variables Design for the Effect of Emergency General Surgery

Luke Keele; Catherine E. Sharoky; Morgan M. Sellers; Chris J. Wirtalla; Rachel R. Kelz

doi:10.1515/em-2017-0012

Published/Copyright: October 2, 2018

From the journal Epidemiologic Methods Volume 7 Issue 1

Abstract

Confounding by indication is a critical challenge in evaluating the effectiveness of surgical interventions using observational data. The threat from confounding is compounded when using medical claims data due to the inability to measure risk severity. If there are unobserved differences in risk severity across patients, treatment effect estimates based on methods such as multivariate regression may be biased in an unknown direction. A research design based on instrumental variables offers one possibility for reducing bias from unobserved confounding compared to risk adjustment with observed confounders. This study investigates whether a physician’s preference for operative care is a valid instrumental variable for studying the effect of emergency surgery. We review the plausibility of the necessary causal assumptions in an investigation of the effect of emergency general surgery (EGS) on inpatient mortality among adults using medical claims data from Florida, Pennsylvania, and New York in 2012–2013. In a departure from the extant literature, we use the framework of stochastic monotonicity, which is more plausible in the context of a preference-based instrument. We compare estimates from an instrumental variables design to estimates from a design based on matching that assumes all confounders are observed. Estimates from matching show lower mortality rates for patients that undergo EGS compared to estimates based on the instrumental variables framework. Results vary substantially by condition type. We also present sensitivity analyses as well as bounds for the population level average treatment effect. We conclude with a discussion of the interpretation of estimates from both approaches.

Keywords: Instrumental variables; emergency surgery; research design

1 Introduction

For a large number of acute conditions, treatment by emergency general surgery (EGS) may be more effective than non-operative care in reducing mortality. Assessing the effectiveness of EGS, however, is difficult since allocation of EGS is likely subject to confounding by indication. Bias from confounding by indication is often present when patients are selected to receive medical treatments based on prognostic factors that indicate which patient would benefit from the treatment. For example, indications for treatment via EGS, such as physiological measures, affect the likelihood of both the treatment and the outcome. To reduce bias from confounding by indication, methods for risk adjustment (RA) such as multivariate regression analysis and propensity score matching or weighting may be applied. However, many studies of EGS use medical claims data, and these data contain few measures of risk severity. As such, study designs based on RA methods may fail to completely remove bias given that prognostic factors are incompletely recorded.

One alternative to RA methods is the use of the instrumental variable (IV) framework. A valid IV would provide a consistent estimate of the effect of EGS in the presence of unobserved confounding between the exposure and outcome (Angrist et al. 1996). While IV designs rely on untestable assumptions, those assumptions can be judged plausible through the careful use of quantitative and qualitative evidence (Baiocchi et al. 2014). In this study, we propose using a physician’s preference for operative care as an instrument for EGS. Surgeons vary in their preference for surgery, which we define as the proportion of times they perform surgery in similar clinical scenarios. We create similar clinical scenarios by limiting the patient population to a specific set of diagnoses with the same emergency admission status. For example, consider two patients in the emergency department with a perforated gastric ulcer who are similar on more than 50 different factors including medical comorbidities, age, etc. We use this preference for operative care as an arbitrary or haphazard nudge for EGS. Our IV design replaces the assumption of no unmeasured confounding with the assumption that a physician’s preference for surgical care affects the outcomes of patients only through the receipt of surgery. We first review the plausibility of the IV assumptions using both qualitative and quantitative evidence. We argue that the standard deterministic monotonicity assumption is unlikely to hold and invoke stochastic monotonicity (Small et al. 2017). Using claims data, we conduct an analysis based on the proposed instrument. Our study utilizes near-far matching to further strengthen the plausibility of the IV assumptions (Baiocchi et al. 2010, 2012). We contrast the IV estimates with those from a design based on RA via matching. We then present multiple sensitivity analyses and bounds for the population level average treatment effect. Finally, we include supplementary analyses as recommended by Swanson and Hernán (2013).

2 Data sources

Our study uses a hybrid dataset linking the American Medical Association (AMA) Physician Masterfile with all-payer hospital discharge claims from New York, Florida and Pennsylvania in 2012–2013. These states permit the linkage of patient claims to surgeon and hospital characteristics. We restricted the study population to all patients admitted for inpatient care emergently, urgently, or through the emergency department with a diagnosis of an acute general surgical condition.

We classified acute general surgical condition types using a modified list of 124 International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes that represent the scope of emergency general surgery (Shafi et al. 2013). All subjects were categorized into one of nine possible surgical conditions (resuscitation, general abdominal, upper gastrointestinal, colorectal, hernia, intestinal obstruction, hepatobiliary, skin and soft tissue, and vascular) and 51 specific acute surgical sub-conditions. See the appendix for the full list of sub-conditions and all related ICD-9-CM diagnosis codes. The initial cohort consisted of 605,498 patients that presented emergently with a diagnosis for a condition where either non-operative or operative management may be considered.

We denote our outcome, inpatient mortality, as Y ∈ {0, 1}. The exposure or treatment of primary interest is whether a patient was managed operatively in an emergency setting, which we denote as D. For D = 1, a patient received operative care including an emergency surgical procedure for one of the 51 acute sub-conditions. For D = 0, the patient’s condition was managed through non-surgical methods.

Patient sociodemographic and clinical characteristics were abstracted from the claims datasets. For each patient, a measure of frailty was calculated using a set of specific ICD-9-CM codes which represent clinical manifestations of frail patients in administrative data (Kim and Schneeweiss 2014). We also defined an indicator for severe sepsis using the Angus implementation of the severe sepsis algorithm (Angus et al. 2001). Any patients with an explicit diagnosis of severe sepsis (995.92) or septic shock (785.52), as well as patients with an ICD-9-CM code representing infection and a concomitant ICD-9-CM code reflecting acute organ dysfunction, were classified as having severe sepsis (Iwashyna et al. 2014). Finally, we used Elixhauser indices to define 31 comorbidities (Elixhauser et al. 1998).

Surgeons were excluded if they could not be identified in the AMA Masterfile, did not meet criteria for general surgery training, did not attend an allopathic program, or did not train within the United States. Surgeons also had to perform at least 5 operations for one of the 51 specific acute surgical sub-conditions per year within the two-year study time-frame. Using these inclusion criteria, the cohort included 4,094 surgeons. Surgeon age and experience were abstracted from the AMA Physician Masterfile. To better understand whether a surgeon’s preference for operative care was correlated with surgical skill, we developed two additional surgeon level covariates designed to measure surgeon quality. Using a separate subset of the data, for each surgeon, we calculated the proportion of times a patient died or had a prolonged length of stay (pLOS) when he or she operated. We defined these measures as surgical mortality for past procedures and pLOS for past procedures.

3 An IV study design

Next, we outline our proposed IV design. Here, we review the causal assumptions and carefully assess the plausibility of each assumption in the context of our study. The goal is to identify and estimate the causal effect of D on Y. Assuming that all the common causes of D and Y are fully observed is implausible, since measures of risk severity are unrecorded in the medical claims data we use. Thus, we seek to find an instrument, Z, for D. An instrument is an as-if random nudge that encourages exposure to D and affects the outcome only to the extent that it alters the treatment received. A valid instrument allows us to identify the causal effect of D on Y in the presence of unmeasured confounders for a sub-population of patients. See Baiocchi et al. (2014), Ertefaie et al. (2017), and Angrist and Pischke (2009) for overviews of instruments.

3.1 The instrument: Physician preference for operative management

We use a physician’s preference for operative management as an instrument for EGS. We rely on the observation that surgeons demonstrate differing preferences for operative versus non-operative care. Our instrument is an example of a broad class of instruments based on a physician’s treatment preferences, which have been widely used to study the effectiveness of drugs (Brookhart and Schneeweiss 2007). To our knowledge, this is the first use of a preference-based instrument defined at the level of a surgeon. Like all preference-based instruments, our measured instrument does not have a direct causal effect on D, since it is a proxy for the physician’s true preference for operative care. Thus our measured IV is a surrogate IV (Hernán and Robins 2006).

We used random sample splitting to separate the measurement of the instrument from the use of the instrument in the data analysis. This avoids bias from using the same data twice (Bound et al. 1995). For each surgeon, we randomly split his or her patient population in half. Using one half of the data, we calculated the proportion of times a surgeon operated for the population of patients they treated with the 51 specific acute conditions. This measure serves as the instrument for the remaining patient population. For the surgeons in our data, this measure, expressed as a percentage, varied from less than 1% to 100%, with an average of 59%. As such, there is wide variability in preferences for operative care across surgeons. Often instruments of this type are defined at the hospital level or geographically (Brooks et al. 2003; Stukel et al. 2007). We could easily measure the instrument at the hospital level. The data, however, provide evidence that this is not an optimal measurement strategy. If we measure the instrument at the hospital level, the measure varies from 1% to 38% with an average of 16%. As such, the variation in the instrument is much compressed. This is not surprising, since the decision to operate in a given emergent admission primarily resides with the individual surgeon. Moreover, the correlation between the instrument and the outcome is also weaker at the hospital level. Thus defining the instrument at the hospital level obscures considerable variation in whether a patient is treated operatively or not.
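The sample-splitting construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the record layout (`surgeon_id`, `operated`), the function name `split_sample_instrument`, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_sample_instrument(records, seed=0):
    """Randomly split each surgeon's patients in half.

    records: list of (surgeon_id, operated) tuples, with operated in {0, 1}.
    Returns (preference, analysis): preference maps surgeon_id to the share
    of measurement-half patients who were operated on; analysis holds the
    held-out records as (surgeon_id, operated, instrument_value).
    """
    rng = random.Random(seed)
    by_surgeon = defaultdict(list)
    for surgeon_id, operated in records:
        by_surgeon[surgeon_id].append(operated)

    preference, analysis = {}, []
    for surgeon_id, outcomes in by_surgeon.items():
        shuffled = outcomes[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        measure, held_out = shuffled[:half], shuffled[half:]
        if not measure:
            continue  # a single patient cannot be split; skip this surgeon
        preference[surgeon_id] = sum(measure) / len(measure)
        # The held-out half carries the instrument measured on the other half.
        analysis.extend((surgeon_id, d, preference[surgeon_id]) for d in held_out)
    return preference, analysis


# Hypothetical toy data: surgeon "A" operates often, surgeon "B" rarely.
records = [("A", 1)] * 8 + [("A", 0)] * 2 + [("B", 1)] * 2 + [("B", 0)] * 8
preference, analysis = split_sample_instrument(records)
```

Because the instrument value attached to each analysis record is computed only from the other half of that surgeon's patients, a patient's own treatment never enters the instrument used for that patient.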

3.2 Causal assumptions

Next, we review and assess the necessary assumptions for our IV to identify the effect of D on Y. We begin with general causal assumptions and then proceed to the specific IV assumptions.

First, we assume that the stable unit treatment value assumption (SUTVA) holds (Rubin 1986). SUTVA comprises the following two components: (1) the treatment levels of D (1 and 0) adequately represent all versions of the treatment and (2) a subject’s outcomes are not affected by other subjects’ exposures. The first component of SUTVA is often referred to as the consistency assumption in the epidemiology literature. In our study, D represents a wide variety of surgical procedures, from the removal of an appendix to wound debridement. However, as we outline below, we restrict comparisons within surgical conditions such that within each subgroup of patients the possible treatment options are limited and highly similar. That is, while there may be some variation in how patients receive an appendectomy, we assume that this variation in treatment corresponds to the same potential outcomes. The second component of SUTVA rules out a subject’s outcomes being affected by other subjects’ exposures. In general, it is difficult to conceive of how the selection of surgical versus non-surgical care for one patient could affect outcomes for other patients even within the same hospital. One possibility is through resource constraints. Surgeons and emergency operating rooms are a limited resource, and whether one patient receives surgery could be affected by factors related to resource availability or access. However, in most cases resource constraints would only result in a delay of surgery instead of switching a patient to non-operative management.

Next, we turn to specific IV assumptions. Identification in the IV framework typically relies on the following four assumptions: (1) ignorable (as-if random) assignment of the instrument to patients; (2) the instrument must have a nonzero effect on the exposure; (3) no direct effect of the instrument on the outcome, also known as the exclusion restriction; and (4) monotonicity (Angrist et al. 1996).

Ignorability. First, it must be the case that patients are assigned surgeons with high or low preferences for operative care in an as-if random fashion. That is, care from a surgeon with a higher preference for surgery should tell us nothing of clinical relevance about that surgeon’s patients. The fact that patients are receiving emergency care bolsters this assumption, since it is unlikely that patients are paired with a surgeon based on that doctor’s preference for operative treatments in an acute care setting. We assessed whether assignment to the instrument appears as-if random by calculating covariate balance for patients cared for by surgeons above and below the median preference for operative care. Note that this is only a partial assessment of this assumption, since we cannot test whether unobservables are balanced. Moreover, while the IV assumption does not strictly imply balance on observables, such diagnostics are widely recommended (Baiocchi et al. 2012; 2013).

Table 1 contains means and standardized differences for selected baseline covariates. The standardized difference is the difference in means divided by the pooled standard deviation. A standardized difference of one implies that the difference in means is equal to one standard deviation. Table 1 also contains balance statistics for whether patients received EGS or not. In almost every case, we find that standardized differences are lower when stratifying by the IV rather than the exposure, which implies that allocation of physician preference appears more as-if random than the allocation of EGS. Of the variables recorded in Table 1, we might be most concerned about measures that are proxies for surgical skill. That is, we might be particularly concerned that those surgeons that prefer to operate are more skilled. The results in Table 1 do little to suggest that is the case. Surgeons with a higher preference for operating are very similar to surgeons with a lower preference for operating in terms of age and experience. Even more striking is that surgeons with a higher preference for operating had nearly identical levels of mortality and prolonged lengths of stay for past procedures. Thus, at best, our measures of surgical skill are weakly correlated with our proposed instrument. The appendix contains full balance statistics for all measured covariates.
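As a minimal sketch of the standardized difference defined above, the function below divides the absolute difference in means by a pooled standard deviation. The pooling rule used here (the square root of the average of the two group variances) is a common convention and an assumption on our part; the text does not state which pooling formula was used. The function name and the toy ages are hypothetical.

```python
import math
import statistics


def standardized_difference(group_a, group_b):
    """Absolute difference in means divided by the pooled standard deviation.

    Pooled SD is taken as sqrt((var_a + var_b) / 2), an assumed convention.
    """
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt((var_a + var_b) / 2)
    return abs(mean_a - mean_b) / pooled_sd


# Hypothetical ages for patients of surgeons above and below the median preference.
ages_above = [50, 55, 60, 65, 70]
ages_below = [55, 60, 65, 70, 75]
std_diff = standardized_difference(ages_above, ages_below)
```

The same function can be applied to a covariate split by treatment status and then by instrument status, which is the comparison Table 1 reports side by side.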

Next, we use bias ratios to further assess whether an IV design may have less bias than a design based on RA (Brookhart and Schneeweiss 2007; Baiocchi et al. 2014). For this analysis, we first calculate prevalence differences under each design. The prevalence difference under the IV design is defined as the difference in means for baseline covariates across a binary measure of the IV. We created two binary measures of the IV. The first stratifies the patient population at the median value of the IV. The second compares patient covariates for surgeons in the top quartile of the IV to patient covariates for surgeons in the 25th to 75th percentile of the IV. The prevalence difference under an RA design is the difference in means for baseline covariates across levels of the exposure – undergoing EGS or not. Following Baiocchi et al. (2014), we then calculated bias ratios as the IV prevalence difference divided by strength of the IV divided by the RA prevalence difference. Bias ratios of less than one indicate less bias associated with the IV design as compared to RA (Baiocchi et al. 2014). For 110 measured confounders, the IV design had less bias for 77% of the covariates using the median split for the IV. Using the split at the upper quartile, the IV design had less bias for 62% of the covariates.
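The bias ratio described above can be written as a short function. This is a sketch under stated assumptions: we take "strength of the IV" to mean the difference in the proportion treated between the two instrument groups, and we report the absolute value of the ratio so that it can be compared to one. The function name and the input values are hypothetical and are not taken from Table 1.

```python
def bias_ratio(iv_prevalence_diff, iv_strength, ra_prevalence_diff):
    """(IV prevalence difference / IV strength) / RA prevalence difference.

    Returned as an absolute value (an assumption) so that values below one
    indicate less covariate imbalance under the IV design than under RA.
    """
    if iv_strength == 0 or ra_prevalence_diff == 0:
        raise ValueError("IV strength and RA prevalence difference must be nonzero")
    return abs((iv_prevalence_diff / iv_strength) / ra_prevalence_diff)


# Hypothetical covariate: prevalence differs by 0.01 across IV groups and by
# 0.05 across treatment groups; the IV shifts treatment probability by 0.40.
ratio = bias_ratio(iv_prevalence_diff=0.01, iv_strength=0.40, ra_prevalence_diff=0.05)
```

Dividing by the IV strength matters: a weak instrument inflates the scaled imbalance, so a covariate can look better balanced across instrument groups and still produce a ratio above one.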

Table 1:

Distribution of selected baseline characteristics by surgical or non-operative care or the physician preference for surgery instrumental variable, PA, FL, NY 2012–2013.

	Surgical care			Instrumental variable
	Treated (EGS)	Control (No surgery)		Above median	Below median
	(N=184,689)	(N=119,092)	Std. Diff	(N=176,543)	(N=127,238)	Std. Diff

Age	55.86	60.38	0.24	55.91	60.03	0.21
No. Comorbidities	2.44	3.06	0.28	2.45	3.01	0.25
Income	52.01	51.92	0.01	52.25	51.60	0.05
Income missing	0.04	0.04	0.00	0.04	0.04	0.02
Female	0.54	0.51	0.06	0.54	0.52	0.05
Hispanic	0.15	0.13	0.07	0.15	0.13	0.05
White	0.72	0.69	0.06	0.71	0.70	0.03
African-American	0.14	0.17	0.09	0.14	0.17	0.07
Other racial cat.	0.14	0.13	0.02	0.14	0.13	0.04
Sepsis	0.14	0.17	0.08	0.15	0.17	0.06
Disability	0.07	0.11	0.15	0.07	0.11	0.13
Age 18–35	0.19	0.12	0.19	0.19	0.12	0.17
Age 36–48	0.17	0.15	0.06	0.17	0.15	0.05
Age 49–58	0.18	0.18	0.01	0.18	0.18	0.01
Age 59–67	0.16	0.17	0.03	0.16	0.17	0.03
Age 68–77	0.15	0.17	0.05	0.15	0.17	0.05
Age 78+	0.16	0.22	0.15	0.16	0.21	0.13
Medicare	0.39	0.49	0.20	0.39	0.48	0.18
Medicaid	0.13	0.13	0.01	0.13	0.12	0.02
Commercial insurance	0.36	0.30	0.14	0.36	0.30	0.13
Self	0.08	0.06	0.08	0.08	0.06	0.07
Other type of payment	0.04	0.03	0.03	0.04	0.04	0.01
Surgeon age	54.37	54.20	0.02	54.41	54.16	0.02
Years experience	17.08	16.73	0.03	17.12	16.70	0.04
Experience missing	0.04	0.04	0.03	0.04	0.04	0.03
Surgical mortality - Past procedures	0.03	0.04	0.28	0.03	0.03	0.23
pLOS - Past procedures	0.08	0.09	0.11	0.08	0.09	0.13

We performed additional analyses to explore the relative differences in balance between the IV and RA approaches. First, we plotted scaled and unscaled covariate balance estimates with confidence intervals (Davies 2015; Jackson and Swanson 2015). Figure 1 contains these plots for key covariates. The plot clearly demonstrates that the differences are relatively small and the confidence intervals typically overlap. Plots for the full set of covariates are included in the appendix. We also tested whether these differences were statistically significant following Davies et al. (2017). We found that in many cases the differences were not statistically significant. See the appendix for results of this analysis. These diagnostic plots also show that measures of surgical skill are at best weakly associated with the proposed instrument.

Figure 1:

Bias plots for selected covariates.

IV Strength. Next, we evaluated whether the instrument is correlated with the exposure. When patients have a physician above the median in terms of his or her preference for surgery, they are managed operatively 76% of the time. When patients have a physician below the median in his or her preference for surgery, they receive operative care 38% of the time. We also conducted a weak instrument test; the F-statistic is 29,000, well above the critical value threshold of 16.4 (Stock and Yogo 2005).
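To illustrate the kind of weak instrument check described above, the sketch below computes a first-stage F-statistic for a regression of treatment on a single binary instrument. The data are simulated: only the two treatment rates (76% and 38%) come from the text, while the sample size, seed, and function name are assumptions. It will not reproduce the reported F-statistic, which depends on the full cohort and model.

```python
import random


def first_stage_f(z, d):
    """F-statistic from regressing D on one instrument Z with an intercept.

    With a single regressor, F = R^2 / (1 - R^2) * (n - 2).
    """
    n = len(z)
    mean_z, mean_d = sum(z) / n, sum(d) / n
    s_zz = sum((zi - mean_z) ** 2 for zi in z)
    s_dd = sum((di - mean_d) ** 2 for di in d)
    s_zd = sum((zi - mean_z) * (di - mean_d) for zi, di in zip(z, d))
    r_squared = s_zd ** 2 / (s_zz * s_dd)
    return r_squared / (1 - r_squared) * (n - 2)


# Simulated (hypothetical) cohort: Z = 1 for surgeons above the median preference.
rng = random.Random(1)
z = [1 if rng.random() < 0.5 else 0 for _ in range(2000)]
d = [1 if rng.random() < (0.76 if zi else 0.38) else 0 for zi in z]

f_stat = first_stage_f(z, d)
passes_threshold = f_stat > 16.4  # threshold cited in the text
```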

Exclusion Restriction. Next, it must be the case that any effect of a surgeon’s propensity to operate on the outcome is only a consequence of the medical effects of operative management. A violation of this assumption would occur if receipt of care from a surgeon with a strong preference for surgery includes other aspects of care that might affect outcomes. For example, if nursing care provided to patients of high preference surgeons was superior to that provided to patients of low preference surgeons, this would violate the exclusion restriction. We bolster this assumption by comparing patients who receive care in the same hospital. This should hold fixed other system factors of care that might affect outcomes. For patients within the same hospital, it is unlikely that a surgeon’s preference for operative care will have any effect on outcomes except through receipt of EGS. See below for details on how we used matching to implement within hospital comparisons in the statistical analysis.

Monotonicity. In an IV design, subjects fall into four different classes: compliers, always-takers, never-takers, and defiers (Angrist et al. 1996). Compliers are patients who are exposed to surgery if they were encouraged by the IV but would not otherwise have surgery. Always-takers have surgery regardless of assigned IV status, and never-takers are never exposed to surgery regardless of IV status. Defiers are patients who are only exposed to the treatment (surgery) if not encouraged or are unexposed if encouraged. If no defiers are present, the IV estimate is well-defined for compliers. Many IV designs in epidemiology assume that no patients are defiers, which is generally referred to as the monotonicity assumption.

Recent work has critiqued the monotonicity assumption in the context of preference-based instruments (Swanson and Hernán, 2014, 2017). Monotonicity violations are likely when the instrument is not delivered in a uniform way to all subjects. This occurs with preference-based instruments since the delivery of the preference to patients varies across physicians. In our context, for the monotonicity assumption to hold it must be the case that there are no patients who would receive surgery when seen by a physician who usually does not prefer surgery but would not be given surgery when seen by a physician who usually prefers surgery.

In general, since a surgeon’s preferences represent a weighing of a variety of risks and benefits, there may be opportunities for a patient to be treated contrary to a physician’s preferences. As an example, assume we have two surgeons who work in the same hospital. Surgeon A generally prefers to operate but makes exceptions for hernia patients (because of some known contraindications). Surgeon B generally tries to avoid surgical treatments but makes exceptions for patients who are very healthy and can withstand the invasive nature of many procedures. Thus any patient that has a hernia but is also very healthy would be treated in a way contrary to each surgeon’s general preferences for operative care and is a defier.

Given the implausibility of the monotonicity assumption in our design, we adopt the framework of stochastic monotonicity (Small et al. 2017). The stochastic monotonicity assumption has two components. First, Z must be independent of the potential outcomes for Y and unobserved common causes of Z and Y. Second, if we stratify on all measured and unmeasured confounders of D and Y, then within each stratum there are more compliers than defiers. How does this translate into our context?

We must assume that for patients with similar unmeasured common causes of D and Y, the probability of surgery is at least as high when treated by a surgeon with a strong preference for operative care as it is for patients seen by a surgeon with a weak preference for operative care. This assumption appears reasonable. First, the primary unmeasured common cause of D and Y is the health status of the patient before surgery. Thus for stochastic monotonicity to hold, it must be true that for patients with similar unobserved risk indicators, the probability of surgery is the same or higher when they are treated by a surgeon who prefers to operate as it is when they are treated by a surgeon that does not prefer to operate. Given that many surgeons have a strong preference for operative care, this would seem highly plausible.

Invoking the stochastic monotonicity assumption does not change our estimation strategy. Small et al. (2017) show that the conditional and unconditional Wald estimators remain consistent under stochastic monotonicity. However, the target causal estimand differs. If we had invoked deterministic monotonicity, the target causal estimand would be the effect of surgery in the subpopulation of compliers, known as the local average treatment effect (LATE) (Angrist et al. 1996). Under stochastic monotonicity, the IV estimand is a weighted average of effects across subgroups, in which the subgroups for whom the IV has a stronger effect receive more weight. For example, as we noted above, we defined 9 surgical conditions that can be managed with operative and non-operative treatments. If surgeon preference for operative care has a stronger effect among those patients with one of the nine conditions, this subgroup will receive a greater weight in the IV estimate. However, these subgroups are not known a priori. We can seek to identify them by finding groups where the IV has the strongest effect. Below we conduct a sensitivity analysis to understand how the IV estimates might generalize to the larger patient population.
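The weighting idea can be made concrete with a small numerical sketch. The exact form of the weights below (subgroup share multiplied by the instrument's effect on treatment within that subgroup) is an assumption made for illustration; Small et al. (2017) give the formal result. The subgroup shares, first-stage effects, and treatment effects are hypothetical.

```python
def strength_weighted_effect(strata):
    """Weighted average of subgroup effects.

    Each stratum is a dict with keys:
      share       - fraction of patients in the subgroup
      first_stage - effect of the instrument on treatment in the subgroup
      effect      - treatment effect on the outcome in the subgroup
    Weights are share * first_stage (an assumed form), normalized to sum to one.
    """
    weights = [s["share"] * s["first_stage"] for s in strata]
    total = sum(weights)
    return sum(w * s["effect"] for w, s in zip(weights, strata)) / total


# Two hypothetical subgroups of equal size; the instrument moves treatment
# far more in the first subgroup than in the second.
strata = [
    {"share": 0.5, "first_stage": 0.6, "effect": -0.02},
    {"share": 0.5, "first_stage": 0.2, "effect": 0.04},
]
weighted = strength_weighted_effect(strata)
unweighted = sum(s["share"] * s["effect"] for s in strata)
```

In this toy case the weighted and unweighted averages have opposite signs, which shows why an IV estimate need not equal the population average treatment effect when instrument strength varies across subgroups.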

4 Data analysis

4.1 Matching

We used matching to both perform risk adjustment and increase the plausibility of the IV design. As such, we implement two separate matches. One we designate as the RA match, and one we designate as the IV match. In the RA match, we match by EGS status, and in the IV match, we match by IV status. For the IV design, we applied near-far matching. Near-far matching pairs patients such that differences in observed confounders are minimized while the contrast in instrument values is maximized (Baiocchi et al. 2010). In our application, this will produce matched pairs of patients that are similar in terms of baseline covariates, but less similar in terms of the physician’s preference for operative care. A near-far match will increase comparability while further strengthening the instrument. IV designs based on stronger instruments are more resistant to bias, and near-far matching allows us to further increase the strength of our instrument (Small and Rosenbaum 2008).

For both the RA and IV match, we applied the following matching constraints. First, we matched patients exactly within hospital. Exactly matching within hospitals controls for any relevant hospital level factors that contribute to patient level outcomes, and implies that the level and overall quality of care outside of surgery should be similar, which increases the plausibility of the exclusion restriction. Furthermore, matching within the hospitals removes the effects of potential differences in claims coding across hospitals. We also exactly matched on the nine indicators for surgical condition.

We applied near-fine balance constraints to the 51 indicators for surgical sub-conditions. Fine balance constrains two groups to be balanced on a particular variable without restricting matching on the variable within individual pairs (Rosenbaum et al. 2007). A match with fine balance on the surgical sub-conditions produces patients paired on the IV with identical marginal distributions for these indicators. Rosenbaum et al. (2007) use the term “fine balanced” since a variable has been balanced in terms of the marginal distribution, but units are not exactly matched on the variable. A near-fine balance constraint returns a finely balanced match when one is feasible, and otherwise minimizes the deviation from fine balance (Yang et al. 2012). Thus by applying near-fine balance to the surgical sub-conditions, we place no restriction on individual matched pairs: any one treated subject can be matched to any one control, but the marginal distribution of these indicators will be exactly or nearly exactly the same across levels of the IV. We implemented fine balance via a matching algorithm that finely balances the joint set of interactions among these variables (Pimentel et al. 2015). For the remaining covariates, we minimized the total of the within-pair distances on covariates as measured by the Mahalanobis distance (Mahalanobis 1936; Rubin 1980). Finally, we also applied optimal subsetting to discard matched pairs with high levels of imbalance on covariates (Rosenbaum 2012).
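The distinction between fine balance and exact matching can be checked with a few lines of code. The sketch below only evaluates a finished match; it does not perform the matching itself, which requires an optimization routine. The function name, the sub-condition labels, and the three pairs are hypothetical.

```python
from collections import Counter


def fine_balance_deviation(group_high, group_low):
    """Total absolute difference in category counts between two groups.

    Zero means the marginal distributions are identical (fine balance),
    even if individual pairs are not matched exactly on the category.
    """
    counts_high, counts_low = Counter(group_high), Counter(group_low)
    categories = set(counts_high) | set(counts_low)
    return sum(abs(counts_high[c] - counts_low[c]) for c in categories)


# Hypothetical matched pairs: (sub-condition of the high-IV patient,
# sub-condition of the low-IV patient).
pairs = [("appendicitis", "hernia"), ("hernia", "appendicitis"), ("ulcer", "ulcer")]
high = [p[0] for p in pairs]
low = [p[1] for p in pairs]

deviation = fine_balance_deviation(high, low)
exactly_matched = sum(a == b for a, b in pairs)
```

Here the deviation is zero although only one of the three pairs agrees on the sub-condition, which is exactly the situation fine balance permits.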

To assess the quality of the matches, we used the standardized difference after matching, which is calculated for a given covariate as the mean difference between matched patients divided by the pooled standard deviation before matching (Cochran and Rubin 1973). In the match, we attempted to produce standardized differences of less than 0.10, the recommended threshold (Silber et al. 2001; Rosenbaum 2010). We found that balancing all 110 covariates to this standard caused a considerable loss of sample size. We increased the sample size using the following analytic strategy. First, we improved the balance until fewer than 20 covariates had standardized differences of less than 0.30. For these covariates we applied additional covariate adjustment via regression. We settled on a match with 16 covariates that had standardized differences larger than 0.09. The largest of these was 0.23, and 14 of the covariates had standardized differences between 0.10 and 0.14. The appendix outlines an additional analytic strategy we used for the measures of surgical skill.

4.2 Outcome estimates

For the IV design, we estimated risk differences applying the Wald estimator via two-stage least squares after matching. For the RA design, we estimated risk differences using linear probability models. After matching, we also applied the weak instrument test (Stock and Yogo 2005). We corrected standard errors for clustering within surgeons under both RA and IV. We also sought to understand whether effects varied with condition type. We performed subgroup analyses for each of the nine major acute surgical condition types. To perform the subgroup analyses, after matching, outcomes were compared separately within each condition type to understand whether differences between operative and non-operative care differed by condition risk.
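For intuition, the Wald estimator with a binary instrument is the ratio of the instrument's effect on the outcome to its effect on the treatment. With one binary instrument and no covariates, two-stage least squares reduces to this ratio. The sketch below omits matching, covariate adjustment, and clustered standard errors, all of which the analysis described above includes; the function name and the toy data are hypothetical.

```python
def wald_estimate(z, d, y):
    """Wald estimator for a binary instrument.

    z: instrument (0/1), d: treatment received (0/1), y: outcome (0/1).
    Returns (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0]).
    """
    def mean_where(values, flag):
        selected = [v for v, zi in zip(values, z) if zi == flag]
        return sum(selected) / len(selected)

    reduced_form = mean_where(y, 1) - mean_where(y, 0)
    first_stage = mean_where(d, 1) - mean_where(d, 0)
    if first_stage == 0:
        raise ValueError("instrument has no effect on treatment")
    return reduced_form / first_stage


# Hypothetical data: 10 patients per instrument group.
z = [1] * 10 + [0] * 10
d = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6   # 80% vs 40% receive surgery
y = [1] * 1 + [0] * 9 + [1] * 2 + [0] * 8   # 10% vs 20% inpatient mortality

estimate = wald_estimate(z, d, y)
```

The denominator is the first-stage effect, so a weak instrument makes the ratio unstable; this is one reason the weak instrument test is reapplied after matching.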

As a robustness check, we also estimated IV results without matching. For these estimates, we used two-stage least squares only, conditioning on the same set of measured covariates. For these models, we also included fixed effects for hospital, the nine general conditions, and the 51 specific sub-conditions. We report these estimates in the appendix as they generally agreed with estimates based on matching.

5 Results

We identified an initial cohort of 2,629,928 patients who had one of the 51 acute conditions for which surgical treatment would be an option and who were emergency cases. In this initial cohort, approximately 15% received surgery. We then removed patients who received care from a surgeon that had fewer than 5 patients each year or performed fewer than 5 operations each year, which left a cohort of 605,498 patients. We then randomly split this cohort in half. We used one half to calculate IV values, and the other half for the data analysis. After removing six observations due to data errors, the final cohort used in the analysis consists of 303,778 patients. In this cohort, 184,677 patients underwent EGS, and 176,543 received care from a surgeon that had a higher than average level of preference for operative care.

Here, we first report unadjusted estimates and those based on the RA match. Table 2 contains the estimates of the effect of EGS for the unadjusted, risk adjusted and instrumental variables models. In the unadjusted data, outcomes differed by whether patients received operative or non-operative care. Patients that received EGS had lower overall mortality rates (risk difference (RD): – 0.9%, 95% CI, – 1.1, – 0.7). RA via matching produced 58,645 pairs matched exactly on hospital and nine condition types and finely balanced on 51 additional condition subtypes. The appendix contains complete balance statistics for all variables used in the match. In the matched sample, patients that underwent EGS had higher mortality rates (RD: 0.7%, 95% CI, 0.4, 0.9). Using the model-based estimates, we also conducted a Hausman test, which tests the difference between the IV and RA estimates. This test is interpreted as a test of exogeneity for the exposure. The test statistic from the Hausman test was 68 (P < 0.001), which is evidence against a risk adjusted approach. The appendix contains complete balance statistics for the near-far match.

The near-far match generated 37,636 pairs of patients that were matched to be similar on observed covariates, but more distant in terms of the physicians’ preference for surgery. That is, within each matched pair, one patient received care from a surgeon with a lower propensity to operate, and one patient received care from a surgeon with a higher propensity to operate. The average absolute standardized difference for 110 covariates was 0.04, with an IQR of 0.08. The F-value from a weak instrument test with the matched data was 24,588.56 (P < 0.001) with an R² of 0.25.
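As a rough plausibility check on the first-stage strength reported above, the sketch below converts an R² into an F-statistic for a single instrument. It assumes the R² comes from a regression of exposure on the instrument and an intercept only, which the source does not state; the function and parameter names are ours. With covariates in the first stage, the relevant quantity would be a partial F instead.

```python
def first_stage_f(r_squared, n_obs, n_instruments=1, n_other_params=1):
    """F-statistic implied by a first-stage R^2 (intercept-only baseline)."""
    df_resid = n_obs - n_instruments - n_other_params
    return (r_squared / n_instruments) / ((1.0 - r_squared) / df_resid)

# 37,636 matched pairs give 75,272 patients; R^2 is the rounded value 0.25.
n_matched = 2 * 37636
f_stat = first_stage_f(0.25, n_matched)
```

Under these assumptions the implied F is roughly 25,090, the same order of magnitude as the reported 24,588.56; the gap is consistent with rounding of the R², though the authors' exact specification may differ. Either value is far above the conventional rule-of-thumb threshold of 10 for a weak instrument.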

Table 2:

Estimates of the effect of emergency general surgery on mortality rates when compared to non-operative management: unadjusted, risk adjusted, and IV estimates, PA, FL, NY 2012–2013.

	Unadjusted	Risk adjusted	IV
Risk difference	– 0.9	0.7	2.3
95% CI	[– 1.1, – 0.7]	[0.4, 0.9]	[1.3, 3.3]
Mortality rate (per 1000)	– 9.2	6.9	23.1

Estimates by condition type

Skin and soft tissue

Risk difference	0.7	0.01	1.8
95% CI	[0.4, 1.1]	[– 0.3, 0.4]	[0.4, 3.1]
Mortality rate (per 1000)	7.4	0.5	17.8

General abdominal
Risk difference	5.5	2.9	5.1
95% CI	[4.6, 6.5]	[1.7, 4.1]	[2.0, 8.3]
Mortality rate (per 1000)	55.4	29.1	51.2

Hernia
Risk difference	– 0.3	– 0.5	0.5
95% CI	[– 0.8, 0.2]	[– 1.3, 0.4]	[– 0.8, 1.8]
Mortality rate (per 1000)	– 3.1	– 4.6	4.8

HPB
Risk difference	– 3.7	– 0.9	– 1.0
95% CI	[– 4.2, – 3.2]	[– 1.3, – 0.4]	[– 2.2, 0.2]
Mortality rate (per 1000)	– 37.3	– 8.6	– 10.1

Intestinal obstruction
Risk difference	1.6	1.0	– 1.6
95% CI	[1.3, 1.8]	[0.6, 1.4]	[– 3.4, 0.2]
Mortality rate (per 1000)	15.6	9.8	– 16.1

Resuscitation
Risk difference	0.2	2.5	3.2
95% CI	[– 1.7, 2.1]	[0.4, 4.6]	[– 3.1, 9.5]
Mortality rate (per 1000)	1.9	25.3	31.9

Upper GI
Risk difference	– 1.0	1.3	5.4
95% CI	[– 1.3, – 0.6]	[0.6, 2.0]	[3.5, 7.3]
Mortality rate (per 1000)	– 9.8	13.2	54.1

Vascular
Risk difference	7.4	3.9	13.5
95% CI	[6.2, 8.5]	[2.5, 5.4]	[8.8, 18.3]
Mortality rate (per 1000)	73.8	39.2	135.1

Colorectal
Risk difference	0.6	0.9	3.0
95% CI	[0.2, 0.9]	[0.5, 1.3]	[1.8, 4.2]
Mortality rate (per 1000)	5.8	9.2	29.8

The sign of the IV estimate (risk difference 2.3%, 95% CI, 1.3, 3.3) is the same as the estimate based on RA. However, the IV estimate of the effect of EGS on mortality is roughly three times larger than the estimate based on risk adjustment (2.3% vs 0.7%). Table 2 also contains the estimate for the effect of EGS on mortality by condition type. Figure 2 contains a graphical comparison of IV versus RA estimates. The overall estimate of the effect of EGS on mortality masks significant variation by condition type. For hernia conditions, EGS does not affect mortality under either RA or IV. For hepatobiliary (HPB) conditions, both estimates show that EGS reduces the risk of death. In the skin and soft tissue condition category, the RA and IV estimates conflict (RA: 0.01%, 95% CI, – 0.3, 0.4 vs IV: 1.8%, 95% CI, 0.4, 3.1). Finally, the effect of EGS on mortality is largest for vascular conditions, and the IV estimate is substantially larger than the estimate based on RA (RA: 3.9%, 95% CI, 2.5, 5.4 vs IV: 13.5%, 95% CI, 8.8, 18.3).

Figure 2:

IV and risk adjusted estimates of the effect of EGS overall and by condition type, PA, FL, NY 2012–2013. Horizontal bars represent 95% confidence intervals adjusted for clustering at the surgeon level.

In the near-far matched pairs, patients with a surgeon less likely to operate received surgery 21% of the time, while patients who received care from a surgeon who was more likely to operate underwent EGS 63% of the time. Using the methods in Small et al. (2017), we estimated that 38% of the matched sample are compliers. Therefore, our LATE estimates are pertinent to 38% of the sample. The magnitude of the effect of EGS for the rest of the study population is unknown without additional assumptions. We calculated descriptive statistics for the IV population and compared these descriptive statistics to those from the rest of the study population. Table 3 contains results for a subset of the covariates. The appendix contains the complete results. The IV-weighted population has more patients with a disability (13% vs 8%) and a sepsis diagnosis (22% vs 16%). The IV population also has more older patients and thus more patients with Medicare. The two populations differ minimally in terms of race or surgeon characteristics such as age, experience, or past performance.

Table 3:

The IV weighted population of compliers compared to the overall patient population, PA, FL, NY 2012–2013.

	Complier sub-population means	Population means	Ratio of means
Age	61.84	57.63	1.07
No. Comorbidities	3.35	2.69	1.25
Income (1000s)	52.08	51.97	1.00
Surgeon age	53.91	54.30	0.99
Years experience	16.59	16.94	0.98
Surgical mortality - Past procedures	0.04	0.03	1.18
pLOS - Past procedures	0.10	0.08	1.23
Income missing	0.03	0.04	0.82
Female	0.51	0.53	0.97
Hispanic	0.14	0.14	0.96
White	0.71	0.71	1.00
African-American	0.17	0.15	1.11
Other racial cat.	0.12	0.14	0.86
Sepsis	0.22	0.16	1.40
Disability	0.13	0.08	1.51
Age 18–35	0.10	0.16	0.61
Age 36–48	0.14	0.16	0.84
Age 49–58	0.18	0.18	1.00
Age 59–67	0.17	0.16	1.09
Age 68–77	0.19	0.16	1.18
Age 78+	0.23	0.18	1.25
Medicare	0.52	0.43	1.21
Medicaid	0.11	0.13	0.87
Commercial insurance	0.28	0.34	0.82
Self	0.06	0.07	0.81
Other type of payment	0.03	0.04	0.97
Experience missing	0.02	0.04	0.63

5.1 Sensitivity analyses

Finally, we perform several different sensitivity analyses to understand whether our conclusions would be altered if key assumptions were violated. Following the recommendation in Swanson and Hernán (2013), we report nonparametric bounds for the IV estimate (Balke and Pearl 1997). These bounds relax the assumption of no defiers and the exclusion restriction. The nonparametric bounds for mortality are – 39% and 22%, which are consistent with both a strongly harmful and a strongly protective effect. The bounds demonstrate that a considerable amount of information is provided by the two key IV assumptions. However, nonparametric bounds for mortality based on the average treatment effect using the method of Manski (1990) are wider: – 61% and 39%.
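To show why the no-assumption bounds above are so wide, the following sketch computes Manski-style worst-case bounds on the average treatment effect for a bounded outcome. The data are hypothetical and the function name is ours; this is an illustration of the general idea, not the authors' implementation. Each unobserved potential outcome is replaced by the smallest and largest value it could take.

```python
import numpy as np

def manski_ate_bounds(y, d, y_min=0.0, y_max=1.0):
    """Worst-case bounds on E[Y(1)] - E[Y(0)] with no assumptions."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d)
    p = d.mean()               # share of patients exposed
    m1 = y[d == 1].mean()      # observed mean outcome among the exposed
    m0 = y[d == 0].mean()      # observed mean outcome among the unexposed
    # Fill in the unobserved potential outcomes with the extreme values.
    ey1_lo, ey1_hi = m1 * p + y_min * (1 - p), m1 * p + y_max * (1 - p)
    ey0_lo, ey0_hi = m0 * (1 - p) + y_min * p, m0 * (1 - p) + y_max * p
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo

# Hypothetical cohort: 60 exposed (3 deaths) and 40 unexposed (2 deaths).
d = np.array([1] * 60 + [0] * 40)
y = np.array([1] * 3 + [0] * 57 + [1] * 2 + [0] * 38)
lower, upper = manski_ate_bounds(y, d)
```

For a binary outcome these bounds always have a width of exactly one, whatever the data, which matches the 100 percentage point span of the reported interval from – 61% to 39%.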

Next, to understand whether the estimated effect of EGS based on IV is possibly the result of an unobserved confounder between the instrument and the outcome, we apply a form of sensitivity analysis outlined in Small and Rosenbaum (2008). The sensitivity analysis indicated that the mortality finding would remain statistically significant in the presence of a confounder that increased the odds of both EGS and higher mortality by 5%. Thus, our estimates are relatively sensitive to a hidden confounder. For vascular procedures, an unmeasured confounder would have to increase the odds of both EGS and higher costs by 220%. For vascular conditions, it would take a substantial amount of unmeasured confounding to qualitatively change our conclusions.

Next, we perform a sensitivity analysis that allows us to understand whether the estimates would change as the proportion of defiers increases (Small et al. 2017). Under this method, we calculate bounds on the IV estimate as the possible percentage of the defiers increases. We found that these bounds for the IV estimate did not include zero until at least 8% of the study population are defiers. The lower bound on the 95% confidence interval for this estimate did not include zero unless 4% of the population are defiers. To understand whether these are large or small values, we estimated the fraction of defiers in our data using the methods in Small et al. (2017). We estimated this quantity by fitting logistic regressions of D on X for the Z = 1 subjects and D on X for the Z = 0 subjects. We fitted these models to the matched data, and specified X as the covariates identified for additional adjustment after matching. We used these estimates and the full data to predict the probability of D = 1 under high and low values of the instrument. We then calculated the proportion of subjects, conditional on the covariates, that were exposed contrary to their instrument value. This estimated proportion of defiers in our data is 0. See the appendix for details on this sensitivity analysis.

Finally, we can also bound the global average treatment effect as a function of the extent of treatment effect heterogeneity – i.e. variation in the IV estimate due to patient population characteristics (Small et al. 2017). We calculated bounds for the global average treatment effect as a function of the difference between the average treatment effect for the always-takers and never-takers.

Table 4 contains the bounds on the global ATE as well as bootstrapped 95% confidence intervals. For a moderate level of effect heterogeneity the global treatment effect could be as small as 0.48% or as large as 4.9%. In general, the bounds are more closely aligned with a larger harmful effect in the full population.

Table 4:

Bounds for global ATE varying treatment effect heterogeneity.

Effect heterogeneity	Lower bound	Upper bound	Bootstrapped 95% CI
1.1	1.99	2.35	[1.56, 3.36]
1.5	1.40	2.94	[1.15, 4.13]
2	0.79	3.55	[0.71, 4.93]
2.5	0.24	4.10	[0.33, 5.64]
3	– 0.29	4.63	[– 0.10, 6.32]
5	– 2.25	6.59	[– 2.60, 8.87]

6 Discussion

Our study estimates the effect of EGS on mortality using both RA methods and a new IV based on a physician’s preference for operative care using medical claims data. A limitation of medical claims data in this context is the absence of risk severity measures. While both designs rely on untestable assumptions, only the IV design can identify causal effects in the presence of unobserved confounders between EGS and the outcome. We evaluated the IV assumptions using both quantitative and qualitative evidence. We found that physician preference for operative care is strongly associated with whether patients receive emergency general surgery or not. Moreover, we used near-far matching to bolster the IV assumptions. By exactly matching on hospital, we ensure that the care environment for operative and non-operative patients is highly comparable. This comparability increases the plausibility of the exclusion restriction, since it reduces the likelihood that non-operative factors affect outcomes.

Our use of near-far matching also raises methodological questions for future exploration. Much recent work has focused on methods for the evaluation of the bias across levels of the IV (Davies 2015; Jackson and Swanson 2015; Davies et al. 2017). However, this work has not explored which method is optimal for evaluating bias across levels of the IV after matching. In our analysis, we used the usual standardized difference to judge bias and to guide whether to include covariates in outcome models. Issues that could be explored include whether scaled bias estimates might be misleading in the evaluation of a near-far match, which is designed to increase instrument strength. Inferential methods such as confidence intervals (Davies et al. 2017) must also be applied more carefully after matching (Ho et al. 2007). Moreover, measures of bias are of greater concern when the covariate is correlated with the outcome (Zhao and Small 2018).

Decision-making is especially difficult for clinicians when confronted with an emergent medical condition for which there is little or no evidence to serve as a guide (Obirieze et al. 2013; Birkmeyer et al. 2013). Moreover, evidence from randomized trials on EGS effectiveness is rare given ethical considerations of equipoise and practical barriers. As such, a research design based on observational data that reduces bias from confounding by indication could improve physicians’ ability to provide appropriate patient care and to obtain informed consent when faced with management decisions in the emergency setting. In this study, the IV design indicates that analyses using RA may, in general, underestimate the risks of EGS. Given the higher risk of death in the IV design, further investigations are needed to better understand the risks of EGS.

Many prior applications of IV in health services research have been used to compare whether more aggressive levels of care yield better outcomes. For example, Lorch et al. (2012) use an IV approach to study whether delivery at a high-level NICU reduced in-hospital death. In this example, the exposure compares different levels of care intensity at the system level. In applications of this type, IV estimates frequently show a protective effect for more aggressive forms of care. In our application, we compare a more aggressive form of treatment, an operation, to non-operative management while holding the care environment constant. This suggests that the elevated risk of death from EGS may stem from the invasive aspects of surgery.

Our study has limitations. RA and IV estimate different types of effects (Angrist et al. 1996; Baiocchi et al. 2012). The IV estimate focuses on those patients that receive EGS because they had a surgeon with a higher preference for operative care. In contrast, RA estimates represent the average effect of EGS for treated patients. One avenue for future research would be to further estimate the portion of defiers using survey methods (Swanson et al. 2015). Another limitation is the limited coverage of the sample, since we only use data from three states. Another important avenue for investigation is to estimate the extent to which the IV estimates vary in populations with different incentive structures for surgeons, since our instrument is a property of the surgeon in a fee-for-service model. However, we think it is important to note that while the magnitudes of the IV and RA estimates differ, the signs of those estimates are typically the same. Thus designs based on very different assumptions produce broadly consistent results. This agreement across study designs is also an important clinical insight.

In conclusion, our study suggests that a physician’s preference for operative care is a valid instrumental variable. While the validity of our conclusions depends on strong assumptions, we presented both qualitative and quantitative evidence to validate the proposed IV. Our IV should be a useful tool for reducing bias from confounding by indication when studying surgical interventions using large observational databases where measurements are necessarily incomplete. Careful application of this IV in future studies may allow us to build an evidence base for the effectiveness of many surgical interventions.

Acknowledgements:

We thank Scott Lorch, Dylan Small and Hyunseung Kang for comments and suggestions. Conflicts of Interest and Source of Funding: RRK is funded by a grant from the National Institute on Aging, R01AG049757-01A1. The content of this manuscript is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The dataset used for this study was purchased with a grant from the Society of American Gastrointestinal and Endoscopic Surgeons. Although the AMA Physician Masterfile data is the source of the raw physician data, the tables and tabulations were prepared by the authors and do not reflect the work of the AMA. The Pennsylvania Health Cost Containment Council (PHC4) is an independent state agency responsible for addressing the problems of escalating health costs, ensuring the quality of health care, and increasing access to health care for all citizens. While PHC4 has provided data for this study, PHC4 specifically disclaims responsibility for any analyses, interpretations or conclusions. The authors declare no conflicts.

References

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455. doi:10.3386/t0136

Angrist, J. D., and Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton, NJ: Princeton University Press. doi:10.1515/9781400829828

Angus, D. C., Linde-Zwirble, W. T., Lidicker, J., Clermont, G., Carcillo, J., and Pinsky, M. R. (2001). Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Critical Care Medicine, 29(7):1303–1310. doi:10.1097/00003246-200107000-00002

Baiocchi, M., Cheng, J., and Small, D. S. (2014). Instrumental variable methods for causal inference. Statistics in Medicine, 33(13):2297–2340. doi:10.1002/sim.6128

Baiocchi, M., Small, D. S., Lorch, S., and Rosenbaum, P. R. (2010). Building a stronger instrument in an observational study of perinatal care for premature infants. Journal of the American Statistical Association, 105(492):1285–1296. doi:10.1198/jasa.2010.ap09490

Baiocchi, M., Small, D. S., Yang, L., Polsky, D., and Groeneveld, P. W. (2012). Near/far matching: a study design approach to instrumental variables. Health Services and Outcomes Research Methodology, 12(4):237–253. doi:10.1007/s10742-012-0091-0

Balke, A., and Pearl, J. (1997). Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171–1176. doi:10.1080/01621459.1997.10474074

Birkmeyer, J. D., Reames, B. N., McCulloch, P., Carr, A. J., Campbell, W. B., and Wennberg, J. E. (2013). Understanding of regional variation in the use of surgery. The Lancet, 382(9898):1121–1129. doi:10.1016/S0140-6736(13)61215-5

Bound, J., Jaeger, D., and Baker, R. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430):443–450. doi:10.1080/01621459.1995.10476536

Brookhart, M. A., and Schneeweiss, S. (2007). Preference-based instrumental variable methods for the estimation of treatment effects: assessing validity and interpreting results. The International Journal of Biostatistics, 3(1):14. doi:10.2202/1557-4679.1072

Brooks, J. M., Chrischilles, E. A., Scott, S. D., and Chen-Hardee, S. S. (2003). Was breast conserving surgery underutilized for early stage breast cancer? Instrumental variables evidence for stage II patients from Iowa. Health Services Research, 38(6p1):1385–1402. doi:10.1111/j.1475-6773.2003.00184.x

Cochran, W. G., and Rubin, D. B. (1973). Controlling bias in observational studies. Sankhyā: The Indian Journal of Statistics, Series A, 35:417–446.

Davies, N. M. (2015). Commentary: an even clearer portrait of bias in observational studies? Epidemiology (Cambridge, Mass.), 26(4):505. doi:10.1097/EDE.0000000000000302

Davies, N. M., Thomas, K. H., Taylor, A. E., Taylor, G. M., Martin, R. M., Munafò, M. R., and Windmeijer, F. (2017). How to compare instrumental variable and conventional regression analyses using negative controls and bias plots. International Journal of Epidemiology, 46(6):2067–2077. doi:10.1093/ije/dyx014

Elixhauser, A., Steiner, C., Harris, D. R., and Coffey, R. M. (1998). Comorbidity measures for use with administrative data. Medical Care, 36(1):8–27. doi:10.1097/00005650-199801000-00004

Ertefaie, A., Small, D. S., Flory, J. H., and Hennessy, S. (2017). A tutorial on the use of instrumental variables in pharmacoepidemiology. Pharmacoepidemiology and Drug Safety, 26(4):357–367. doi:10.1002/pds.4158

Hernán, M. A., and Robins, J. M. (2006). Instruments for causal inference: an epidemiologist’s dream? Epidemiology, 17(4):360–372. doi:10.1097/01.ede.0000222409.00878.37

Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3):199–236. doi:10.1093/pan/mpl013

Iwashyna, T. J., Odden, A., Rohde, J., Bonham, C., Kuhn, L., Malani, P., Chen, L., and Flanders, S. (2014). Identifying patients with severe sepsis using administrative claims: patient-level validation of the Angus implementation of the international consensus conference definition of severe sepsis. Medical Care, 52(6):e39. doi:10.1097/MLR.0b013e318268ac86

Jackson, J. W., and Swanson, S. A. (2015). Toward a clearer portrayal of confounding bias in instrumental variable applications. Epidemiology, 26(4):498. doi:10.1097/EDE.0000000000000287

Kim, D. H., and Schneeweiss, S. (2014). Measuring frailty using claims data for pharmacoepidemiologic studies of mortality in older adults: evidence and recommendations. Pharmacoepidemiology and Drug Safety, 23(9):891–901. doi:10.1002/pds.3674

Lorch, et al. (2012). The differential impact of delivery hospital on the outcomes of premature infants. Pediatrics, 130(2):270–278. doi:10.1542/peds.2011-2820

Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta), 2(1):49–55.

Manski, C. F. (1990). Nonparametric bounds on treatment effects. The American Economic Review Papers and Proceedings, 80(2):319–323.

Obirieze, A. C., Kisat, M., Hicks, C. W., Oyetunji, T. A., Schneider, E. B., Gaskin, D. J., Haut, E. R., Efron, D. T., Cornwell III, E. E., and Haider, A. H. (2013). State by state variation in emergency versus elective colon resections: room for improvement. The Journal of Trauma and Acute Care Surgery, 74(5):1286. doi:10.1097/01586154-201305000-00015

Pimentel, S. D., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2015). Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527. doi:10.1080/01621459.2014.997879

Rosenbaum, P. R., Ross, R. N., and Silber, J. H. (2007). Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. Journal of the American Statistical Association, 102(477):75–83. doi:10.1198/016214506000001059

Rubin, D. B. (1980). Bias reduction using Mahalanobis-metric matching. Biometrics, 36(2):293–298. doi:10.1017/CBO9780511810725.014

Rubin, D. B. (1986). Which ifs have causal answers. Journal of the American Statistical Association, 81(396):961–962. doi:10.1080/01621459.1986.10478355

Rosenbaum, P. R. (2010). Design of Observational Studies. New York: Springer-Verlag. doi:10.1007/978-1-4419-1213-8

Rosenbaum, P. R. (2012). Optimal matching of an optimally chosen subset in observational studies. Journal of Computational and Graphical Statistics, 21(1):57–71. doi:10.1198/jcgs.2011.09219

Shafi, S., Aboutanos, M. B., Agarwal Jr, S., Brown, C. V., Crandall, M., Feliciano, D. V., Guillamondegui, O., Haider, A., Inaba, K., Osler, T. M., et al. (2013). Emergency general surgery: definition and estimated burden of disease. Journal of Trauma and Acute Care Surgery, 74(4):1092–1097. doi:10.1097/TA.0b013e31827e1bc7

Silber, J. H., Rosenbaum, P. R., Trudeau, M. E., Even-Shoshan, O., Chen, W., Zhang, X., and Mosher, R. E. (2001). Multivariate matching and bias reduction in the surgical outcomes study. Medical Care, 39(10):1048–1064. doi:10.1097/00005650-200110000-00003

Small, D., and Rosenbaum, P. R. (2008). War and Wages: the strength of instrumental variables and their sensitivity to unobserved biases. Journal of the American Statistical Association, 103(483):924–933. doi:10.1198/016214507000001247

Small, D., Tan, Z., Ramsahai, R., Lorch, S., and Brookhart, A. (2017). Instrumental variable estimation with a stochastic monotonicity assumption. Statistical Science, In Press. doi:10.1214/17-STS623

Stock, J. H., and Yogo, M. (2005). Testing for weak instruments in linear IV regression. In Identification and Inference in Econometric Models: Essays in Honor of Thomas Rothenberg, edited by D. W. Andrews and J. H. Stock. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511614491

Stukel, T. A., Fisher, E. S., Wennberg, D. E., Alter, D. A., Gottlieb, D. J., and Vermeulen, M. J. (2007). Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. JAMA, 297(3):278–285. doi:10.1001/jama.297.3.278

Swanson, S. A., and Hernán, M. A. (2013). Commentary: how to report instrumental variable analyses (suggestions welcome). Epidemiology, 24(3):370–374. doi:10.1097/EDE.0b013e31828d0590

Swanson, S. A., and Hernán, M. A. (2014). Think globally, act globally: an epidemiologist’s perspective on instrumental variable estimation. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 29(3):371. doi:10.1214/14-STS491

Swanson, S. A., and Hernán, M. A. (2017). The challenging interpretation of instrumental variable estimates under monotonicity. International Journal of Epidemiology, 47(4):1289–1297. doi:10.1093/ije/dyx038

Swanson, S. A., Miller, M., Robins, J. M., and Hernán, M. A. (2015). Definition and evaluation of the monotonicity condition for preference-based instruments. Epidemiology (Cambridge, Mass.), 26(3):414. doi:10.1097/EDE.0000000000000279

Yang, D., Small, D. S., Silber, J. H., and Rosenbaum, P. R. (2012). Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics, 68(2):628–636. doi:10.1111/j.1541-0420.2011.01691.x

Zhao, Q., and Small, D. S. (2018). Graphical diagnosis of confounding bias in instrumental variables analysis. Epidemiology, 29(4):e29–e31. doi:10.1097/EDE.0000000000000822

Supplementary Material

The online version of this article offers supplementary material (DOI:https://doi.org/10.1515/em-2017-0012).

Received: 2017-09-11

Revised: 2018-02-22

Accepted: 2018-03-05

Published Online: 2018-10-02

Keywords for this article

Instrumental variables; emergency surgery; research design