
The necessity of construct and external validity for deductive causal inference

Kevin M. Esterling, David Brady and Eric Schwitzgebel
Published/Copyright: February 21, 2025

Abstract

The Credibility Revolution advances internally valid research designs intended to identify causal effects from quantitative data. The ensuing emphasis on internal validity, however, has enabled a neglect of construct and external validity. We show that ignoring construct and external validity within identification strategies undermines the Credibility Revolution’s own goal of understanding causality deductively. Without assumptions regarding construct validity, one cannot accurately label the cause or outcome. Without assumptions regarding external validity, one cannot label the conditions enabling the cause to have an effect. If any of the assumptions regarding internal, construct, and external validity are missing, the claim is not deductively supported. The critical role of theoretical and substantive knowledge in deductive causal inference is illuminated by making such assumptions explicit. This article critically reviews approaches to identification in causal inference while developing a framework called causal specification. Causal specification augments existing identification strategies to enable and justify deductive, generalized claims about causes and effects. In the process, we review a variety of developments in the philosophy of science and causality and interdisciplinary social science methodology.

MSC 2010: 03A05

1 Introduction

The past few decades have witnessed the rise of the “Credibility Revolution” in quantitative causal inference [1–7]. Researchers influenced by the Credibility Revolution advocate for a design-based approach to evaluating causal effects. Arguably, this approach has come to dominate quantitative research in the social sciences and in policy evaluation [8,9].[1] The design-based approach is at the core of popular textbooks on quantitative research design for PhD and MPP training [10–12]. To recognize the impact of this approach on quantitative social science, two of the key innovators of the Credibility Revolution, Joshua Angrist and Guido Imbens, were awarded the Nobel Prize in 2021. In its announcement, the Nobel Committee stated that the Credibility Revolution “improved researchers’ ability to answer causal questions of great importance for economic and social policy [...]”.

The Credibility Revolution has contributed to the study of causality in at least two important ways [9]. First, the field promotes a deductive approach to discovering causal effects by requiring that assumptions be made explicit. In this tradition, a causal claim is said to be “credible” when the connection between a statistical result and a causal effect of interest deductively follows from transparently stated assumptions [9,14,15]. In a deductive view, claims regarding causality are not justified by data and analysis alone.[2] Instead, such claims require an explicit statement of the assumptions that deductively guarantee a causal conclusion.

Second, tightly related to the first contribution, the Credibility Revolution advances designs that warrant assumptions sufficient to “identify” causal effects from observed data [2–5]. In particular, the field advances designs that warrant an assumption of internal validity. While we will highlight the randomized controlled trial (RCT) design, this tradition also advances natural experimental designs such as instrumental variables, matching, difference-in-differences, and regression discontinuity that each posit different assumptions [16]. Each of these designs seeks to ensure that the contrast of interest to the research question is not confounded and hence is internally valid. Given the design assumptions, the quantitative analyst can compare aggregate outcomes between counterfactual states of the world to detect the presence of a cause.

While causal identification strategies primarily focus on internal validity, they routinely omit considerations of construct and external validity [17, p. 366].[3] This article reviews the foundations of identification for quantitative causal inference and demonstrates that this neglect in the literature results in a substantial inferential gap. Identifying a causal effect among measured variables in an internally valid research design is insufficient on its own to ensure a deductive claim about what actually causes what. Without construct validity, one risks mislabeling the cause or outcome [18]. Without external validity, one risks misunderstanding the conditions enabling the cause to have an effect [19]. Thus, the causal inference literature must incorporate assumptions regarding construct and external validity within identification strategies to enable deductive, generalized claims about causes and effects.

As even Angrist and Pischke [1, p. 23] acknowledge, the most that internal validity allows the analyst to deduce is the simple fact that some cause occurred in the course of an experiment. But internal validity sheds no light on what were the cause and effect, or why the effect occurred. As a result, internal validity provides no guidance on how the presence of causality shown in a given study or across studies can accumulate into deductively formed knowledge. Instead, in their view, any substantive claims about what was the cause, what was the effect, and what were the enabling conditions can at best be – in their words – “speculative” (see also [20, p. 959]). Without explicit assumptions about construct and external validity, however, such speculations remain tentative in a way that undermines the deductive “credibility” of a causal claim, contradicting the very purpose of the Credibility Revolution.

In practice, applied researchers – even leading proponents of the Credibility Revolution – routinely make general causal claims without explicitly declaring what is speculation, and thus fail to make their assumptions transparent or their causal claims deductively “credible.” For example, Gerber et al. [21, p. 33] claim that mailing postcards that reveal one’s voting history to neighbors “demonstrate[s] the profound importance of social pressure as an inducement to political participation.” Likewise, Angrist et al. [22, p. 858] claim that a specific charter school design “generated substantial score gains,” attributing causality to the school design itself. Based on RCTs of an anti-poverty program, Banerjee et al. [23] claim: “It is possible to make sustainable improvements in the economic status of the poor with a relatively short-term intervention.” These authors make claims about actual causes and actual effects without sufficiently clarifying that they had intended these internally valid but not necessarily construct or externally valid claims to rely on – and knowledge in their field to accumulate from – “speculation.”

Using insights from the philosophy of science, this article provides a critique that aims to close this gap in the literature on causal inference and the Credibility Revolution. We explain how, whether the analyst recognizes it or not, any deductively “credible” generalized causal claim requires assumptions of construct and external validity – assumptions that ultimately derive from theory [26–29] and substantive knowledge [30,31]. Thus, deduction in quantitative causal inference fundamentally depends on a combination of theory and descriptively, historically, or qualitatively derived substantive knowledge along with any design-based statistical evidence.

To be clear: the argument of this article is not simply a repetition of the widespread view that “construct and external validity are important.” We fully agree with Shadish et al. [18] that each type of validity is necessary for causal generalization. But our argument is more specific and goes beyond that understanding. We demonstrate that, like internal validity, construct and external validity are also necessary to preserve the deductiveness of a causal claim.[4] Nor do we take a position on whether researchers do or do not ignore construct or external validity in practice. Even if the discussion of these types of validities were pervasive, the current understanding of their necessity for deduction is absent or at best implicit.[5] We show that neglecting the assumptions for construct and external validity undermines deduction – i.e., the very task scholars in the Credibility Revolution set out to accomplish.

To demonstrate the consequences of neglecting construct and external validity, we develop a framework we call causal specification.[6] This framework formalizes assumptions regarding internal, construct, and external validity within a single causal expression and shows that causal deduction requires a rebalancing that equally values all three validities. Within a deductive understanding of causality, internal validity has no special status or lexical priority. The causal specification framework explicitly recognizes the contributions of theory and substantive knowledge to quantitative causal inference, and charts a way forward for social scientists who aim to make deductive causal generalizations.

2 Credibility Revolution

Many of the advances in the Credibility Revolution have been governed by one of two comprehensive frameworks for causal inference: the potential outcomes framework, also known as the “Rubin causal model” (RCM) [20], and the structural causal model (SCM) framework developed by Pearl [3]. In either framework, a causal effect is defined by comparing counterfactual outcomes – i.e., what would have happened if the cause had been present versus absent, while everything else had remained the same [36–38]. Because it is not possible to observe events that do not actually occur, at least half of the relevant potential outcomes remain unobserved post-intervention. A causal effect is said to be identified only if the effect described in counterfactual terms can be deduced from stated assumptions and the data, in the ideal circumstances where the analyst had an infinite amount of data [9,39,40]. Because it involves inference from observed data to unobserved counterfactuals, identification requires a set of assumptions about the relation between the observed data and the causal effect of interest.

We illustrate identification within the RCT design using a fictional vignette in which The Gold Standard Lab (GSL) undertakes a quantitative evaluation of the causal effect of an intervention aiming to increase juror turnout.[7] Jury service is a form of democratic participation and is a political right that governments can coerce [41]. Of course, not everyone who receives a summons actually shows up at the courthouse for jury service, so courts routinely seek low-cost methods to increase the yield for jury summonses [42]. Inspired by get out the vote (GOTV) studies (e.g., refs [43,44]), GSL exposes residents in the court’s jurisdiction to different messages using jury summons reminder postcards (replicating [45]) to evaluate the messages’ causal effect.

Fortunately, the researchers in GSL are well trained in design-based causal inference, and so they conduct a gold-plated RCT. GSL asks the Riverside (California) County Superior Court to mail official government postcards to residents who recently received a jury summons, randomizing so that half receive a standard reminder postcard and the other half receive a postcard indicating that failure to appear could result in fines or imprisonment. The “enforcement” condition results in a statistically significant 10% increase in turnout relative to the “reminder” condition – an effect size more than 20 times that typically found in GOTV postcard experiments.

Given these strong results, GSL recommends courts adopt the enforcement message as a policy, and they publish an article containing the causal generalization: “Enforcement messages cause juror turnout.” Eager to demonstrate the efficacy of the enforcement message in other contexts, GSL next collaborates with the superior court in Orange County, California – a more affluent adjacent county – to implement the identical gold-plated evaluation. Much to their surprise, the enforcement postcard shows no treatment effect, from which they reason that affluence suppresses the enforcement message effect.

The GSL design very much adheres to the identification strategies of the Credibility Revolution. We can formalize the GSL study design and analysis as follows. There are $n$ units indexed by $i \in \{1, \ldots, n\}$. Throughout, we assume that each unit $i$ is a randomly selected unit of the sampling frame $S$ (on the role of sampling in RCTs, see [46]). We assume a binary representation of the causal variable, where $A = 1$ indicates the treatment state (having received a postcard with the enforcement message) and $A = 0$ indicates the control state (having received a postcard with the reminder message). $B$ is a binary outcome variable, where $B_i = 1$ if the unit reported for jury duty and $B_i = 0$ if they did not. $B_i(A=1)$ represents the counterfactual outcome $B$ conditional on unit $i$’s being in the treatment state, regardless of which postcard the unit actually received. Likewise, $B_i(A=0)$ represents the counterfactual outcome conditional on that unit being in the control state. For every unit of analysis $i$, these terms represent the potential outcome $B$ that would have occurred had $A = 1$ occurred or, respectively, not occurred.

In an RCT, internal validity is present if the experimental units’ assignment to treatment or control is unrelated to their outcomes in either state, represented formally as

(1) $[B_i(A=1), B_i(A=0)] \perp A_i$.

The symbol $\perp$ means “is independent of.” A weaker version of (1) only requires the claim to be true within strata of covariates. The assumption in (1) implies that the units in treatment and control have identical distributions of potential outcomes in expectation, and hence, each group can supply the missing counterfactual for the other. It follows that the “average treatment effect” (ATE) estimand is identified and can be estimated using the observed data:

(2) $\mathrm{ATE} = E_S[B_i(A_i = 1)] - E_S[B_i(A_i = 0)]$,

where $E_S$ is the expectation over $S$ and estimation is over the observed data.

The right-hand side of (2) is the estimand of the $\mathrm{ATE}$ since, under assumption (1), the expected difference is not driven by bias that would otherwise occur from confounding [10, p. 38]. Under assumption (1), the only systematic difference between those in the $A = 1$ condition and those in the $A = 0$ condition is their exposure to $A$. Researchers can interpret the estimated statistical relationship between $A$ and $B$ as a counterfactual causal claim of identifying the $\mathrm{ATE}$ if assumption (1), the core assumption of internal validity, is true.[8]
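To make the mechanics of (1) and (2) concrete, the following is a minimal simulation sketch (ours, for illustration only; the variable names and the 0.10 effect size are assumptions echoing the vignette, not GSL data). Randomization makes $A$ independent of the potential outcomes, so the difference in observed group means recovers the ATE:

```python
# Illustrative sketch: identification of the ATE under assumption (1).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent propensity to serve, shared by both potential outcomes.
propensity = rng.uniform(size=n)
B0 = (rng.uniform(size=n) < propensity).astype(int)                          # B_i(A=0)
B1 = (rng.uniform(size=n) < np.minimum(propensity + 0.10, 1.0)).astype(int)  # B_i(A=1)

# Random assignment: A is independent of (B1, B0), as assumption (1) requires.
A = rng.integers(0, 2, size=n)
B = np.where(A == 1, B1, B0)  # only one potential outcome is ever observed

ate_true = (B1 - B0).mean()
ate_hat = B[A == 1].mean() - B[A == 0].mean()  # sample analog of estimand (2)
print(f"true ATE: {ate_true:.3f}, estimated ATE: {ate_hat:.3f}")
```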

We focus on RCTs where the assumption of internal validity is well justified by the randomization of unit assignments, i.e., randomization renders the assumption of internal validity relatively weak. The claim to have identified a causal effect, however, in no way depends on the strength of the assumptions associated with any specific research design. The conclusion that a design identified a causal effect deductively follows from the premises encoded in the assumptions laid out in the formal apparatus of identification, such as selection on observables for matching and regression, continuity for regression discontinuity, or the parallel path assumption for difference-in-differences [3,16]. Once the assumptions are made, the conclusion follows deductively.

The Credibility Revolution was motivated by previous generations’ naïve reliance on regression models of observational data to test for causality. In that context, applied researchers invoked verbal assurances that they included all needed control variables. These assurances typically strained credulity [8]. In turn, many researchers developed what Stokes [47] refers to as “radical skepticism” about unobserved confounders. We develop our arguments with the RCT design so that we can focus on the problems for causal inference even when one has near-perfect internal validity. Radical skepticism was an excess. However, the skeptical thinking that helped motivate the focus on internal validity equally justifies concerns about construct validity and external validity, as will become evident in our framework and discussion below.

3 Validity and causal deduction

The literature informed by the Credibility Revolution takes the variables that are actually measured as fundamental for understanding causality. As Holland [20] notes, the variables that actually are measured in a scientific procedure – such as $A$ and $B$ – are “primitive” in the RCM, and hence, counterfactuals and identification are each with respect to the measured variables. Researchers can identify that some cause occurred in a given experiment when the expected value of the variable $B$ differs between counterfactual states characterized by $A$. Equally importantly, identification strategies often leave the scope of generalization vague or implicitly local to the experimental setting.

Nevertheless, researchers rarely interpret their findings in terms of only the measured variables and are often not explicit about their limited scope of generalization. Instead, scholars typically interpret their findings in terms of general causes and general effects that generalize beyond their study’s setting. For instance, GSL might claim “enforcement messages” cause “juror turnout.” As Shadish et al. [18] forcefully argue, researchers’ semantic statements are virtually always in terms of underlying causes and effects and virtually never in terms of the measured variables themselves [48]. Furthermore, all experiments are embedded in a set of conditions that also matter to causality [19]. To be consistent with the intent of the Credibility Revolution, causal claims must be deductively true at the semantic level and hence must be supported by a complete set of assumptions.

Causes, effects, and the conditions in settings are referents in nature, and claims regarding their relationships are ontological. As a result, a deductively correct generalized causal claim requires not only the identification of a causal effect between measured variables (i.e., internal validity) but also the correct specification of the relationship between the measured variables and the actual cause and the actual effect (i.e., construct validity) and correct specification of the scope of the generalization (i.e., external validity).

Accurately specifying which aspects of the measured intervention had which relevant effects is essential to deducing a generalized causal claim [50–54]. Consider a typical causal process about which researchers wish to make a causal inference and claim. In any experiment, the intervention will contain a (potentially null) set of causes, which we label “active ingredients,” along with other elements that are not causal, or “inert ingredients” [55]. Likewise, the outcome will contain elements that are of interest (“the disease”) and not of direct interest (“symptoms”). We refer to the intervention’s active ingredients and the outcome of interest as the causal “relata.” GSL described the manipulated active ingredient as the “enforcement message,” and “juror turnout” as the outcome of interest. The active ingredients in the intervention will be supported or countered by conditions in the setting [56,57]. For GSL, conditions such as affluence are perhaps present in coastal Orange County but not in inland Riverside County.

Using a classic notation from philosophy, we formalize the (metaphysical) causal relata and conditions using the following simple causal claim:

(3) $\alpha$ causes $\beta$ in $\gamma$,

where $\alpha$ and $\beta$ are types or classes of ontological events, which can be represented by random variables, and $\gamma$ is a set of supporting or countering conditions. We assume that causation is a relation between individual concrete events (see Schaffer et al. [58], for a review of metaphysical alternatives), indicated with subscript $i$. For GSL, $\alpha$ is the class of events in which summoned jurors receive a postcard with such-and-such a content, and $\alpha_i$ is an individual event of a particular summoned juror receiving a particular postcard. Following convention in philosophy, we use quotation marks to designate a semantic claim (i.e., a verbal claim about facts in nature, as opposed to actual facts in nature). A particular causal claim is a historical, hypothetical, or predictive statement about the relationship between two individual events: “$\alpha_i$ caused (or would have caused or will cause) $\beta_i$.” A general causal claim or causal generalization asserts a pattern among the relata, that “events of type $\alpha$ cause events of type $\beta$.”

Often, as we have just done, the conditions or settings in which the generalization holds are left implicit or unstated. Although some causal generalizations in physics might be truly universal ($\alpha$ causes $\beta$ whenever and wherever $\alpha$ occurs), in social science, causal generalizations are nearly always realistically restricted to settings with relevant local conditions, $\gamma$, that are often vaguely stated or understood [19,31].[9] For example, GSL might only expect postcards to work in functioning democracies and among literate participants, even if they did not explicitly say so.

The statement “$\alpha$ causes $\beta$ in $\gamma$” is a generalization: it is a claim that one thing generally causes another in certain conditions [59]. We say that a generalized causal claim is valid if the claim is true, i.e., if the purported cause and purported effect are the actual cause and actual effect. Our definition of validity comes from measurement theory originating in Kelly [60] and is consistent with Borsboom et al. [61].[10] In our framework, the general causal claim that “$\alpha$ causes $\beta$ in $\gamma$” is valid if it is indeed the case that $\alpha$ causes $\beta$ in $\gamma$. Hence, validity is a relationship between a claim and the world – the relationship that holds if and only if the claim correctly reflects reality (cf. ref [18, p. 35]).

There are exactly four ways in which the causal generalization “$\alpha$ causes $\beta$ in $\gamma$” can be invalid, corresponding to the four parts of the claim:

  (i) $\alpha$

  (ii) causes

  (iii) $\beta$

  (iv) in $\gamma$.

Something might cause $\beta$ in $\gamma$, but that something might not be events of type $\alpha$ (falsity in part i). Events of type $\alpha$ might cause something in $\gamma$, but that something might not be events of type $\beta$ (falsity in part iii). Events of type $\alpha$ might cause events of type $\beta$ in some conditions, but not in $\gamma$ (falsity in part iv). Or events of type $\alpha$ might be related to events of type $\beta$ across conditions $\gamma$, but the relationship might not be a directional causal relationship of the sort claimed (falsity in part ii).

To illustrate, consider how GSL’s initial causal generalization, “Enforcement messages cause juror turnout,” might fail:

  1. Their claim might fail because the researchers misconstrued the nature of the cause, assigning an incorrect semantic label. The postcards’ effect might not be due to the specific words but rather to the fact that the text was longer in the enforcement message than in the reminder message.

  2. Their claim might fail internally, due to chance or poor experimental design. Maybe jurors who were already planning to appear at court disproportionately received the threatening postcards.

  3. Their claim might fail because the researchers misconstrued the nature of the outcome, assigning an incorrect semantic label. Maybe juror turnout was mismeasured – for example, if excused absences were classified as successful recruitments.

  4. Their claim might fail because the researchers mischaracterized how broadly their claim generalizes. The claim invites the reader to generalize across the US (though unfortunately this remains vague), but perhaps it only works in certain communities.

Generalizations always go beyond the scientific evidence. Researchers will have witnessed only a finite number of events in specific times and places. To make general causal claims that are meaningful to others, researchers must make a causal inference, moving from the evidence to a causal claim on the basis of theory, design-based procedures, substantive knowledge, and even common sense – i.e., assumptions combined with the evidence [15,17,25]. A causal generalization results from an inferential leap to the conclusion that in general, under conditions $\gamma$, $\alpha$-type events cause (or often enough cause, or cause absent interference) $\beta$-type events. Such an inference may or may not be warranted, but without an inference, even a study in which causality occurred could not support a general causal claim.[11]

Now consider how one deduces generalized causal claims using evidence from an experiment. Our framework distinguishes active from inert ingredients as events that are bundled within the measures of the intervention and outcome, i.e., we do not take measured variables as the “primitives” of the analysis. To clarify this distinction, we label the ontological elements of the bundles (the causes, effects, and conditions of interest, plus inert ingredients) with lowercase Greek letters, and we label the measured bundles with uppercase Latin letters. In the fully binary case,

(4a) $A \equiv \{\alpha \wedge \theta_\alpha\}$,

(4b) $B \equiv \{\beta \vee \theta_\beta\}$,

(4c) $C \equiv \{\gamma \wedge \theta_\gamma\}$.

Note that $\equiv$ means “is definitionally equal to,” $\wedge$ logically means “and” (requiring both elements to be true for the expression to be true), and $\vee$ logically means “or” (requiring one or both elements to be true for the expression to be true). In the binary case, each variable can be either true or false.

The Latin letters $A$, $B$, and $C$ are observed measures of the intervention, outcome, and conditions in settings, respectively, here assumed to be measured without error [65]. The Greek letters represent hypothesized causes, effects, causal conditions, and inert elements, which combine into informationally equivalent sets [66]. The elements $\{\alpha, \gamma\}$ are “active ingredients” that have a causal effect on the outcome of interest $\beta$. The elements $\{\theta_\alpha, \theta_\gamma\}$ are “inert ingredients,” and $\theta_\beta$ is a related outcome that is not of interest. For simplicity, we omit interactions between elements of the sets.[12]

This notation makes clear that measured variables are inherently bundles of the ontological components of interest [53]. In particular, since active and inert ingredients $\{\alpha, \theta_\alpha\}$ are bundled in the intervention, removing the inert ingredient $\theta_\alpha$ in equation (4a) would make $A$ false. For example, in the first GSL trial, the postcards bundled the enforcement message with an emotional tone, numerals related to the relevant statute, amount of ink, and sentence length and complexity (all varying at least slightly between treatment and control). Likewise, $C$ bundles all of the conditions in the setting $\{\gamma, \theta_\gamma\}$ that remain constant. These include design elements that are identical for treatment and control (e.g., the cardstock and court seal), attributes of the units that are assumed to be balanced through randomization (e.g., employment status of the recipient), and features of the context (e.g., Riverside during the rainy season). Typically, neither these conditions nor $C$ itself is literally “measured” beyond disclosures of experimental procedures, units, and the setting of the RCT.
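The bundling in (4a) can be made vivid with a short sketch (our illustration; the ingredient names are hypothetical). Because the active and inert ingredients enter the design as one bundle, they are perfectly collinear, and no analysis of the trial’s own data can attribute the effect to one rather than the other:

```python
# Illustrative sketch of definition (4a): alpha (active) and theta_alpha (inert)
# are bundled in the measured intervention A, so they covary perfectly.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

A = rng.integers(0, 2, size=n)  # measured intervention bundle
alpha = A                       # active ingredient (say, the enforcement wording)
theta_alpha = A                 # inert ingredient (say, the longer sentences)

# The outcome responds only to alpha, but alpha and theta_alpha are identical
# columns: any contrast that "identifies" one equally "identifies" the other.
B = (rng.uniform(size=n) < 0.20 + 0.10 * alpha).astype(int)

print(np.corrcoef(alpha, theta_alpha)[0, 1])  # 1.0: informationally equivalent
print(B[A == 1].mean() - B[A == 0].mean())    # the ATE, attributable to either label
```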

The disjunction in (4b) represents that the measurement of an outcome might reflect the real outcome of interest, $\beta$, or instead a related event not of direct interest, $\theta_\beta$ (e.g., a legally valid request for excuse). In most studies, $\beta$ itself cannot be measured in isolation but must be inferred from self-reports or records ($B$) assumed to stand in some felicitous causal relation to $\beta$. For simplicity, our notation omits cases in which $\beta$ occurs but remains unmeasured, modeling the risk of false positives but not false negatives.

In an RCT, the evidence is limited to the measured variable bundles. Typically, however, researchers’ causal claims reference the ontological events here represented as Greek letters (“enforcement message” and “turnout”), not the measured bundles characterized by Latin letters (“A” and “B”). The definition in (4) makes explicit that the Greek-letter reality behind the Latin-letter measures remains a matter of inference, and this is true even when the causal estimand has internal validity (as in Keele and Minozzi [16]). Under our assumptions, internal validity and textbook identification establish the following specific causal claim as a fact about what has actually been manipulated and measured:

(5) “$\mathrm{ATE}_C = E_S[B_i(A_i = 1, C)] - E_S[B_i(A_i = 0, C)]$”,

where $C$ bundles the conditions in the setting, which are either balanced or constant across units. The quotes make it explicit that (5) is a claim. Thus, internal validity only warrants claims with respect to the Latin-letter variables – literally only that “the average treatment effect is a causal relationship between variables $A$ and $B$ in setting $C$” – not the Greek-letter causal relata and causal conditions that are present in the world and that are of substantive interest. It follows from definition (4) that the fact of claim (5) is not the same as the (generalized) causal process of interest $\tau$, which is expressed semantically as counterfactuals regarding states of $\alpha$ and $\gamma$,

(6) $\tau_\gamma = E_S[\beta_i(\alpha_i = 1, \gamma)] - E_S[\beta_i(\alpha_i = 0, \gamma)]$,

i.e., the substantive, semantic claim about the ontological process of interest – “enforcement messages cause turnout in the absence of affluence.” As our notation makes plain, even if an identification strategy justifies claim (5), that by itself does not deductively justify making the claim about the causal process meaningfully expressed in claim (6). To see this, we expand claim (5) using definition (4) to the equivalent statement,

(7) $\mathrm{ATE} = E_S\{[\beta_i \vee \theta_{\beta i}]([\alpha_i \wedge \theta_{\alpha i}] = 1, [\gamma \wedge \theta_\gamma])\} - E_S\{[\beta_i \vee \theta_{\beta i}]([\alpha_i \wedge \theta_{\alpha i}] = 0, [\gamma \wedge \theta_\gamma])\}$.

Plainly, moving from identifying the $\mathrm{ATE}$ in (5) to deducing the generalized causal effect $\tau$ in (6) requires an extensive set of assumptions about claim (7) that go beyond the assumption of internal validity. Comparing claim (5) to claim (7), expanding $A$ problematizes construct validity of the cause; expanding $B$ problematizes construct validity of the outcome; and expanding $C$ problematizes external validity.
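For readers who want the gap stated as premises, here is one way – our illustrative reconstruction, not a canonical list from the literature – to enumerate assumptions under which the identified claim (5) would license the generalized claim (6):

```latex
% Illustrative premises sufficient to move from claim (5) to claim (6),
% stated in the binary notation of definition (4):
\begin{enumerate}
  \item Internal validity: $[B_i(A=1), B_i(A=0)] \perp A_i$, so the
        $\mathrm{ATE}_C$ in (5) is identified.
  \item Construct validity of the cause: $\theta_\alpha$ is causally inert,
        so any effect of the bundle $A \equiv \{\alpha \wedge \theta_\alpha\}$
        is an effect of $\alpha$.
  \item Construct validity of the outcome: $\theta_\beta$ does not occur
        differentially across arms, so differences in
        $B \equiv \{\beta \vee \theta_\beta\}$ reflect differences in $\beta$.
  \item External validity: the enabling elements of
        $C \equiv \{\gamma \wedge \theta_\gamma\}$ are correctly labeled
        ``$\gamma$,'' and $\theta_\gamma$ neither enables nor counters the effect.
\end{enumerate}
% Under premises 1--4, $\mathrm{ATE}_C$ in (5) coincides with $\tau_\gamma$ in (6).
```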

Of course, if researchers literally only care about the measured variables $A$ and $B$ as manifested in the exact setting $C$, they need not make assumptions about the relationships between $A$ and $\alpha$, $B$ and $\beta$, and $C$ and $\gamma$. However, they would then be unable to communicate the meaning of their results beyond saying “Whatever it was we did (designated by the symbol $A$), at that one time and place (designated by the symbol $C$), had an effect on whatever it was we measured (designated by the symbol $B$)” – a claim that would never be published and indeed is not even a generalization [35]. Since textbook identification strategies are only with respect to the symbolic representations that indicate measured variables, identification alone can only deductively license a claim such as (5). Claim (5) does not license the meaningful semantic claim (6) that is the verbal statement of interest in applied research [55].

Note that a claim such as (5) can be the result of a deductive test, under assumptions of internal validity, using standard procedures of causal identification. It can be what Gelman and Imbens [14] call a “what if” rather than a “why”-type assessment. In moving implicitly from (5) to (6), however, the researcher transforms a “what if” question among measured variables into a speculative or exploratory “why” accounting of the relata and conditions driving the statistical patterns that support claim (5). This is the case even when claim (5) is identified. When researchers make a generalized causal claim without addressing the relationship between the Latin-letter measured variables and the Greek-letter causal relata and conditions, they are simply hazarding an exploratory guess or “structured speculation” about the underlying causality [67], with the hope that causal knowledge will somehow accumulate coherently from a sequence of results such as (5) [1, p. 23]. This hope for accumulation of knowledge under speculation is similar to when earlier generations of applied statisticians made exploratory guesses and offered verbal assurances about control variables achieving internal validity.

The authors of the Credibility Revolution partially acknowledge this limitation and attempt to place boundaries on claims that are “credibly” established under internal validity through a familiar saying: identification can deductively establish the effects of a cause – that is, establish the fact that a cause occurred by comparing outcomes across counterfactual states of the world – but not the causes of effects, which would require naming the actual causes [20]. This distinction fails to establish these bounds, however. Indeed, communicating any description of the counterfactual states over which a cause and effect are detected requires labeling those states in some way. Even if those labels are abstract or generic, this still requires an assumption of construct validity. And to assume the cause will occur in any other time or place, or even generically assume that causal effects are homogeneous, requires an assumption of external validity. Even if one were to try to communicate a finding modestly as an “effects of a cause”-type claim, that would not relieve one from considering these aspects of validity.

Because identification is only with respect to measured variables in one setting, identification does not address parts (i), (iii), or (iv) of a generalized causal claim, i.e., the labeling of the relata and the conditions under which the cause will occur. Since a causal generalization is not deductively valid unless all four aspects of the claim are correct, causal identification does not provide sufficient assumptions to deduce a generalized causal claim. Thus, we augment causal identification with what we call causal specification, which formalizes the insights of Shadish et al. [18] into the single expression of claim (7), showing the equal importance of each type of validity in enabling deductive causal claims. In causal specification, construct validity is present when $\alpha$ and $\beta$ are correctly specified, and external validity is present when $\gamma$ is correctly specified.

Causal specification clarifies the intimate relationship between each type of validity and deductive causal inference. Recall that we say a causal claim is “valid” if the claim accurately reflects the ontological causal process found in nature. Since the underlying causal generalization is never directly observed, the deduction can only follow from the assumptions corresponding to internal, construct, and external validity. If any one of the assumptions is missing, the conclusion is not deductively supported. But note as well that if any of the assumptions is false, the claim might follow deductively from the assumptions (and thus be “deductively valid” as commonly used in philosophy) even though it is not a valid causal claim. For causal knowledge to accumulate, the best that the researcher can do is to provide sufficient warrants for each assumption – using theory, research design, qualitative knowledge, or justified intuition.

Causal specification also calls into question any assertion that internal validity has priority over construct or external validity. Such a claim dates to Campbell [68, p. 297], who described internal validity as a “basic minimum” for science; comparing internal to external validity, he writes, “Internal validity is the prior and indispensable consideration.” For a more recent statement, see Guala [69, p. 1198]. The reasoning holds that if an experiment is confounded, nothing can be learned from it, and hence, internal validity has a lexical priority over considerations regarding the other types of validity.

However, in the present framework, deductive validity does not require that the validities must be considered in any specific sequence or ranked in their priorities, or that internal validity must be present before one can consider the other types of validity. For example, it is unclear how one can say that knowing an internally valid (directional, unconfounded) cause occurred must precede knowing what was the cause and what was the effect (i.e., construct validity). Similarly, it is also unclear how one can say knowing an internally valid cause occurred can precede knowing whether the setting has the necessary conditions to enable the cause (i.e., external validity). Because all three validities are necessary, no causal generalization can be successfully deduced without all three. None stands separate and prior.

Table 1 summarizes the key concepts we introduce in this section. In the remainder of this review article, we survey the concepts of construct and external validity as they fit within the framework of causal specification, showing the necessity of each type of assumption to preserve the deductiveness of causal claims. The formal approach using the potential outcomes framework can be extended for each of internal, construct, and external validity. However, we confine those formalizations to Appendix A. In addition, for interested readers, Appendix B shows our arguments in an SCM framework. These appendices can be skipped without loss of continuity.

Table 1

Key terms and definitions

Causal generalization: A claim of the form “$\alpha$ causes $\beta$ in $\gamma$,” where “$\alpha$,” “$\beta$,” and “$\gamma$” refer to event types.

Validity: The claim “$\alpha$ causes $\beta$ in $\gamma$” is valid if $\alpha$ causes $\beta$ in $\gamma$.

A causal generalization is …

  Internally valid if a directional, unconfounded causal relationship of the claimed type is present. Internal validity is often warranted under identification of a causal effect between measured variables.

  Construct valid if both the cause ($\alpha$) and the effect ($\beta$) have been correctly specified as “$\alpha$” and “$\beta$,” and other events have not been incorrectly specified as the cause or effect.

  Externally valid if the causal conditions ($\gamma$) have been correctly specified as “$\gamma$,” and other conditions have not been incorrectly specified.

Causal specification: The process of stating and warranting the assumptions regarding internal, construct, and external validity that are necessary to deduce the claim “$\alpha$ causes $\beta$ in $\gamma$” from observed events $A$ and $B$ in conditions $C$. If any of the assumptions are missing, the claim will be deductively unsupported.

4 Causal specification for construct validity

Traditionally, construct validity centers on considerations of the quality of observed measures when a criterion measure does not exist, to ensure that the variable that is measured in fact corresponds with the concept of interest [70,71]. In causal analysis, this means the semantic labels assigned to the causal relata are correct [55]. The notion originates in Cronbach and Meehl [72] who proposed assessing whether the pattern of convergences and divergences in a set of correlations meets the theoretical expectations of a “nomological network.” As Borsboom et al. [61] explain, such an analysis of correlations might warrant the match between a measure and the underlying ontological referent, but this only serves as an empirical validation procedure to support an assumption of validity [73].

According to Borsboom et al. [61], a measure is construct valid if measured observations are themselves caused by the underlying (ontological) referent of interest. Referring back to our definition (4), in our framework, a causal generalization is construct valid only if $\alpha$ and $\beta$, inferred from observations of $A$ and $B$, are the real underlying cause and effect. As ontological referents, $\alpha$ and $\beta$ are latent and so not normally readily measurable. Instead, the correspondence between the measured variables and the intended relata is a (possibly warranted) assumption, governed by considerations of construct validity, just as the presence of internal validity is an assumption. Assigning correct semantic labels “$\alpha$” and “$\beta$” to the causal relata thus stands as one of the core inferential risks when making deductive causal claims based on the statistical relationship between $A$ and $B$ [48]. Without an explicit assumption and justification for their semantic labels, researchers cannot properly claim to have deduced a generalized causal effect. One knows only that something caused something, not what causes what.

4.1 Construct validity of the cause

Construct validity is essential for understanding the role of the intervention as a possible causal agent, and so we first consider construct validity of the cause. Generally, analysts claim the specific physical properties of an intervention stand as an instance of an underlying causal referent [74]. For example, Gerber et al. [21] take the text statement on a postcard promising to reveal one’s voting behavior to one’s neighbors as an instance of “social pressure,” much like GSL took their text to be an instance of “enforcement.” The correspondence between the observed intervention and the underlying construct is necessarily imperfect, however [70, p. 534]. For example, different physical manifestations can correspond to the same referent depending on the context, such as when Dunning et al. [75, p. 43] devise different informational voting interventions to match across implementations in Latin America, South Asia, and Africa.

In a proposed empirical test of the causal process, i.e., whether $A$ causes $B$, the manipulated intervention variable $A$ is presumed to contain at least one necessary component (active ingredient) for the cause to occur [50,51]. Every intervention must be a bundle of components, however, some of which are active ($\alpha$) and some of which are inert ($\theta_\alpha$). Establishing internal validity alone cannot warrant assigning the label “active ingredient” to any of the elements in the intervention because the manipulation itself is always potentially confounded with inert ingredients labeled as active, or vice versa [55, pp. 379, 382] (see also Fong and Grimmer [76]). This is the problem Dafoe et al. [66] identify as “informational equivalence.” Instead, as a minimum requirement, a deductively valid generalized causal claim must assume and specify the active ingredient $\alpha$ and assign to it a construct valid, semantically meaningful label.

The active ingredients in social science interventions typically are not as easily identified as in the case of drug trials. For the GSL example, the manipulation is not only the enforcement message but everything else bundled with the intervention, including the level of threat, the presence of the numerals indicating the statute, sentence complexity, and so on [76]. Because the inert and active ingredients perfectly covary within a well-designed RCT, even a gold standard RCT cannot by itself distinguish the active from the inert ingredients. Furthermore, some elements might not be entirely ontologically distinct, such as “enforcement” and “threat” in the GSL example. Even if the elements are sufficiently distinguishable, conceptually and empirically, which ingredient best characterizes what is actually driving outcomes remains an open question.

The Credibility Revolution understands aspects of this problem of confounding in the intervention. However, its practitioners address the problem principally by stipulating ancillary assumptions to causal identification rather than treating construct validity as a core element of causal generalization. In particular, when there is full compliance with the protocol, RCT designs rely on two assumptions in addition to the assumption of randomization, known as the “exclusion restriction” and the “stable unit treatment value assumption” (SUTVA) [10,49]. The exclusion restriction and SUTVA allow one to ignore each unit’s assignment and the assignment and exposure vector of all other units, and so the two assumptions greatly reduce the number of potential outcomes to consider [49]. Substantively, these two assumptions rule out certain, but not all, aspects of confounding within the intervention; confounding can remain even in an otherwise perfect design [77], rendering the deductive claim false.

First, consider the exclusion restriction, which assumes that the assignment itself has no direct or indirect effect on the outcome other than through the treatment. Absent blinding, random assignment can create confounds such as John Henry and Hawthorne effects that occur simply because the unit is aware of assignment. To assume the assignment itself is not causal under the exclusion restriction is to assume that the assignment is not among the active ingredients. Although the exclusion restriction labels the assignment process as an inert component, it does not label the active component of the intervention [77, p. 176], and hence, does not address construct validity.

Second, SUTVA assumes that the treatment each unit receives is not affected by other units, irrespective of whether the other units were assigned to treatment or control. For example, SUTVA rules out the presence of spillover from the treatment units to the control units, such as when someone in GSL’s treatment group is friends with someone in the control group and so shares the postcard message. Randomization in an RCT does not rule out this scenario, and hence, the analyst must assume the active ingredients that define the treatment are the ones that the analyst had intended. But similar to the exclusion restriction, SUTVA does not label the active component of the intervention and so also does not address construct validity.
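A short simulation sketch (ours; the 30% spillover rate is an arbitrary assumption) shows why a SUTVA violation of this kind matters even under perfect randomization: if some control units see the message anyway, the naive contrast no longer estimates the effect of the message.

```python
# Illustrative sketch: spillover from treated friends violates SUTVA.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

A = rng.integers(0, 2, size=n)                   # random assignment
spill = (rng.uniform(size=n) < 0.30) & (A == 0)  # 30% of controls exposed anyway
exposed = (A == 1) | spill                       # actual exposure to the message

B = (rng.uniform(size=n) < 0.20 + 0.10 * exposed).astype(int)

# The assignment contrast is attenuated by the contaminated control group
# (about 0.07 here rather than the true 0.10 effect of exposure).
print(B[A == 1].mean() - B[A == 0].mean())
```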

Construct validity requires semantic labels to match the referents. The exclusion restriction and SUTVA, although important, are insufficient to deductively guarantee such a match. That the labels are adequately justified is thus a further assumption, normally grounded in theoretical or substantive knowledge about what is (the threat of enforcement) and is not (the exact length of the sentences) likely to be an active ingredient in the intervention.

4.2 Construct validity of the outcome

Correctly identifying and measuring the outcome of interest is also essential for causal identification. In a clinical trial, for example, one might relieve the symptoms and mistakenly conclude one has cured the underlying disease. This could occur if one uses fever as the measure of disease, applies ice to the patient, and then claims the disease has been cured. In the GSL example, the intervention aims to increase juror turnout, but the jury administrator might record an excuse from service as also having fulfilled the legal requirements.

Construct validity of the outcome is present when the outcome is correctly labeled and conceptualized [71]. The measured outcome $B$ might stand in a variety of relationships to the outcome of interest $\beta$. In some cases, $\beta$ itself might be directly measurable (e.g., response time), in which case $B = \beta$. More commonly, $B$ and $\beta$ stand in some causal relationship, where $B$ is a presumed cause or effect of $\beta$ or the two are related to a common cause. For GSL, if $\beta$ is juror turnout, $B$ might be the clerk’s record of which residents reported on the assigned day, which could be entirely accurate or contain false positives or negatives. Generally, the tighter the causal relationship between $\beta$ and $B$, the better the warrant for inferring from the directly observed $B$ to the claimed $\beta$.

In short, when $B$ is observed, the outcome of interest $\beta$ might have occurred, or (in false-positive cases) only a related outcome not of interest, $\theta_\beta$, might have occurred. Establishing internal validity does not establish the existence of the required relationship between $\beta$ and $B$. Absent specification that $B$ captures $\beta$, one cannot properly claim to have specified the real outcome. Hence, construct validity of the outcome would be lacking, and again, the deduction would not be true.
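A final sketch for this subsection (ours; the rates are arbitrary) illustrates the false-positive case in (4b): if the threat also elicits legally valid excuses that the clerk records as fulfillment, then $B = \beta \vee \theta_\beta$ and the recorded “turnout” effect mixes the outcome of interest with the related outcome not of interest.

```python
# Illustrative sketch of definition (4b): the record B counts excuses
# (theta_beta) as fulfillment, inflating the apparent effect on turnout (beta).
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

A = rng.integers(0, 2, size=n)
beta = (rng.uniform(size=n) < 0.20 + 0.05 * A).astype(int)        # actual turnout
theta_beta = (rng.uniform(size=n) < 0.05 + 0.05 * A).astype(int)  # excuses, also raised by the threat
B = np.maximum(beta, theta_beta)                                  # recorded "turnout": beta OR theta_beta

print(beta[A == 1].mean() - beta[A == 0].mean())  # effect on the outcome of interest (~0.05)
print(B[A == 1].mean() - B[A == 0].mean())        # recorded effect, inflated by mislabeled excuses
```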

5 Causal specification for external validity

Traditionally, external validity focuses on whether an identified causal effect extrapolates, transports, or is generalizable to other settings [17,18,77–79].[13] “Settings” include different countries, time periods, populations, contexts, and laboratories. Although this definition is standard, it is often viewed as an unattainable ideal [17,80]. Very few social science studies yield the same results across all settings of human existence [77,78]. Indeed, despite internal validity, RCTs usually yield substantially varying results across settings [80–85]. Some of this variation, of course, arises when RCTs are executed inconsistently. However, low external validity in the traditional sense is quite common even when RCTs are executed consistently [80,86,87].

5.1 Clarifying the definition of external validity

Partly because of the pervasive lack of traditionally defined external validity, we propose a clarification of the definition of external validity used in the causal inference literature. This clarification then helps to establish the relationship between external validity and causal deduction.

First, we define causal conditions ($\gamma$) as any active ingredients that are balanced across treatment and control groups, or conditions that are constant in a setting or causal field [50]. Like Cartwright’s [57] “helping factors” and “countering causes,” Deaton and Cartwright’s [80] “support factors,” and Findley and colleagues’ [17] “context or structural factors,” causal conditions can augment or undermine a cause. These conditions include quintessential characteristics of settings like culture and institutions [19,88–91]. In the classic example [92], oxygen is an active condition that is necessary for the treatment effect of striking a match to result in the outcome of fire.[14] As we explain below, other conditions are inert (e.g., nitrogen in the air).

Second, we clarify that external validity requires the correct specification of the conditions ($\gamma$) under which the causal generalization operates. Thus, defending a claim of external validity requires evidence or assumptions about what conditions enable the treatment to produce its effects. According to the traditional definition, a claim is externally valid if it generalizes across settings. We say a claim is externally valid to the extent one has accurately specified why or how the effect generalizes across settings. This means specifying the range of conditions in settings $\gamma$ across which the effect generalizes. Hence, we define external validity as the correct specification of the conditions that enable/disable or augment/weaken a causal effect (for a similar definition, see refs [17,33]). External validity is present when that specification is true – i.e., when $\gamma$ is correctly labeled.

This revised definition is more general than the traditional definition and subsumes it as a special case. The traditional definition implies that external validity is present only when $\alpha$ causes $\beta$ in a wide range of conditions $\gamma$, or where the conditions $\gamma$ are widespread. Our definition is also more attainable. Unlike the traditional definition, we embrace the reality that treatment effects will vary across settings because of the inescapable role of conditions that generally also matter for the outcome [79,80,84,85,87].

Our approach reveals that it is as important to know the settings where a cause will not occur as to know the settings where it will occur. A claim can have high external validity if the researchers accurately claim a relationship between $\alpha$ and $\beta$ holds across a wide range of conditions. But a claim can also have high external validity if the researchers have specified that $\alpha$ causes $\beta$ only in a very limited, but precisely defined, range of conditions. In our approach, there is no error, no failure describable as a lack of “validity,” if the researchers accurately and convincingly claim that $\alpha$ causes $\beta$ only in a very limited range of conditions, provided those conditions themselves are explicitly specified. In addition to aiming to find causal relationships that hold across a wide range of settings, it advances external validity to clarify in what conditions causal relationships do and do not hold. Though we do not attempt to model it here, researchers might also specify how $\alpha$’s influence on $\beta$ varies in effect size across settings, e.g., a 1% increase in such-and-such conditions versus a 5% increase in such-and-such other conditions.

Thus, a deductive and general understanding of causality requires specifying how a causal claim is contingent on specific conditions. While GSL lacked traditionally defined external validity, GSL can attain external validity under our definition by correctly specifying which conditions moderate the treatment and the range of settings in which the treatment will have the described effect. For example, GSL might successfully justify the assumption that the intervention works in the setting of Riverside but not in Orange County by specification of the condition of affluence.
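The following sketch (ours; treating affluence as a binary condition with an assumed all-or-nothing moderation) illustrates external validity in this sense: the generalization is valid only when $\gamma$ is specified, because the same intervention has different effects across conditions.

```python
# Illustrative sketch: the enforcement effect operates only where the enabling
# condition holds (here, non-affluence), as in the Riverside/Orange contrast.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

affluent = rng.integers(0, 2, size=n)  # 0 ~ "Riverside-like", 1 ~ "Orange-like"
A = rng.integers(0, 2, size=n)
effect = 0.10 * (1 - affluent)         # gamma enables the effect only if non-affluent
B = (rng.uniform(size=n) < 0.20 + effect * A).astype(int)

for g, label in [(0, "non-affluent"), (1, "affluent")]:
    m = affluent == g
    print(label, round(B[m & (A == 1)].mean() - B[m & (A == 0)].mean(), 3))
```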

Our revision embraces and is fully compatible with recent efforts in applied statistics to incorporate external validity into causal frameworks.[15] The existing approaches to generalizing results to target populations depend in some form on an assumption of selection on observables (e.g., see Kern et al. [93]) or sensitivity analyses for unobserved moderators (e.g., see Nguyen et al. [94]). For example, Pearl and colleagues’ transportability approach [34] specifies which conditions are modifying the causal effect. Knowing “where” and “why” in the directed acyclic graph that effect moderation is occurring, however, requires assumptions of conditions in settings [95]. Propensity score approaches require knowing which conditions to include in the propensity score model and on what conditions to compare sample and target population (e.g., refs [82,96]). While these approaches are designed to address the problems of generalization and transportability in causal inference, none of them identify the necessity of external (or construct) validity for causal deduction.
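As a stylized example of these reweighting approaches (our sketch, not any particular package’s implementation), one can transport a site-specific ATE to a target population by reweighting stratum-specific effects on an observed moderator, under the strong assumption that the moderator distribution is the only relevant difference between settings:

```python
# Illustrative sketch: transporting a site ATE by reweighting on a moderator.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Study site is 20% affluent; suppose the target population is 60% affluent.
affluent = (rng.uniform(size=n) < 0.20).astype(int)
A = rng.integers(0, 2, size=n)
B = (rng.uniform(size=n) < 0.20 + 0.10 * (1 - affluent) * A).astype(int)

# Stratum-specific effects estimated at the study site...
cate = {g: B[(affluent == g) & (A == 1)].mean() - B[(affluent == g) & (A == 0)].mean()
        for g in (0, 1)}
# ...reweighted to the target's moderator distribution.
ate_site = 0.8 * cate[0] + 0.2 * cate[1]    # ~0.08
ate_target = 0.4 * cate[0] + 0.6 * cate[1]  # ~0.04
print(cate, ate_site, ate_target)
```

The transported estimate differs from the site estimate precisely because the enabling condition is unevenly distributed across settings – which is why such procedures presuppose, rather than replace, the specification of $\gamma$.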

5.2 Leaving the confines of the historicist’s refuge

For several decades, the social sciences have prioritized internal validity over external validity [17,77,82]. As a result, studies rooted in the Credibility Revolution often sidestep external validity by simply declaring a causal effect is “local.” Indeed, RCTs like GSL’s often are framed as only providing a local causal effect.

However, if one has identified only a local causal effect without external validity, one can only produce a very limited kind of knowledge. Absent external validity, any claim must be circumscribed to a specific set of units exposed to a specific event in a specific time and specific place, and is only knowable retrospectively [51,56,79,81,83,97,98]. Cartwright [57] explains that internal validity only shows “it works somewhere.” Actually, internal validity only shows it historically worked (in the past tense) somewhere [99]. As Cronbach [35, p. 137] explains, internal validity alone is “trivial, past-tense, and local.” This is not the sort of general causal knowledge social scientists typically want to accumulate [17]. As Cartwright [57] explains, we want to know: “it works widely” or “it will work for us” [78,80,82,88]. For this exact reason, Rubin [2] originally stressed the need for “subjective random sampling” of settings to ensure a study was of “practical interest,” “representative,” and “useful.”

Confronted with this, some might claim that identifying an effect in one setting is sufficient and they have no intention to deduce generalized knowledge. GSL might say they simply are testing the effect of an intervention in Orange County only and it is beyond their study to ask whether the treatment effects generalize. They may even say that one setting at one time defines their “population” and their sample generalizes to that population. We propose that by claiming they only intend to make a specific historical claim about the effect of something in only one setting – which is actually a time and place with certain very specific conditions – they are retreating to what we call the historicist’s refuge.

Historians’ idiographic causal narratives of specific events are certainly valuable. Indeed, as Kocher and Monteiro [30] note, such knowledge is essential for developing and justifying research designs for natural experiments. Nevertheless, we doubt that social scientists truly have no desire to be different from historians [17,86]. As Nosek and Errington [99, p. 3] explain, social scientists rarely limit their inferences to a “particular climate, at particular times of day, at a particular point in history, with a particular measurement method, using particular assessments, with a particular sample.” Indeed, we choose topics to study because they are instances of some generalization. For example, GSL’s study was an instance of the general phenomena of jury service or democratic participation. When researchers intentionally choose topics to understand general phenomena, it is simply not credible to back out of the generalization by declaring post hoc that causal effects are “local.”

However, if one is truly making only a "local" claim from within the historicist's refuge, this would require explicit language. Just as the Credibility Revolution requires causal identification for any language of causal effects, historicists should declare their inability to generalize to any setting other than the one experimental setting at the one time when the experiment was actually conducted. Readers could police against any language of generalization just as readers currently police against causal language in observational studies. It seems fair to note that such a practice would probably require fairly dramatic changes to studies of the US, which are rarely forced to justify selecting the highly unusual US case [17,86].

Even with careful language, however, the historicist's refuge cannot lead to a coherent, deductive understanding of causality [88]. Historicists in their refuge claim to identify a causal effect while having no understanding of the conditions in the setting enabling that effect. One does not know how much of the effect is due to the treatment or to some complex interaction between the treatment and conditions in the setting. One does not even know what the relevant conditions might be. Hence, the lack of external validity reveals a lack of understanding about what really causes what.

Furthermore, any commitment to replication forces one inevitably to abandon the historicist's refuge. If even one other setting yields a different result, beyond sampling variability, this proves that conditions in settings matter. Once GSL found a different result in Orange County, they needed to confront whether the treatment effect was aided by helping factors in Riverside County or suppressed by countering causes in Orange County (or both). Unknown conditions might even make both Riverside and Orange unusual. If the conditions are unusual, then the causal effect could be unusually large or small. Any broader generalization would suffer from a selection bias, just like any sample selection bias (as emphasized in Findley et al. [17]).

In response to these challenges, some might admit a lack of external validity and say the "next step" is to go forth inductively across a range of settings. For instance, Banerjee and Duflo [5, p. 162] write: "If we were prepared to carry out enough experiments in varied enough locations, we could learn as much as we want to know about the distribution of the treatment effects across sites." This is not feasible, however, without causal specification [88]. Sampling a "range of settings" or "similar settings" presumes one knows what defines the range or similarity. The law of large numbers does not ensure representativeness if one is sampling from a corner of the sample space, and "simple enumerative induction" does not warrant claiming the treatment "reliably promotes" outcomes [57]. Choosing appropriate settings to evaluate variations in treatment effect sizes requires causal specification of the elements that enable the cause and that also vary across settings [5,80]. In the GSL vignette, choosing Orange County as the next step is merely haphazard without some understanding of what conditions might be relevant.
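A brief simulation, with entirely hypothetical numbers, illustrates why enumerating more experiments does not substitute for specifying the enabling conditions: if the settings one can access over-represent the condition γ, the average of even many internally valid replications converges to the wrong generalization.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical universe of settings: in each setting, a latent
# enabling condition gamma is either present or absent, and the
# treatment works only where the condition is present.
n_settings = 10_000
gamma = rng.random(n_settings) < 0.3            # present in 30% of settings
tau = np.where(gamma, 0.6, 0.0)                 # setting-level effects

# "Varied enough locations" -- but drawn from an accessible corner
# of the space where the enabling condition happens to be common
# (e.g., sites that resemble the original study site).
p_access = np.where(gamma, 0.9, 0.1)
accessible = np.flatnonzero(rng.random(n_settings) < p_access)
studied = rng.choice(accessible, size=500, replace=False)

print(f"mean effect, all settings:     {tau.mean():.2f}")           # ~0.18
print(f"mean effect, studied settings: {tau[studied].mean():.2f}")  # ~0.48
# More replications in the accessible corner only tighten the
# estimate around the wrong target; specifying gamma is what fixes it.
```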

Ultimately, neglecting external validity raises concerns very similar to the "radical skepticism" about unknown confounding [47]. Verbal assurances of external validity strain credibility just as verbal assurances about control variables do [80]. Radical skeptics about unknown confounding should be equally skeptical about any generalization based on unknown conditions [31].

6 Causal specification and paths forward toward deductive causal inference

Our causal specification framework clarifies the assumptions that must be added to prevailing textbook identification strategies to support credible deductive claims about causal effects. In the absence of construct and external validity, the design-based researcher converts a deductive "what if" question into an exploratory "why" question [14]. Researchers might implicitly speculate that α is the active ingredient in A, that B accurately tracks β, and that the causal conditions γ in C operate similarly elsewhere. These, however, are assumptions that underwrite deductive causal claims about the nature of the cause, the nature of the effect, and the scope of the generalization. Failing to make explicit that these are assumptions undermines the deductive character of any causal claim even when internal validity is present. Indeed, as practitioners of the Credibility Revolution know well, internal validity itself remains an assumption even after randomization and even after balance tests have been passed [100,101].

The inescapable role of assumptions leads to our first recommendation. Our causal specification framework makes clear that ontological problems of how to assign semantic labels to causes, outcomes, and conditions cannot be solved by statistical procedures [48]. Instead, as Cook [100], Deaton and Cartwright [101], Kocher and Monteiro [30, p. 953], Pearl and MacKenzie [25], Slough [26] – and many others – emphasize, the assumptions that underwrite a design for causal inference are not statistical but instead are derived from theory and substantive knowledge, including hypotheses about the concrete mechanisms involved [54,91,102] (e.g., that GSL's prospective jurors were motivated by fear of punishment).

This means that substantive domain knowledge must always be a partner in deductive causal inference. One can develop a theoretical and substantive understanding of the underlying referents that correspond to the actual causal relata and the relevant conditions in settings through immersion in descriptive [103], historical [30], or qualitative empirical evidence [104], or even by well-reasoned intuition [25]. Such theory and substantive knowledge enable researchers to specify relatively more plausible background assumptions concerning construct and external validity. Acknowledging this highlights the essential role of descriptive, historical, and qualitative evidence for scientific progress even within the most sophisticated statistical techniques [30,105–107]. Acknowledging this also highlights the essential role of interdisciplinarity and collaboration between methodologists and substantive experts.

We propose three other directions forward beyond recognizing the inescapable role of assumptions, theory, and substantive knowledge. First, claims regarding construct and external validity must be supported with the same rigor, precision, and care as claims regarding internal validity. As we mention above, the Credibility Revolution ushered in a normative change in the language regarding causal effects. The causal specification framework encourages another normative change toward greater circumspection and modesty about what can be claimed based on internal validity alone. For instance, even if researchers want to keep referring to "causal effects" based on internal validity, we urge a shift to "a" rather than "the" causal effect. Even if one claims "a" causal effect of a certain treatment in one setting, that certainly does not warrant claims about "the" causal effect in general and across settings.

Closely related, we recommend that researchers who wish to make deductive contributions to our understanding of causal processes explicitly specify their αs, βs, and γs and defend the inertness of their θs. For researchers who already take construct and external validity explicitly into account, this might amount to only a more formal statement of their assumptions. For other researchers, like GSL, this might require confronting and clarifying causal assumptions through substantive knowledge that they would otherwise disregard and leave implicit in the background.

Second, the scientific community should be encouraged to conduct what Ankel-Peters and colleagues [108] call "policing replications" of studies evidencing internal validity. Rather than presuming that strong internal validity and causal identification by themselves provide strong support for causal generalization, the scientific community should treat such evidence as only preliminary. Such evidence should always be tested in other settings to understand external validity, and such evidence should always be scrutinized for construct validity.

In recent years, many have tested for variation in treatment effects across settings [109,110]. Often, such replications reveal serious questions about the generalized scope of a causal effect. For instance, Henrich and colleagues [86] show that internally valid psychological experiments often fail to generalize outside rich, Western democracies. Development economics has seen similar failures of RCTs to replicate [82,83,85]. By demonstrating that causal effect heterogeneity is pervasive and large, such replications illustrate convincingly the gap between an internally valid effect and a causal generalization.

Beyond external validity, policing replications can also open up questions about construct validity. Such studies force us to ask harder questions about whether an observed treatment is actually validly measuring the same active ingredient of the underlying construct across various settings and respondents. For instance, Gilbert et al. [111] criticize the Psychology Reproducibility Project for assuming different treatments have the same construct validity in replications of experiments across many settings. Gilbert and colleagues point out that it is unclear that an original experiment asking Israelis to imagine the consequences of military service reflects the same underlying construct as a replication asking Americans to imagine the consequences of a honeymoon. Similarly, Stich and Machery [112] demonstrate that philosophical cognition experiments exhibit heterogeneity across settings in part because of demographic differences in what treatments mean and how they are understood by subjects. Their “geography of philosophy” project improves validity by exploring how reactions to thought experiments vary across settings (i.e., external validity), what features of the treatments drive participant reactions (i.e., construct validity of the cause), and how to interpret heterogeneous participants’ reactions (i.e., construct validity of the effect).

Third, we recommend a research agenda seeking to understand the external and construct validity challenges of internally valid studies. Here, researchers take internally valid studies and then model why and how variation in construct and external validity matters. Beyond showing that some experimental psychology lacks external validity [86], Muthukrishna and Henrich [113] investigate how cultural distance from the US explains the variation in treatment effects. Price and colleagues [114] reexamine 194 RCTs to show how anti-Black racism is a critical condition in communities that moderates/undermines the effects of psychotherapy on youths. Also, Price and colleagues [115] show that sexism is a critical condition across settings that moderates how well psychotherapy helps girls. Of course, these studies will need to establish internal validity in the causal effects of such conditions across settings as well.

Finally, we readily acknowledge that this article (intentionally) raises more questions than it can answer. Our hope is to push causal inference scholars to re-examine the concepts of validity precisely because there are so many open and implicit questions. Specifically, it is essential to debate questions such as: What criteria should we use to assess whether an article addresses construct and external validity satisfactorily? How best can we navigate trade-offs between internal, external, and construct validity? What practical tools developed in the causal inference community can provide a more concrete path forward to best warrant assumptions that causal inference scholars currently leave implicit?

7 Conclusion

Social scientists typically aim to deduce general knowledge about what causes what in what conditions, and not just historical knowledge that something caused something one time in one setting in the past, i.e., social scientists aim to deduce valid causal generalizations. The tight linkage between the concept of internal validity and the concept of causality is encoded in the causal frameworks that have governed the Credibility Revolution. The RCM [2,20] and SCM [3] have made tremendous contributions while being centered on the problem of unconfoundedness and internal validity. However, the Credibility Revolution has not adequately recognized the necessary role of external and construct validity for causal deduction. We explain how these omissions undermine the Credibility Revolution’s own goal to understand causality deductively.

In this essay, we develop a causal specification framework to critically review the literature on quantitative causal inference. The challenge of causal specification is not only the challenge of confirming that in fact something caused something in one setting (the focus of internal validity) but equally the challenge of correctly labeling the nature of the cause, the nature of the effect, and the conditions under which the generalization holds. By itself, even the most rigorous proof of internal validity shows only that some aspect of the manipulation (A, but not necessarily α) caused some measured outcome (B, but not necessarily β) in one setting (C, typically leaving γ implicit). Construct validity is achieved when the semantically asserted cause and effect are the actual cause and effect. External validity is achieved when the scope of the generalization is correctly specified. Unless all three validities are present, a substantive claim that "α causes β in γ" is false. Assumptions about all three types of validity are required for deduction; none has priority. Internal, construct, and external validity are three legs of the stool of causal generalization. And the causal generalization is only as strong as the weakest leg.

The causal specification framework shows that the textbook identification assumptions within "credible designs" are insufficient for deducing generalized causal claims. Identification focuses on internal validity but typically neglects construct and external validity. Social scientists who wish to make generalized and deductive causal claims instead must attend equally to internal, construct, and external validity. Specification of assumptions about construct and external validity can augment current approaches to identification and enable researchers to make coherent deductive causal claims. Moreover, these assumptions inevitably rely on substantive, theoretically grounded, and verbally justified labeling of the relata for construct validity, and of the conditions for external validity. These additional assumptions regarding the relata and conditions are as necessary for deducing a generalized causal claim as are assumptions of internal validity.

If applied researchers ignore construct and external validity when stating causal claims, they mistakenly convert an intended deductive claim into a claim based on exploration and speculation – contrary to the fundamental goals of the Credibility Revolution. Our framework for causal specification corrects this and offers a means for applied researchers to preserve the deductive nature of their claims not only at the level of measured variables but also – more importantly – at the level of relata and conditions. In this way, causal specification clarifies the additional assumptions required for the Credibility Revolution to achieve its aspirations of understanding causal effects.

Acknowledgements

This work was prepared for presentation at the Metascience 2023 Conference, National Academy of Sciences, Washington, DC, May 9, 2023. Previous versions presented at the 2021 Annual Summer Meeting of the Society for Political Methodology, the 2021 Annual Meeting of the American Political Science Association, and the WZB Talks Series, July 2020. We thank Elias Bareinboim, Michael Bates, Shaun Bowler, Nancy Cartwright, Carlos Cinelli, Justin Esarey, Uljana Feest, Diogo Ferrari, Christian Fong, Don Green, Justin Grimmer, Francesco Guala, Steffen Huck, Macartan Humphreys, Robert Kaestner, Sampada KC, Jon Krosnick, Dorothea Kubler, Doug Lauen, Joscha Legewie, Kevin Munger, Michael Neblo, Jörg Peters, Alex Rosenberg, Tara Slough, Heike Solga, Jacqueline Sullivan, Nicholas Weller, Bernhard Wessels, Ang Yu, and the participants in the MAMA workshop in UCR Psychology for comments. The authors are grateful for the reviewer’s valuable comments that improved the manuscript.

  1. Funding information: No external funding supported this work.

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and consented to its submission to the journal and approved the final version of the manuscript. The authors contributed equally.

  3. Conflict of interest: The authors state no conflict of interest.

Appendix A Formal statements of construct and external validity

Recall that we stated our general causal claim as the sentence: "α causes β in γ." Causal specification requires assumptions about each of these aspects of a causal process as well as their relationship. We formalize those assumptions in this appendix.

A.1 Causal specification for internal validity

In an RCT, identification requires internal validity, i.e., that the value of β under the counterfactual of α being either true or false is in fact unrelated to the realized value of α within an experiment, or

(A1) "$[\beta_i(\alpha = 1), \beta_i(\alpha = 0)] \perp\!\!\!\perp \alpha_i$",

which is analogous to equation (1), except it is stated at the level of the causal relata, and we enclose it in quotes to highlight its status as a claim that might depart from the truth.
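As an illustration of claim (A1), the following sketch simulates latent potential outcomes (all numerical values hypothetical) and shows that randomized assignment of α satisfies the independence claim while confounded assignment does not. Of course, in practice the relata are latent; the simulation only makes the content of the assumption visible.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical latent potential outcomes beta_i(alpha=1) and
# beta_i(alpha=0), correlated with a unit trait u.
u = rng.normal(0, 1, n)
beta0 = u + rng.normal(0, 1, n)
beta1 = beta0 + 0.5                  # true effect of alpha is 0.5

# Randomized alpha satisfies (A1): assignment is independent of
# both potential outcomes, so the difference in means is unbiased.
a_rand = rng.integers(0, 2, n)
beta = np.where(a_rand == 1, beta1, beta0)
print(beta[a_rand == 1].mean() - beta[a_rand == 0].mean())   # ~0.50

# Confounded alpha (more likely when u is high) violates (A1):
# the same estimator is now biased upward.
a_conf = (rng.random(n) < 1 / (1 + np.exp(-u))).astype(int)
beta = np.where(a_conf == 1, beta1, beta0)
print(beta[a_conf == 1].mean() - beta[a_conf == 0].mean())   # > 0.50
```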

A.2 Causal specification for construct validity of the cause

Under our notation, a claim for weak construct validity of the cause takes the form,

(A2a) $\tau_{CVC}^{W} = E_S\{B_i([(\alpha_i = 1) \wedge \theta_{\alpha i}], C_i)\} - E_S\{B_i([(\alpha_i = 0) \wedge \theta_{\alpha i}], C_i)\} \quad \forall\, \theta_\alpha$,

(A2b) $\tau_{CVC}^{W} \neq 0$,

where the symbol $\forall$ means "for each" – i.e., over the cases where θ_α is either true or false, ignoring cases in which θ_α is true on one side of the statement and false on the other. When the claim is about a direction of the causal effect, such as if GSL were to claim that the postcards increase turnout, the inequality in statement (A2b) should be directional, using either > or < depending on the direction. Under this claim, the causal effect τ compares expected potential outcomes when α is present to when α is absent, both when θ_α is present and when it is not. The claim in (A2) holds that the cause the researcher postulates to be the actual cause is in fact an actual cause. In other words, to support a deduced causal generalization, the cause α must be specified. If claim (A2) is false, the claimed cause "α" is not a real cause α and construct validity is absent.

Our definition of construct validity cannot be accommodated in either the RCM or the SCM whenever practitioners in either framework take measured variables as primitive. In particular, to enable valid generalized causal claims, the RCM would need to relax the requirement that potential outcomes are defined over measured variables only [65]. Claim (A2) demonstrates the inadequacy of the exclusion restriction and SUTVA as substitutes for construct validity. Each of these is only a special case of assumptions regarding inert ingredients – for example, that θ_α characterizes the assignment process or non-causal components of the intervention – without specifying α.

A claim of strong construct validity of the cause would add the following to claim (A2):

(A3) $0 = E_S\{B_i([\alpha_i \wedge (\theta_{\alpha i} = 1)], C_i)\} - E_S\{B_i([\alpha_i \wedge (\theta_{\alpha i} = 0)], C_i)\} \quad \forall\, \alpha$.

When α is present, the expectation of B is the same irrespective of whether θ_α is present or absent, and likewise when α is absent. Under our background assumptions, unless this statement is true, the claimed inert ingredient "θ_α" is not the real inert ingredient θ_α, and hence strong construct validity is lacking. The difference between the weak and strong versions is that in the weak version, α is relevant to the outcome regardless of whether θ_α is present. By contrast, the strong version adds that θ_α's presence or absence is irrelevant to the outcome.
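The following simulation, under a hypothetical outcome model of our own construction, computes the contrasts in claims (A2) and (A3) directly. When the postulated inert ingredient θ_α in fact moves the outcome, the weak contrasts in (A2) can remain nonzero while the strong contrasts in (A3) depart from zero, so strong construct validity fails.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

def mean_B(alpha, theta, theta_inert=True):
    """Hypothetical outcome model: the active ingredient alpha moves
    beta by 0.5; if theta_inert is False, the supposedly inert
    theta_alpha also moves beta by 0.3 (a construct validity failure,
    e.g., the postcard's envelope carrying its own social cue)."""
    beta = 0.5 * alpha + (0.0 if theta_inert else 0.3) * theta
    return (beta + rng.normal(0, 1, n)).mean()   # B measures beta noisily

for inert in (True, False):
    # Claim (A2): alpha matters, for each value of theta_alpha.
    weak = [round(mean_B(1, th, inert) - mean_B(0, th, inert), 2)
            for th in (0, 1)]
    # Claim (A3): theta_alpha is irrelevant, for each value of alpha.
    strong = [round(mean_B(a, 1, inert) - mean_B(a, 0, inert), 2)
              for a in (0, 1)]
    print(f"theta inert={inert}: weak {weak}, strong {strong}")
# With a truly inert theta: weak ~[0.5, 0.5], strong ~[0.0, 0.0].
# With a non-inert theta: the strong contrasts are ~0.3, so the
# claimed inert ingredient is not the real inert ingredient.
```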

Regarding causal specification of construct validity of the outcome: the details of causal modeling of the relationship between B and β elude our basic notation.

A.3 Causal specification for external validity

Our definition of external validity relies on understanding conditions as all features of the setting, units, or design that are constant or balanced between values of α. Recall we define the conditions in the setting as $C \equiv (\gamma \wedge \theta_\gamma)$, i.e., C is true if both γ and θ_γ are true; γ are the active ingredients in the setting that enable or disable the cause α, and θ_γ are the inert ingredients that are also in the setting. Under our definition, the causal conditions γ must be correctly specified to make a valid causal generalization. The presence of θ_γ clarifies that there are features that are constant or balanced, many of which are ignorable. A claim of weak external validity is

(A4a) $\tau_{EV} = E_S\{B_i(A_i, [(\gamma_i = 1) \wedge \theta_{\gamma i}])\} - E_S\{B_i(A_i, [(\gamma_i = 0) \wedge \theta_{\gamma i}])\} \quad \forall\, \theta_\gamma$,

(A4b) $\tau_{EV} \neq 0$.

Note the close parallel with weak construct validity of the cause in claim (A2a). When the claim is about a direction of the causal effect, such as in GSL's claim that affluence reduces the effect of the enforcement message, the inequality in statement (A4b) should be directional, using either > or < depending on the direction. The causal effect of A on B depends on whether γ is present or absent, regardless of whether θ_γ is present or absent. The RCM is not well equipped to handle considerations of external validity, given that its focus is on identifying local effects. The SCM addresses considerations of external validity using the notion of "transportability" described in Bareinboim and Pearl [34]. However, as Appendix B shows, claims of transportability must be over latent conditions γ rather than measured contextual variables C.

A claim of strong external validity adds the following to claim (A4):

(A5) $0 = E_S\{B_i(A_i, [\gamma_i \wedge (\theta_{\gamma i} = 1)])\} - E_S\{B_i(A_i, [\gamma_i \wedge (\theta_{\gamma i} = 0)])\} \quad \forall\, \gamma$.

The strong external validity claim adds that θ_γ is irrelevant to the expected outcome, provided that A is present and γ is constant. If the equality is false, the claimed inert condition "θ_γ" is not a real inert condition θ_γ, and strong external validity is absent.
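A parallel sketch, again with hypothetical values of our own choosing, computes the contrasts in claims (A4) and (A5) for a setting condition γ that enables the effect of A while θ_γ is genuinely inert.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

def mean_B(A, gamma, theta_g):
    """Hypothetical model: the measured treatment A affects the
    outcome only when the enabling condition gamma is present;
    theta_g is a genuinely inert feature of the setting."""
    beta = 0.5 * A * gamma           # gamma enables the effect of A
    return (beta + rng.normal(0, 1, n)).mean()

# Claim (A4): with A present, the outcome differs by gamma,
# for each value of theta_g (weak external validity).
for th in (0, 1):
    tau_ev = mean_B(1, 1, th) - mean_B(1, 0, th)
    print(f"theta_g={th}: tau_EV ~ {tau_ev:.2f}")        # ~0.50, nonzero

# Claim (A5): theta_g is irrelevant given A and gamma
# (strong external validity).
for g in (0, 1):
    diff = mean_B(1, g, 1) - mean_B(1, g, 0)
    print(f"gamma={g}: theta_g contrast ~ {diff:.2f}")   # ~0.00
```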

Appendix B DAG representation

In this appendix, we approximate the argument in our text using directed acyclic graphs (DAGs) [3]. A DAG cannot represent the full argument for two reasons. First, as we show in definition (4) of the main text, we conceive of the measured variables, A, B, and C, as bundles of active and inert causal ingredients. In this sense, the measured variables are compositions. While the measured variables are bundles and hence not exactly the causes, neither are the measured variables the effects of the causes – the part does not necessarily cause the whole, nor does the whole necessarily cause the part. However, under the consistency rule [116, p. 872], only causal nodes are permissible within a DAG, and hence the compositions that are at the core of our definition of validity are not permitted. As a result, a DAG lacks this flexibility and cannot represent the part-whole relationship as a subset relationship. Instead, it must represent part-whole as either the parts causing the whole or the whole causing the parts. Given this limitation, we can only approximate our framework in a DAG by also assuming that the elements themselves – the active and inert ingredients – cause the measured variables [61], which is in the tradition of measurement theory but not fully consistent with our causal specification framework.

Second, it is well known that DAGs cannot visually represent effect modification [92], which is also central to our model of causality. Instead, the DAG can only reference a separate formal statement of the effect modification, such as the one we provide in claim (7) of the main text.

The DAG in Figure A1 is a representation of the generalized causal claim "α causes β in γ" as defined in claim (7) of the main text; this representation is valid if it corresponds to nature. For completeness, we introduce two assignment mechanisms that we leave implicit in the main text: Z is the assignment to treatment and control, and W is a choice over the conditions in the setting, such as the characteristics of the units, the research design, and the time and location; the do(·) operator is indicated by an arrow inside the yellow nodes. All of the other nodes are defined in the text. Gray nodes are unobserved or latent and are represented by Greek letters. The blue nodes with an "I," represented by Latin letters, are measured variables and hence are outcomes of a measurement process [61]; hence, this figure is consistent with the measurement view of the observed variable bundles. Among the latent nodes, α and γ are "active" ingredients in that they cause the outcome of interest β. The θ vector contains "inert" ingredients in that these nodes have no effect, either direct or indirect, on the outcomes β or B, but they can affect the measurement of the observed variable.
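For readers who want to inspect the structure programmatically, the following sketch encodes our reading of Figure A1 as a networkx graph. The edge list is our reconstruction from the verbal description above, not an artifact published with the figure, and the γ → β edge only approximates the effect modification that, as noted, a DAG cannot represent directly.

```python
import networkx as nx

# Our reconstruction of Figure A1 from the verbal description; node
# names and the edge list are our reading, not a published artifact.
G = nx.DiGraph()
G.add_edges_from([
    ("Z", "alpha"), ("Z", "theta_alpha"),   # do(): assignment to treatment
    ("W", "gamma"), ("W", "theta_gamma"),   # do(): choice of conditions
    ("alpha", "beta"),                      # the causal relata
    ("gamma", "beta"),                      # approximates effect modification
    ("alpha", "A"), ("theta_alpha", "A"),   # measured A bundles both
    ("beta", "B"), ("theta_beta", "B"),     # measured B bundles both
    ("gamma", "C"), ("theta_gamma", "C"),   # measured C bundles both
])

assert nx.is_directed_acyclic_graph(G)
# Each measured variable is a collider of an active and an inert parent:
for node in ("A", "B", "C"):
    print(node, "<-", sorted(G.predecessors(node)))
```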

Figure A1

Preferred model for the claim "α causes β in γ" specified within the causal process (7) of the main text. Gray nodes are unobserved causes, effects, and conditions of interest. Blue nodes with an "I" are measured variables. Yellow nodes with an arrow indicate do(·) commands; Z is an assignment mechanism, and W is a choice of conditions in the setting.

The diagram represents the causal process "α causes β in γ" that we represent in claim (7), i.e., γ is an effect modifier that is necessary for the cause associated with α to occur. An ideal experiment would execute a do(α) procedure in both the presence and absence of γ, but since α and γ are ontological referents (i.e., events in nature that we ordinarily do not observe directly), such a procedure typically is not possible. Thus, the causal relationship between the relata (α and β) and the causal conditions (γ) can only be assumed, and the validity of those assumptions depends on their correspondence with the truth that resides in nature.

The DAGs are useful to demonstrate that a strongly valid generalized causal claim based on an observed statistical relationship between A and B requires all of the Greek-letter nodes to be specified correctly. To support a weakly valid causal claim, the θ vector does not need to be specified.

Figure A2 shows DAG representations of (true) causal processes in which the (semantic) claim "α causes β in γ" lacks validity; i.e., in this figure, the DAGs are not a claim but instead a representation of the ontological causal process. The left panel, Figure A2(a), shows a process in which the claim "α causes β in γ" lacks construct validity of the cause. The right panel, Figure A2(b), shows a process in which that claim lacks external validity. Note that in each case, the claim does not match the causal process found in nature, and hence is not valid. Clearly, internal validity that results from a do(·) operation on Z or W is not sufficient to ensure a valid causal claim.

Figure A2

DAG representation of causal processes where the claim "α causes β in γ" lacks validity: (a) no construct validity and (b) no external validity.

The DAG in Figure A1 shows that a valid generalized causal statement is never with respect to the measured variables, since this would execute the do(·) operator on an outcome of the measurement process (which is a collider variable) rather than on the cause of interest. For example, consider the consequence of erroneously taking the measured variable A to be the actual cause. In this case, placing the do(·) operator on A demonstrates that the causal paths cannot recover the causal effect of interest; since A is itself an outcome, the do(·) operator in this case does not send any information along the causal path. Placing the do(·) operator on a given node deletes the arrows that point toward the node. Since A is a collider, this results in A simply disconnecting from the graph. The analogous problem occurs when taking C as the necessary conditions instead of γ.
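A small simulation, with hypothetical structural equations loosely matching Figure A1, makes the collider point concrete: A and B are observationally associated through their common latent cause α, yet intervening on the measured A moves nothing downstream, because no variable has A as a parent.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Hypothetical structural equations loosely matching Figure A1:
# A is a measurement (collider) of alpha and theta_alpha, while
# B measures beta, which is caused by alpha alone.
alpha = rng.integers(0, 2, n)
theta_a = rng.integers(0, 2, n)
A = alpha + 0.5 * theta_a + rng.normal(0, 0.1, n)
B = 0.5 * alpha + rng.normal(0, 0.1, n)

# Observationally, A predicts B through the common cause alpha:
hi = A > np.median(A)
print(f"E[B | A high] - E[B | A low] = {B[hi].mean() - B[~hi].mean():.2f}")

# do(A = a) replaces A's structural equation, deleting the arrows
# from alpha and theta_alpha into A. Since no variable has A as a
# parent, the interventional contrast on B is zero by construction:
print("E[B | do(A=1)] - E[B | do(A=0)] = 0.00  (B has no parent A)")
```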

Note that in all of these figures we are using a do(·) operator, and hence the assumption of internal validity holds in each case; but even then, the causal effect cannot be recovered or understood deductively without causal specification.

References

[1] Angrist JD, Pischke JS. The credibility revolution in empirical economics: how better research design is taking the con out of econometrics. J Econ Perspectives. 2010;24(2):3–30. 10.1257/jep.24.2.3

[2] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701. 10.1037/h0037350

[3] Pearl J. Causality: models, reasoning and inference. 2nd ed. New York, N.Y.: Cambridge University Press; 2009.

[4] Card D. Design-based research in empirical microeconomics. Amer Econ Rev. 2022;112(6):1773–81. 10.1257/aer.112.6.1773

[5] Banerjee AV, Duflo E. The experimental approach to development economics. Ann Rev Econ. 2009;1(1):151–78. 10.1146/annurev.economics.050708.143235

[6] Leamer EE. Let's take the con out of econometrics. Amer Econ Rev. 1983;73(1):31–43.

[7] LaLonde R. Evaluating the econometric evaluations of training programs with experimental data. Amer Econ Rev. 1986;76(4):604–20.

[8] Samii C. Causal empiricism in quantitative research. J Politics. 2016;78(3):941–55. 10.1086/686690

[9] Keele L. The statistics of causal inference: a view from political methodology. Politic Anal. 2015;23(3):313–35. 10.1093/pan/mpv007

[10] Gerber AS, Green DP. Field experiments: design, analysis and interpretation. New York, N.Y.: W.W. Norton; 2012.

[11] Morgan SL, Winship C. Counterfactuals and causal inference: methods and principles for social research. 2nd ed. New York, N.Y.: Cambridge University Press; 2015. 10.1017/CBO9781107587991

[12] Imbens GW, Rubin DB. Causal inference for statistics, social, and biomedical sciences: an introduction. New York, N.Y.: Cambridge University Press; 2015. 10.1017/CBO9781139025751

[13] Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37–48. 10.1097/00001648-199901000-00008

[14] Gelman A, Imbens G. Why ask why? Forward causal inference and reverse causal questions. NBER Working Paper No. 19614. Cambridge, MA: National Bureau of Economic Research; 2013. 10.3386/w19614

[15] Lundberg I, Johnson R, Stewart BM. What is your estimand? Defining the target quantity connects statistical evidence to theory. Amer Sociol Rev. 2021;86(3):532–65. 10.1177/00031224211004187

[16] Keele L, Minozzi W. How much is Minnesota like Wisconsin? Assumptions and counterfactuals in causal inference with observational data. Politic Anal. 2013;21(Spring):193–216. 10.1093/pan/mps041

[17] Findley MG, Kikuta K, Denly M. External validity. Ann Rev Politic Sci. 2021;24:1–51. 10.1146/annurev-polisci-041719-102556

[18] Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Boston, Mass.: Cengage Learning; 2002.

[19] Falleti TG, Lynch JF. Context and causal mechanisms in political analysis. Comparative Politic Stud. 2009;42(9):1143–66. 10.1177/0010414009331724

[20] Holland PW. Statistics and causal inference. J Amer Stat Assoc. 1986;81(Dec.):945–60. 10.1080/01621459.1986.10478354

[21] Gerber AS, Green DP, Larimer CW. Social pressure and voter turnout: evidence from a large-scale field experiment. Amer Politic Sci Rev. 2008;102(Feb.):33–48. 10.1017/S000305540808009X

[22] Angrist JD, Dynarski SM, Kane TJ, Pathak PA, Walters CR. Who benefits from KIPP? J Policy Anal Manag. 2012;31(4):837–60. 10.1002/pam.21647

[23] Banerjee A, Duflo E, Goldberg N, Karlan D, Osei R, Parienté W, et al. A multifaceted program causes lasting progress for the very poor: evidence from six countries. Science. 2015;348(6236):1260799. 10.1126/science.1260799

[24] Cartwright N. Nature's capacities and their measurement. New York, NY: Oxford University Press; 1994. 10.1093/0198235070.003.0005

[25] Pearl J, MacKenzie D. The book of why: the new science of cause and effect. New York, N.Y.: Basic Books; 2018.

[26] Slough T. Phantom counterfactuals. Am J Politic Sci. 2022;67(1):137–53. 10.1111/ajps.12715

[27] Slough T, Tyson SA. External validity and meta-analysis. Am J Politic Sci. 2022;67(2):440–55. 10.1111/ajps.12742

[28] Ashworth S, Berry CR, Bueno de Mesquita E. Theory and credibility: integrating theoretical and empirical social science. Princeton, NJ: Princeton University Press; 2021. 10.23943/princeton/9780691213828.001.0001

[29] Auspurg K, Brüderl J. Has the credibility of the social sciences been credibly destroyed? Reanalyzing the many analysts, one data set project. Socius. 2021;7. 10.1177/23780231211024421

[30] Kocher MA, Monteiro NP. Lines of demarcation: causation, design-based inference, and historical research. Perspectives Politics. 2016;14(4):952–75. 10.1017/S1537592716002863

[31] Munger K. Temporal validity as meta-science. Res Politic. 2023;10(3). 10.1177/20531680231187271

[32] Morton RB, Williams KC. Experimental political science and the study of causality: from nature to the lab. New York, NY: Cambridge University Press; 2010. 10.1017/CBO9780511762888

[33] Egami N, Hartman E. Elements of external validity: framework, design, and analysis. Amer Politic Sci Rev. 2023;117(3):1070–88. 10.1017/S0003055422000880

[34] Bareinboim E, Pearl J. Causal inference and the data-fusion problem. Proc Nat Acad Sci. 2016;113(27):7345–52. 10.1073/pnas.1510507113

[35] Cronbach LJ. Designing evaluations of educational and social programs. San Francisco: Jossey-Bass Publishers; 1982.

[36] Lewis D. Causation. J Philos. 1973;70:556–67. 10.2307/2025310

[37] Neyman J. Statistical problems in agricultural experimentation. Suppl J R Stat Soc. 1935;2:107–80. 10.2307/2983637

[38] Woodward J. Making things happen: a theory of causal explanation. New York, N.Y.: Oxford University Press; 2004. 10.1093/0195155270.001.0001

[39] Petersen ML, van der Laan MJ. Causal models and learning from data: integrating causal modeling and statistical estimation. Epidemiology. 2014;25(3):418–26. 10.1097/EDE.0000000000000078

[40] Manski C. Identification problems in the social sciences. Cambridge, MA: Harvard University Press; 1995.

[41] Rose MR. A dutiful voice: justice in the distribution of jury service. Law Soc Rev. 2005;39(3):601–34. 10.1111/j.1540-5893.2005.00235.x

[42] Boatright RG. Why citizens don't respond to jury summonses and what courts can do about it. Judicature. 1999;82:156–64.

[43] Arceneaux K, Nickerson DW. Comparing negative and positive campaign messages: evidence from two field experiments. Amer Politic Res. 2010;38(Jan.):54–83. 10.1177/1532673X09331613

[44] Gerber AS, Green DP. The effects of canvassing, telephone calls, and direct mail on voter turnout: a field experiment. Amer Politic Sci Rev. 2000;94(Sept.):653–63. 10.2307/2585837

[45] Bowler S, Esterling K, Holmes D. GOTJ: get out the juror. Politic Behav. 2014;36:515–33. 10.1007/s11109-013-9244-2

[46] Imai K, King G, Stuart EA. Misunderstandings among experimentalists and observationalists about causal inference. J R Stat Soc Ser A (Stat Soc). 2008;171(April):481–502. 10.1111/j.1467-985X.2007.00527.x

[47] Stokes S. A defense of observational research. In: Teele DL, editor. Field experiments and their critics. New Haven, Conn.: Yale University Press; 2014. p. 33–57. 10.12987/9780300199307-004

[48] Kim J. Causes and events: Mackie on causation. J Philos. 1971;68(14):426–41. 10.2307/2025175

[49] Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Amer Stat Assoc. 1996;91(June):444–55. 10.1080/01621459.1996.10476902

[50] Mackie JL. Causes and conditions. Am Philos Quarter. 1965;2:245–65.

[51] Rothman K. Causes. Amer J Epidemiol. 1976;104:587–92. 10.1093/oxfordjournals.aje.a112335

[52] Paul LA, Hall N. Causation: a user's guide. Oxford: Oxford University Press; 2013. 10.1093/acprof:oso/9780199673445.001.0001

[53] Heckman JJ. The scientific model of causality. Sociol Methodol. 2005;35(1):1–97. 10.1111/j.0081-1750.2005.00164.x

[54] Glennan S. The new mechanical philosophy. Oxford: Oxford University Press; 2017. 10.1093/oso/9780198779711.001.0001

[55] Cook TD, Tang Y, Seidman Diamond S. Causally valid relationships that invoke the wrong causal agent: construct validity of the cause in policy research. J Soc Soc Work Res. 2014;5(4):379–414. 10.1086/679289

[56] VanderWeele TJ, Hernán MA. From counterfactuals to sufficient component causes and vice versa. Europ J Epidemiol. 2006;21(12):855–8. 10.1007/s10654-006-9075-0

[57] Cartwright N. The art of medicine: a philosopher's view of the long road from RCTs to effectiveness. Lancet. 2011;377(9775):1400–1. 10.1016/S0140-6736(11)60563-1

[58] Schaffer J. The metaphysics of causation. In: Zalta E, Nodelman U, editors. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; 2022. https://plato.stanford.edu/archives/spr2022/entries/causation-metaphysics/.

[59] Kruglanski AW, Kroy M. Outcome validity in experimental research: a re-conceptualization. Represent Res Soc Psychol. 1976;7(2):166–78.

[60] Kelly TL. Interpretation of educational measurements. Yonkers-on-Hudson, N.Y.: World Book; 1927.

[61] Borsboom D, Mellenbergh GJ, van Heerden J. The concept of validity. Psychol Rev. 2004;111(4):1061–71. 10.1037/0033-295X.111.4.1061

[62] Feest U. Construct validity in psychological tests – the case of implicit social cognition. Europ J Philos Sci. 2020;10(1):1–24. 10.1007/s13194-019-0270-8

[63] Jiménez-Buedo M. Conceptual tools for assessing experiments: some well entrenched confusions regarding the internal/external validity distinction. J Econ Methodol. 2011;18(3):271–82. 10.1080/1350178X.2011.611027

[64] Sullivan JA. The multiplicity of experimental protocols: a challenge to reductionist and non-reductionist models of the unity of neuroscience. Synthese. 2009;167(3):511–39. 10.1007/s11229-008-9389-4

[65] Edwards JK, Cole SR, Westreich D. All your data are always missing: incorporating bias due to measurement error into the potential outcomes framework. Int J Epidemiol. 2015;44(4):1452–9. 10.1093/ije/dyu272

[66] Dafoe A, Zhang B, Caughey D. Information equivalence in survey experiments. Politic Anal. 2018;26(4):399–416. 10.1017/pan.2018.9

[67] Banerjee AV, Chassang S, Snowberg E. Decision theoretic approaches to experiment design and external validity. Handbook Econ Field Experiments. 2017;1:141–74. 10.1016/bs.hefe.2016.08.005

[68] Campbell DT. Factors relevant to the validity of experiments in social settings. Psychol Bulletin. 1957;54(4):297–312. 10.1037/h0040950

[69] Guala F. Experimental localism and external validity. Philos Sci. 2003;70:1195–205. 10.1086/377400

[70] Adcock R, Collier D. Measurement validity: a shared standard for qualitative and quantitative research. Amer Politic Sci Rev. 2001;95(3):529–46. 10.1017/S0003055401003100

[71] Sánchez AR. Is it just noise? Measuring unobservable cognitive abilities in early childhood. Personality Individual Differ. 2020;166:110162. 10.1016/j.paid.2020.110162

[72] Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bulletin. 1955;52(4):281–302. 10.1037/h0040957

[73] Alexandrova A. A philosophy for the science of well-being. New York, N.Y.: Oxford University Press; 2017. 10.1093/oso/9780199300518.001.0001

[74] Sartori G. Concept misformation in comparative politics. Amer Politic Sci Rev. 1970;64(4):1033–53. 10.2307/1958356

[75] Dunning T. Improving causal inference: strengths and limitations of natural experiments. Politic Res Quarterly. 2008;61(June):282–93. 10.1177/1065912907306470

[76] Fong C, Grimmer J. Causal inference with latent treatments. Am J Politic Sci. 2021;67(2):374–89. 10.1111/ajps.12649

[77] Julnes G. Review of experimental and quasi-experimental designs for generalized causal inference. Evaluat Program Plan. 2004;27:173–85. 10.1016/j.evalprogplan.2004.01.006

[78] Cook TD. Generalizing causal knowledge in the policy sciences: external validity as a task of both multiattribute representation and multiattribute extrapolation. J Policy Anal Manag. 2014;33(2):527–36. 10.1002/pam.21750

[79] Guala F. The methodology of experimental economics. New York, N.Y.: Cambridge University Press; 2005. 10.1017/CBO9780511614651

[80] Deaton A, Cartwright N. Understanding and misunderstanding randomized controlled trials. Soc Sci Med. 2018;210:2–21. 10.1016/j.socscimed.2017.12.005

[81] Deaton A. Randomization in the tropics revisited: a theme and eleven variations. In: Bedecarrats F, Guerin I, Roubaud F, editors. Randomized control trials in the field of development. New York, N.Y.: Oxford University Press; 2019. p. 29–46. 10.1093/oso/9780198865360.003.0002

[82] Pritchett L, Sandefur J. Context matters for size: why external validity claims and development practice do not mix. J Globalization Development. 2013;4(Dec.):161–97. 10.1515/jgd-2014-0004

[83] Vivalt E. How much can we generalize from impact evaluations? J Europ Econ Assoc. 2020;18(6):3045–89. 10.1093/jeea/jvaa019

[84] Weiss MJ, Bloom HS, Brock T. A conceptual framework for studying the sources of variation in program effects. J Policy Anal Manag. 2014;33(3):778–808. 10.1002/pam.21760

[85] Peters J, Langbein J, Roberts G. Generalization in the tropics – development policy, randomized controlled trials, and external validity. World Bank Res Observer. 2018;33(1):34–64. 10.1093/wbro/lkx005

[86] Henrich J, Heine SJ, Norenzayan A. The weirdest people in the world? Behav Brain Sci. 2010;33(2–3):61–83. 10.1017/S0140525X0999152X

[87] Ravallion M. Fighting poverty one experiment at a time. J Econ Literature. 2012;50:103–14. 10.1257/jel.50.1.103

[88] Cartwright N, Hardie J. Evidence-based policy: a practical guide to doing it better. New York, N.Y.: Oxford University Press; 2012. 10.1093/acprof:osobl/9780199841608.001.0001

[89] Abend G. Making things possible. Sociol Meth Res. 2020;51(1):68–107. 10.1177/0049124120926204

[90] Jackson G, Helfen M, Kaplan R, Kirsch A, Lohmeyer N. The problem of de-contextualization in organization and management research. Res Soc Organizations. 2019;59:21–42. 10.1108/S0733-558X20190000059001

[91] Sampson RJ, Winship C, Knight C. Translating causal claims: principles and strategies for policy-relevant criminology. Criminol Public Policy. 2013;12(4):587–616. 10.1111/1745-9133.12027

[92] Pearl J. Sufficient causes: on oxygen, matches, and fires. J Causal Inference. 2019;7(2):20190026. 10.1515/jci-2019-0026

[93] Kern HL, Stuart EA, Hill J, Green DP. Assessing methods for generalizing experimental impact estimates to target populations. J Res Educ Effect. 2016;9(1):103–27. 10.1080/19345747.2015.1060282

[94] Nguyen TQ, Ebnesajjad C, Cole SR, Stuart EA. Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects. Ann Appl Stat. 2017;11(1):225–47. 10.1214/16-AOAS1001

[95] Humphreys M, Scacco A. The aggregation challenge. World Development. 2020;127:104806. 10.1016/j.worlddev.2019.104806

[96] Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A Stat Soc. 2011;174(2):369–86. 10.1111/j.1467-985X.2010.00673.x

[97] Cook T. Causal generalization: how Campbell and Cronbach influenced my theoretical thinking on this topic, including in Shadish, Cook, and Campbell. In: Alkin MC, editor. Evaluation roots. Thousand Oaks, CA: Sage Publications, Inc.; 2012. p. 89–112. 10.4135/9781412984157.n5

[98] Gailmard S. Theory, history and political economy. J Historic Politic Econ. 2021;1(1):69–104. 10.1561/115.00000003

[99] Nosek BA, Errington TM. What is replication? PLOS Biol. 2020;18(3):e3000691. 10.1371/journal.pbio.3000691

[100] Cook TD. Twenty-six assumptions that have to be met if single random assignment experiments are to warrant "gold standard" status: a commentary on Deaton and Cartwright. Soc Sci Med. 2018;210:37–40. 10.1016/j.socscimed.2018.04.031

[101] Deaton A, Cartwright N. Reflections on randomized control trials. Soc Sci Med. 2018;210:86–90. 10.1016/j.socscimed.2018.04.046

[102] Craver CF, Darden L. In search of mechanisms: discoveries across the life sciences. Chicago: University of Chicago Press; 2013. 10.7208/chicago/9780226039824.001.0001

[103] Munger K, Guess AM, Hargittai E. Quantitative description of digital media: a modest proposal to disrupt academic publishing. J Quantitative Descript Digital Media. 2021;1:1–13. 10.51685/jqd.2021.000

[104] Barnes J, Weller N. Case studies and analytic transparency in causal-oriented mixed-methods research. Politic Sci Politics. 2017;50(4):1019–22. 10.1017/S1049096517001202

[105] Beach D. Process-tracing methods in social science. In: Oxford Research Encyclopedia of Politics. Oxford, England: Oxford University Press; 2017. 10.1093/acrefore/9780190228637.013.176

[106] Collier D. Understanding process tracing. Politic Sci Politics. 2011;44(4):823–30. 10.1017/S1049096511001429

[107] Collier D, Brady HE, Seawright J. Outdated views of qualitative methods: time to move on. Politic Anal. 2010;18(4):506–13. 10.1093/pan/mpq022

[108] Ankel-Peters J, Fiala N, Neubauer F. Do economists replicate? J Econ Behav Organiz. 2023;212:219–32. 10.1016/j.jebo.2023.05.009

[109] Dunning T, Grossman G, Humphreys M, Hyde SD, McIntosh C, Nellis G. Information, accountability and cumulative learning. New York, N.Y.: Cambridge University Press; 2019. 10.1017/9781108381390

[110] Machery E, Knobe J, Stich SP. Editorial: cultural variation and cognition. Rev Philos Psychol. 2023;14(2):339–47. 10.1007/s13164-023-00687-9

[111] Gilbert DT, King G, Pettigrew S, Wilson TD. Comment on "Estimating the reproducibility of psychological science". Science. 2016;351(6277):1037. 10.1126/science.aad7243

[112] Stich SP, Machery E. Demographic differences in philosophical intuition: a reply to Joshua Knobe. Rev Philos Psychol. 2023;14:423–56. 10.1007/s13164-021-00609-7

[113] Muthukrishna M, Henrich J. A problem in theory. Nature Human Behav. 2019;3(3):221–9. 10.1038/s41562-018-0522-1

[114] Price MA, Weisz JR, McKetta S, Hollinsaid NL, Lattanner MR, Reid AE, et al. Meta-analysis: are psychotherapies less effective for Black youth in communities with higher levels of anti-Black racism? J Amer Acad Child Adolescent Psychiatry. 2022;61(6):754–63. 10.1016/j.jaac.2021.07.808

[115] Price MA, McKetta S, Weisz JR, Ford JV, Lattanner MR, Skov H, et al. Cultural sexism moderates efficacy of psychotherapy: results from a spatial meta-analysis. Clin Psychol Sci Practice. 2021;28(3):299–312. 10.1037/cps0000031

[116] Pearl J. On the consistency rule in causal inference: axiom, definition, assumption, or theorem? Epidemiology. 2010;21(6):872–5. 10.1097/EDE.0b013e3181f5d3fd

Received: 2024-01-10
Revised: 2024-07-21
Accepted: 2024-11-20
Published Online: 2025-02-21

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
