Abstract
Configurational Comparative Methods (CCMs) aim to learn causal structures from datasets by exploiting Boolean sufficiency and necessity relationships. One important challenge for these methods is that such Boolean relationships are often not satisfied in real-life datasets, as these datasets usually contain noise. Hence, CCMs infer models that only approximately fit the data, introducing a risk of inferring incorrect or incomplete models, especially when data are also fragmented (have limited empirical diversity). To minimize this risk, evaluation measures for sufficiency and necessity should be sensitive to all relevant evidence. This article points out that the standard evaluation measures in CCMs, consistency and coverage, neglect certain evidence for these Boolean relationships. Correspondingly, two new measures, contrapositive consistency and contrapositive coverage, which are equivalent to the binary classification measures specificity and negative predictive value, respectively, are introduced to the CCM context as additions to consistency and coverage. A simulation experiment demonstrates that the introduced contrapositive measures indeed help to identify correct CCM models.
1 Introduction
Configurational Comparative Methods (CCMs) aim to learn causal structures from datasets. These methods can infer structures with highly complex causal interactions from relatively small datasets without requiring unconditional dependence between individual causes and their effects. Whereas many standard methods of causal structure learning – most notably Bayesian network methods [1] – rely on the faithfulness assumption or relaxed versions of it and assume pairwise dependence between individual causes and their effects [2,3], CCMs do not rely on any version of faithfulness.
They are widely used in various research fields. For instance, the prominent CCM Qualitative Comparative Analysis (QCA) has been applied in thousands of studies across disciplines like political science, business research, environmental science, information management, and education science. This includes studies on topics such as investments in renewable energy, UN sanctions, technology transfer, educational poverty, national innovation performance, flood risk management, digital transformation strategies, and youth wellbeing. The more recently developed CCM Coincidence Analysis (CNA) has seen a significant uptick in applications in health research, including high-impact studies on topics like liver damage, opioid treatment, cancer care, surgical site infection reduction, and obesity treatment.[1]
CCMs aim to derive causally interpretable models from data by exploiting Boolean sufficiency and necessity relationships. One important challenge for these methods is that these Boolean relationships are often not strictly satisfied in real-life datasets, because such datasets usually contain noise. Hence, CCMs infer models that only approximately fit the data, introducing a risk of inferring incorrect or incomplete models, especially when data are also fragmented (have limited empirical diversity). To minimize this risk, evaluation measures for sufficiency and necessity should be sensitive to all relevant evidence: they should reward all evidence in favour of sufficiency and necessity and penalize all counterevidence. Charles Ragin introduced consistency and coverage to evaluate sufficiency and necessity [5,6]. Roughly speaking, consistency and coverage measure how often sufficiency and necessity are satisfied in a dataset. These measures were initially defined for so-called crisp-set variables, which can only take one of two values for each case in the dataset, e.g. 1 or 0, or true or false, and were then generalized for fuzzy-set variables, which can take any real number between and including 0 and 1 as their values.
Consistency and coverage are currently the standard evaluation measures in CCMs. They have been accepted on common sense grounds without explicit argumentation by Ragin or others. In other methodological frameworks, for instance in binary classification [7] and in association rule learning [8], there has been extensive research on evaluation measures and their properties. While some of the measures studied in those frameworks are potential additions to or alternatives for consistency and coverage, no systematic studies have been conducted on which measures would be best suited for evaluating sufficiency and necessity in CCMs. Moreover, with the exception of Goertz [9] and Schneider and Wagemann [10], which will be discussed in Section 4.4, the limited existing research on improving consistency and coverage in CCMs has focused on issues that only arise when generalizing these measures to the fuzzy-set case.
This article studies measures for evaluating sufficiency and necessity in crisp-set, multi-value, and fuzzy-set CCMs, by drawing inspiration from the field of binary classification. I show an analogy between CCMs and binary classification that enables the application of certain insights from binary classification evaluation to CCM evaluation. This inspires the introduction of two new evaluation measures to CCMs, while also paving the way for transferring more tools from binary classification to CCMs. The two introduced additional measures are sensitive to evidence that is neglected by current consistency and coverage. When evaluating the sufficiency of an antecedent for an outcome, the introduced contrapositive measures additionally take into account the cases in which both the antecedent and the outcome are absent, and they do the same when evaluating necessity.
2 Background
Many disciplines investigate causal structures featuring conjunctivity and disjunctivity [11–13]. Conjunctivity means that causes are complex bundles that only bring about the outcome when all components of the bundle are instantiated jointly; it is modelled by Boolean conjunctions. Disjunctivity means that an outcome can be brought about along several alternative paths, each of which is sufficient by itself; it is modelled by Boolean disjunctions.
Causal inference methods face severe challenges when used for discovering causal structures with conjunctivity and disjunctivity, because causally related variables are often not pairwise dependent in such structures. For instance, when studying causal structures with a phenotypic expression of genes as the outcome, a specific allele at a specific locus can be uncorrelated with the phenotypic expression under investigation, even though, combined with specific alleles on other loci, it is located on a causal path for the phenotypic expression [14,15].[4] As Bayesian network methods and standard regression methods rely on the faithfulness assumption or relaxed versions of it, they struggle to infer such structures – even from ideal data. There do exist protocols for handling interaction effects with two or three exogenous variables for these methods, but such protocols face multicollinearity issues and have tight computational complexity restrictions [17].
Discovering causal structures featuring conjunctivity and disjunctivity requires a method that does not assume faithfulness and that can model complex causal structures of many individual causes by means of complex Boolean expressions. CCMs are designed for precisely this purpose.
Suppose that CCMs are used on a dataset with crisp-set variables $A$, $B$, $C$, $D$, and $Y$, each of which takes the value 1 or 0 for every case in the dataset. Following CCM notation, an uppercase letter expresses that a variable takes the value 1 and a lowercase letter that it takes the value 0, $*$ denotes conjunction, $+$ disjunction, and $\leftrightarrow$ the biconditional. A CCM analysis of such a dataset may, for instance, return the following model:

$$A*b + C*D \leftrightarrow Y. \quad (1)$$

Model (1) says that the antecedent $A*b + C*D$ is true if and only if the outcome $Y$ is true: whenever $A$ takes the value 1 and $B$ the value 0, or $C$ and $D$ both take the value 1, $Y$ takes the value 1, and whenever neither disjunct is satisfied, $Y$ takes the value 0.
According to the theory of causation underlying CCMs [20], model (1) represents a true causal structure only if its antecedent is a minimally necessary disjunction of minimally sufficient conjunctions for $Y$.
Hence, causal interpretability of CCM models requires that each disjunct of the antecedent is sufficient for the outcome and that the antecedent itself is necessary for the outcome. Moreover, each of these sufficiency and necessity relationships must be minimal. These minimality requirements can, in turn, be expressed in terms of the absence of sufficiency and necessity. For instance, the sufficiency of $A*b$ is minimal only if neither $A$ nor $b$ is sufficient for $Y$ on its own, and the necessity of the antecedent is minimal only if no proper part of it is itself necessary for $Y$.
Sufficiency and necessity are deterministic relationships. Correspondingly, the INUS theory of causation underlying CCMs [18,19] posits that relationships between causes and their effects are deterministic, meaning that they hold in a dataset if and only if they are satisfied for every case in that dataset. However, these deterministic relationships between causes and their effects do not hold in many real-life datasets due to the presence of noise. Noise refers to cases in a dataset that are incompatible with the true causal structure underlying that dataset, i.e. the data-generating causal structure (DGCS) [21] (p. 530). For instance, in a dataset with model (1) as its DGCS, a noisy case is a case for which the antecedent $A*b + C*D$ and the outcome $Y$ take different values.
In addition, deterministic relationships that do not hold in reality may nonetheless hold in some real-life datasets due to the presence of fragmentation. Fragmentation means that not all possible configurations of exogenous variables are included in the dataset. So, in a non-fragmented dataset generated from model (1), every combination of values of $A$, $B$, $C$, and $D$ is instantiated by at least one case.
For noise-free, non-fragmented datasets, CCMs are guaranteed to infer results that are compatible with the true DGCS by simply accepting sufficiency and necessity relationships that are always satisfied in the dataset and rejecting such relationships if they are violated at least once. However, CCMs have been developed to infer reliable causal models from real-life datasets, and such datasets usually feature noise and fragmentation. If there is noise in the data, and thus any case violating a sufficiency or necessity relationship may be a noisy case, then rejecting a sufficiency or necessity relationship when there is at least one case violating it is too strict. Furthermore, data fragmentation entails that there may be cases that are compatible with the DGCS and that violate a considered sufficiency or necessity relationship, but that are not included in the dataset. Therefore, the complete absence of cases violating a considered relationship in a dataset is too weak a requirement for accepting that relationship. That is why CCMs use measures like consistency and coverage for evaluating sufficiency and necessity. Such evaluation measures aim to assess how likely these deterministic relationships are to hold in the noise-free, non-fragmented version of a dataset based on the observed version of that dataset, by rewarding cases that satisfy them and penalizing cases that violate them. Good evaluation measures enable CCMs to obtain reliable results even for datasets featuring certain degrees of noise and fragmentation.
I will propose new evaluation measures for CCMs based on evaluation measures in binary classification. The aim of binary classification is to obtain models that predict the value of a crisp-set (binary) outcome variable. Consider, for example, a binary classification model for outcome $Y$ that predicts $Y$ to be present exactly for the cases satisfying its antecedent. Such a model has the same form as CCM model (1), and this structural correspondence is what allows evaluation measures for binary classification to be transferred to CCMs.
3 Model evaluation in binary classification and CCMs
3.1 The confusion matrix
In binary classification, the performance of an antecedent on a dataset is often evaluated by means of the confusion matrix, which groups cases in the dataset into four fields based on whether the antecedent and outcome are positive for them [23]. Let $X$ denote the antecedent and $Y$ the outcome, let lowercase $x$ and $y$ denote their negations, and let $N$ denote the total number of cases in the dataset.
In the binary classification confusion matrix shown in Table 1(a), the field with a positive antecedent and a positive outcome contains the true positives (TP), and the field with a positive antecedent and a negative outcome contains the false positives (FP).
Table 1: Confusion matrix for binary classification (a) and CCMs (b)
The field with a negative antecedent and a positive outcome contains the false negatives (FN), and the field with a negative antecedent and a negative outcome contains the true negatives (TN). In the CCM confusion matrix in Table 1(b), the corresponding fields are written in Boolean notation as $X*Y$, $X*y$, $x*Y$, and $x*y$, and their cell sizes as $|X*Y|$, $|X*y|$, $|x*Y|$, and $|x*y|$.
3.2 Evaluation measures
Ratios of cell sizes of the confusion matrix are used as evaluation measures or parameters of fit, as they are usually called in CCMs. In binary classification, the ratio of true positives to all cases with a positive outcome, $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, is called sensitivity (or recall), and the ratio of true negatives to all cases with a negative outcome, $\mathrm{TN}/(\mathrm{TN}+\mathrm{FP})$, is called specificity [25].

The ratios over the columns, called positive predictive value (PPV) and negative predictive value (NPV), are also used for evaluating binary classification models. PPV is the ratio of true positives to all cases with a positive antecedent, $\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$, and NPV is the ratio of true negatives to all cases with a negative antecedent, $\mathrm{TN}/(\mathrm{TN}+\mathrm{FN})$.

Sensitivity and specificity, on the one hand, and PPV and NPV, on the other, are often used as pairs in binary classification, each pair having its own specific purpose. Yet, in some application domains, PPV and sensitivity are considered together instead.[6] Considering PPV in addition to sensitivity for model evaluation instead of using the more standard pair sensitivity and specificity is motivated by low prevalence [24,26–28]. Prevalence is the proportion of cases in the dataset for which the outcome is positive (i.e. $|Y|/N$). When prevalence is very low, cases with a negative outcome dominate the dataset, so even a model with many false positives can attain a high specificity; PPV, by contrast, remains sensitive to false positives.

In CCMs, only one pair of evaluation measures is used: consistency and coverage [6]. They too are ratios of cell sizes of the confusion matrix. Consistency is the ratio of $|X*Y|$ to the number of cases with a positive antecedent, $|X*Y|/|X|$, and coverage is the ratio of $|X*Y|$ to the number of cases with a positive outcome, $|X*Y|/|Y|$. Consistency is thus equivalent to PPV, and coverage to sensitivity, whereas specificity and NPV currently have no CCM counterparts.
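To make these ratios concrete, here is a minimal R sketch with hypothetical 0/1 vectors and illustrative variable names (not code from any CCM package) that computes the four confusion-matrix cells and the measures just defined:

```r
## Minimal sketch (hypothetical 0/1 vectors, illustrative names): the four
## confusion-matrix cells and the evaluation measures defined above.
X <- c(1, 1, 1, 0, 0, 0, 0, 1)  # antecedent value per case
Y <- c(1, 1, 0, 0, 0, 1, 0, 1)  # outcome value per case

TP <- sum(X == 1 & Y == 1)  # |X*Y|
FP <- sum(X == 1 & Y == 0)  # |X*y|
FN <- sum(X == 0 & Y == 1)  # |x*Y|
TN <- sum(X == 0 & Y == 0)  # |x*y|

sensitivity <- TP / (TP + FN)  # = coverage in CCM terms
specificity <- TN / (TN + FP)
ppv <- TP / (TP + FP)          # = consistency in CCM terms
npv <- TN / (TN + FN)
```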
4 Alternative measures for sufficiency and necessity
4.1 Specificity, or contrapositive consistency
Currently, consistency alone is used as a measure for the sufficiency of $X$ for $Y$. Consistency rewards cases of type $X*Y$ and penalizes cases of type $X*y$, but it is insensitive to the cases in which $X$ is absent.
Measuring how often $\neg X$ holds among the $\neg Y$ cases provides a second test of the same sufficiency claim, because $X \rightarrow Y$ is logically equivalent to its contrapositive $\neg Y \rightarrow \neg X$: any evidence for the latter is evidence for the former.
To see the logical equivalence of $X \rightarrow Y$ and $\neg Y \rightarrow \neg X$, consider their truth table:
| $X$ | $Y$ | $X \rightarrow Y$ | $\neg Y \rightarrow \neg X$ |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 0 | 0 | 1 | 1 |
As the columns for $X \rightarrow Y$ and $\neg Y \rightarrow \neg X$ are identical, the two expressions are satisfied and violated by exactly the same cases [29]. To bring out which cases the measures respond to, consistency, $|X*Y|/|X|$, can be reformulated as $1 - |X*y|/|X|$.
This reformulation shows how consistency penalizes cases of type $X*y$, the cases violating $X \rightarrow Y$, while being entirely insensitive to cases of type $x*y$, even though such cases satisfy the contrapositive $\neg Y \rightarrow \neg X$ and thus constitute evidence for the sufficiency of $X$ for $Y$. I therefore propose contrapositive consistency, $|x*y|/|y| = 1 - |X*y|/|y|$, as an additional measure; it is equivalent to specificity in binary classification.
So, contrapositive consistency penalizes the same violating cases, those of type $X*y$, but relative to all cases with a negative outcome, thereby rewarding the $x*y$ cases that consistency neglects.
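The following R sketch, with hypothetical data and function names of my own, implements consistency and contrapositive consistency as just defined and checks the contrapositive relationship numerically:

```r
## Sketch: consistency and contrapositive consistency for crisp-set data.
## `X` and `Y` are 0/1 vectors; the function names are my own.
consistency        <- function(X, Y) sum(X & Y) / sum(X)    # |X*Y| / |X|
contra_consistency <- function(X, Y) sum(!X & !Y) / sum(!Y) # |x*y| / |y|

## Contraposition in action: contrapositive consistency of X -> Y equals
## the plain consistency of the contraposed claim y -> x.
X <- c(1, 1, 0, 0, 0, 1); Y <- c(1, 1, 1, 0, 0, 1)
all.equal(contra_consistency(X, Y), consistency(1 - Y, 1 - X))  # TRUE
```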
4.2 Example in favour of contrapositive consistency
An example shows that considering contrapositive consistency in addition to consistency can provide extra information about the sufficiency of $X$ for $Y$. Suppose a researcher investigates whether a given configuration of working conditions ($X$) is sufficient for high intrinsic motivation ($Y$) among the employees of a company [30,31].
Table 2(a) shows the confusion matrix for this CCM model for a fictional example dataset.
Table 2: Confusion matrices for the first company (a) and the second company (b)
Compare this to the confusion matrix in Table 2(b), taken from a fictional study on a different company.
Now, consider contrapositive consistency for both confusion matrices: although the two matrices yield the same consistency score, they differ substantially in contrapositive consistency, because they contain very different proportions of $x*y$ and $x*Y$ cases.
The example demonstrates that the strength of evidence for the sufficiency of $X$ for $Y$ can differ between datasets that consistency alone cannot tell apart, and that contrapositive consistency registers this difference. It is particularly valuable when prevalence is high, since in high-prevalence datasets almost any antecedent attains a high consistency score.
4.3 NPV, or contrapositive coverage
For measuring the necessity of $X$ for $Y$, coverage alone is currently used. Necessity, $Y \rightarrow X$, is logically equivalent to its contrapositive $\neg X \rightarrow \neg Y$, and coverage, $|X*Y|/|Y|$, can be reformulated as $1 - |x*Y|/|Y|$.
This makes clear that coverage penalizes cases of type $x*Y$, the cases violating the necessity of $X$ for $Y$, while being insensitive to cases of type $x*y$, even though such cases satisfy the contrapositive $\neg X \rightarrow \neg Y$ and thus constitute evidence for the necessity of $X$ for $Y$. I therefore propose contrapositive coverage, $|x*y|/|x| = 1 - |x*Y|/|x|$, as an additional measure; it is equivalent to NPV in binary classification.
An example analogous to the one given for contrapositive consistency in Section 4.2 demonstrates that cases of type $x*y$ carry evidence about necessity that coverage neglects. Table 3(a) shows the confusion matrix for a fictional study on a third company.
Table 3: Confusion matrices for the third company (a) and a fourth company (b)
Compare this to the confusion matrix for yet another company in Table 3(b).
Now, consider contrapositive coverage for both confusion matrices: although the two matrices yield the same coverage score, they differ substantially in contrapositive coverage.
This is analogous to contrapositive consistency, which, as discussed in Section 4.2, is particularly informative for evaluating sufficiency when prevalence is high. Correspondingly, contrapositive coverage is particularly informative for evaluating necessity when the proportion of cases with a positive antecedent, $|X|/N$, is high, since in such datasets almost any antecedent attains a high coverage score.
4.4 The relevance of $x*y$ in CCMs
Before presenting a simulation experiment which demonstrates the usefulness of contrapositive consistency and contrapositive coverage for evaluating CCM models, I show that the existing CCM literature does not contain any objections against these contrapositive measures and that no evaluation measures with the same rationale or specification as contrapositive consistency or contrapositive coverage have been proposed for CCMs before.
Contrapositive consistency and contrapositive coverage depend on the number of cases of type $x*y$, that is, cases in which both the antecedent and the outcome are absent. The relevance of such cases has been debated in the CCM literature before, so it must be checked whether that debate yields objections to the contrapositive measures.
Seawright [32] started a discussion on the relevance of cases with $x*y$ for testing hypotheses of necessary and/or sufficient causation, which prompted several replies [33,34]. That debate concerns which cases should be collected and included in a dataset, not how sufficiency and necessity should be evaluated on a given dataset, and therefore contains no objection against the contrapositive measures.
Goertz [9] introduces his measure with the aim of evaluating the importance of necessary conditions. According to Goertz, this importance can be analyzed partly in terms of trivialness: if there are no or almost no cases in which the condition is absent, then the condition is trivially necessary. His nontrivialness measure, discussed in Appendix B, accordingly rewards cases in which the condition is absent.
Schneider and Wagemann likewise connect trivialness to the size of $|x|$ and propose the Relevance of Necessity (RoN) measure, which is discussed in Appendix C [10]. Like Goertz's measure, RoN assigns higher values the more cases there are in which the antecedent is absent.
The motivation for Relevance of Necessity is to some extent similar to my motivation for introducing contrapositive coverage, but there is an important distinction: while Schneider and Wagemann count every case with an absent antecedent in favour of the relevance of a necessary condition, contrapositive coverage differentiates between these cases, rewarding cases of type $x*y$ as evidence for necessity and penalizing cases of type $x*Y$ as counterevidence. Neither Goertz's measure nor RoN thus shares the rationale or the specification of the contrapositive measures.
A few other CCM researchers have formulated critiques of consistency and coverage and, correspondingly, have proposed alternative variants of these measures, e.g. [35–38]. Ragin himself introduces the so-called PRI measure as a "more refined and conservative measure of consistency" [39] (p. 50). These proposals concern problems that arise exclusively for fuzzy-set data, not for crisp-set or multi-value data. My proposal in the current article applies not only to fuzzy-set data but also to crisp-set and multi-value data and points to the fact that consistency and coverage underestimate the evidence in favour of sufficiency and necessity in all of these data types.
In sum, I have argued that contrapositive consistency should be considered in addition to consistency for evaluating sufficiency and that contrapositive coverage should be taken into account in addition to coverage for evaluating necessity in CCMs. Examining contrapositive consistency and contrapositive coverage is especially important when prevalence and $|X|/N$, respectively, are high, since in those circumstances consistency and coverage are almost inevitably high and therefore carry little evidence.
5 Simulation experiment
5.1 Set-up
The purpose of the simulation experiment is to compare the correctness of CCM models that meet certain contrapositive consistency and contrapositive coverage thresholds to the correctness of models that do not meet these thresholds, for high-prevalence datasets. A CCM model is correct if and only if it does not make any false causal relevance ascriptions, meaning that it is a submodel of the DGCS underlying the dataset being analyzed [40] (pp. 8–9). For instance, the model $A*b \leftrightarrow Y$ is a submodel of model (1), $A*b + C*D \leftrightarrow Y$, and thus correct relative to that DGCS, whereas the model $A*D \leftrightarrow Y$ makes false causal relevance ascriptions and is incorrect. A correct model may still be incomplete, in that it recovers only part of the DGCS.
The experiment consists of a series of 50,000 inverse search trials in which noisy and fragmented datasets are generated from randomly drawn DGCSs, after which a CCM analysis is performed on each of these datasets with the aim of recovering the corresponding DGCS. The formation of the random DGCS in each trial begins by randomly generating a disjunction of conjunctions comprising up to six unique variables and with two to four disjuncts each consisting of two to four conjuncts. This expression is then minimized to a well-formed CCM antecedent, which is paired with an outcome to create the DGCS. Subsequently, a dataset composed of all cases compatible with the DGCS is generated, after which a portion of these cases is removed to simulate fragmentation. The proportion of cases to be removed is sampled uniformly from the interval 0.2–0.5. From the resulting fragmented and noise-free dataset, 100 observations are sampled with replacement, and then a portion of these cases are replaced by cases incompatible with the DGCS to simulate noise. The proportion of cases to be replaced by noisy cases is sampled uniformly from the interval 0.05–0.3. Finally, datasets with prevalence varying from 0.6 to 0.9 in steps of 0.05 are created for each of the 50,000 DGCSs by appropriately duplicating cases with positive or negative outcomes while preserving the proportions of fragmentation and noise. This allows for conducting inverse search trials at systematically varied prevalence. In line with the conclusions presented in Section 4, the experiment only includes prevalence levels of 0.6 and above. Additional research is needed to explore the effectiveness of the contrapositive thresholds on datasets with lower prevalence.
The inverse search trials are conducted using the CCM CNA [16]. To ensure generalizability to CNA studies with various settings, the trials are conducted at systematically varied consistency and coverage thresholds. The consistency threshold specifies the minimum consistency required for accepting a conjunction as sufficient, and the coverage threshold specifies the minimum coverage required for accepting a disjunction of conjunctions as necessary. The consistency and coverage thresholds are varied independently between 0.7 and 0.85 in steps of 0.05, giving rise to 16 different consistency-coverage combinations.
CNA returns all CCM models that meet the specified consistency and coverage thresholds. Its output is empty if no models meet these thresholds. After conducting the CNA analyses, the experiment proceeds by introducing contrapositive consistency and contrapositive coverage thresholds. The contrapositive consistency threshold is set equal to the consistency threshold, and the contrapositive coverage threshold is set equal to the coverage threshold for the given trial. The returned models that meet both contrapositive thresholds ("high contrapositive models") are separated from those that do not ("low contrapositive models"), and the correctness of each resulting group of models is recorded. If one or both groups are empty in a trial, then no correctness is recorded for that group in that trial. For each prevalence-consistency-coverage combination, the correctness per group is averaged across the trials, leading to a comparison of the average correctness of high contrapositive models to that of low contrapositive models per output. This avoids a disproportionate influence of outputs containing many models when testing whether selecting high contrapositive models helps to obtain correct causal models.
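For concreteness, the following R sketch outlines a single trial along these lines, using the functions full.ct, randomAsf, selectCases, ct2df, cna, asf, and is.submodel from the cna package. The thresholds and proportions are illustrative; the exact experimental code is the one in the repository referenced in the data availability statement, not this sketch.

```r
## Sketch of one inverse-search trial, loosely following the set-up above.
## Assumes the cna package; thresholds and proportions are illustrative.
library(cna)

full <- ct2df(full.ct(6))                    # all configurations of 6 binary factors
dgcs <- randomAsf(full.ct(6))                # random DGCS, e.g. "A*b + C*D <-> E"
ok   <- ct2df(selectCases(dgcs, full.ct(6))) # cases compatible with the DGCS
bad  <- full[!(do.call(paste, full) %in% do.call(paste, ok)), ]  # incompatible cases

## Fragmentation: remove a uniformly drawn 20-50% of the compatible configurations.
frag <- ok[sample(nrow(ok), ceiling(nrow(ok) * runif(1, 0.5, 0.8))), ]

## Draw 100 observations and replace a uniformly drawn 5-30% with noisy cases.
d <- frag[sample(nrow(frag), 100, replace = TRUE), ]
n_noise <- round(100 * runif(1, 0.05, 0.3))
d[sample(100, n_noise), ] <- bad[sample(nrow(bad), n_noise, replace = TRUE), ]

## Run CNA at given thresholds and record which returned models are correct,
## i.e. submodels of the DGCS.
models  <- asf(cna(d, con = 0.8, cov = 0.8))
correct <- is.submodel(models$condition, dgcs)
```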
To usefully interpret the results of the simulation experiment, it is important to consider not only the correctness of models, but also their complexity and degree of completeness. Correctness refers to the absence of false causal relevance ascriptions, and can thus be achieved more easily by making fewer causal relevance ascriptions, but this can also negatively impact the overall quality of the model by decreasing its ability to fully reflect the causal structure underlying the data. To account for this trade-off, the complexity of the models is recorded as well. Complexity is defined as the number of variable value appearances in the antecedent of the model. For example, model $A*b + C \leftrightarrow Y$ has complexity 3. The degree of completeness of a model, in turn, is its complexity in proportion to the complexity of the antecedent of its DGCS, as sketched below.
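A minimal sketch of the complexity count, assuming models are given as character strings in cna syntax; the helper below is my own, not a cna function:

```r
## Sketch: counting model complexity (number of variable value appearances
## in the antecedent). `complexity_of` is my own helper, not part of cna.
complexity_of <- function(model) {
  antecedent <- sub("\\s*<->.*$", "", model)      # drop "<-> outcome"
  length(gregexpr("[A-Za-z]+", antecedent)[[1]])  # count variable value tokens
}
complexity_of("A*b + C <-> Y")  # 3
## Degree of completeness: model complexity relative to DGCS complexity.
complexity_of("A*b <-> Y") / complexity_of("A*b + C*D <-> Y")  # 0.5
```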
5.2 Results
Figure 1 shows the average correctness per output of low contrapositive models (black bars) and high contrapositive models (grey bars). In some consistency-coverage-prevalence settings, only very few or no trials resulted in an output containing at least one high contrapositive model. Missing bars in Figures 1, 2, and 3 indicate settings and groups for which the mean is calculated over fewer than 20 trials; they do not represent means that are equal to 0. Standard errors of the means in Figure 1 are between 0 and 0.032 for high contrapositive models and between 0.0006 and 0.0015 for low contrapositive models. Varying prevalence levels are presented on the x-axes within the plots, and consistency thresholds and coverage thresholds are presented in different columns and rows, respectively.
Figure 1: Correctness means (missing bars indicate that the mean is based on fewer than 20 trials).
Figure 2: Complexity means (missing bars indicate that the mean is based on fewer than 20 trials).
Figure 3: Completeness means (missing bars indicate that the mean is based on fewer than 20 trials).
Clearly, high contrapositive models have a much higher correctness than low contrapositive models. When consistency is not much lower than prevalence, virtually all high contrapositive models are correct. Correctness of low contrapositive models reaches a peak at prevalence levels close to the consistency threshold and goes down again as prevalence increases above consistency. Strikingly, the same pattern is not found for high contrapositive models: whereas correctness goes up for these models as prevalence increases towards the consistency threshold, it does not go down again as prevalence increases above consistency. Note also that, while patterns in correctness and prevalence depend substantially on the consistency threshold, they are virtually independent of the coverage threshold.
Figure 2 shows that high contrapositive models have a lower complexity than low contrapositive models. By contrast, Figure 3 illustrates that high contrapositive models do not have a lower degree of completeness than low contrapositive models and that, at high consistency thresholds, the degree of completeness of high contrapositive models is even higher than that of low contrapositive models. This contrast is surprising, since both complexity and degree of completeness are measures of the number of variable value appearances in a model. Whereas complexity simply counts the number of variable value appearances in a model, degree of completeness measures this number in proportion to the number of variable value appearances in the DGCS. So, the observed contrast implies that high contrapositive models tend to be inferred from less complex DGCSs than low contrapositive models in this experiment, which is possible if, for datasets generated from less complex DGCSs, it is more likely that at least one high contrapositive model is inferred. Thus, high contrapositive models make on average fewer causal relevance ascriptions than low contrapositive models, without sacrificing the proportion of the corresponding DGCS that they are able to recover. Finally, Figure 4 presents the numbers of search trials in which at least one low contrapositive model is returned (black bars) and in which at least one high contrapositive model is returned (grey bars). For many settings, only a small number of CNA outputs contain any high contrapositive models.
Figure 4: Number of recorded results.
5.3 Discussion
The decrease in correctness for prevalence levels lower than consistency may be explained as follows: at lower prevalence, more conjuncts are needed to obtain a conjunction that reaches the required consistency threshold, leading to a higher number of causal relevance ascriptions, which makes it generally more likely that at least one of the causal relevance ascriptions is false. Indeed, Figure 2 illustrates that, as prevalence decreases below consistency, model complexity increases. That patterns in correctness and prevalence depend on consistency but are independent of coverage may, first, be explained by the order of application of consistency and coverage in CNA's model-building process: only after conjunctions reaching the consistency threshold have been built, are those conjunctions used for building disjunctions reaching the coverage threshold [16]. So, the range of possible models to be built is already severely limited by the consistency constraint before coverage comes into play, making coverage less influential than consistency in the model-building process. Second, as seen in Section 4, high consistency amounts to strong evidence for sufficiency when prevalence is low, and high coverage amounts to strong evidence for necessity when $|X|/N$ is low. Varying prevalence thus directly modulates the evidential value of consistency, but only indirectly that of coverage.
As argued for in Section 4, correctness decreases at prevalence levels higher than consistency because, for such settings, any variable value is likely to be accepted as sufficient for the outcome. Correctness of high contrapositive models does not decrease at prevalence levels higher than consistency because prevalence levels higher than the (contrapositive) consistency threshold do not increase the likelihood that models reach the contrapositive thresholds regardless of causal relevance. Nonetheless, the size of the difference in correctness between low and high contrapositive models, and the perfect or nearly perfect correctness scores for high contrapositive models when prevalence is at least as high as consistency, remain surprising. One additional explanation for the dramatic difference in correctness is a difference in complexity: for many settings, high contrapositive models make fewer than half as many causal relevance ascriptions as low contrapositive models. Still, as shown in Figure 3, high contrapositive models do not have a lower degree of completeness than low contrapositive models. High contrapositive models are less complex because, on average, less complex DGCSs lead to the discovery of high contrapositive models, not because high contrapositive models give a less complete picture of their corresponding DGCS.
5.4 Limitations
Before concluding, I discuss four limitations of the experiment, and one general potential limitation of contrapositive measures which can be evaluated in light of the results of the experiment. First, the noise in the experiment is generated by uniform random sampling from the cases incompatible with the DGCS. This sampling approach makes the experiment less representative for CCM analyses on datasets in which, due to systematic confounding, not all cases incompatible with the DGCS are contained with equal probability. It is expected that the performance of both the original measures, consistency and coverage, and the proposed measures, contrapositive consistency and contrapositive coverage, would be impaired by the presence of systematic confounding. However, since I do not claim that the proposed contrapositive measures would specifically mitigate the adverse effects of systematic confounding, and since there is no reason for suspecting that the performance of the proposed measures is impaired more by the presence of systematic confounding than the performance of the original measures, the use of uniformly distributed noise should not affect the interpretation of the results of the experiment.
Second, in this experiment, high prevalence levels are achieved by purposefully duplicating cases that have the desired outcome, amounting to sampling-induced prevalence variation. This is not the only way in which high prevalence can occur. For instance, prevalence may be high because the DGCS determines there to be many instances of the outcome, amounting to DGCS-induced prevalence variation. More research would be needed to find out whether the benefits of contrapositive measures for model quality are similar for DGCS-induced high prevalence. Still, as the prevalence of real-world datasets is typically sampling-induced to some extent, the findings for sampling-induced high prevalence presented in this article are in any case informative. Third, even though the simulation experiment was conducted using CNA, contrapositive measures should be equally applicable for model evaluation in QCA, the most prominent CCM. Fourth, the experiment only includes crisp-set CCM analyses. Nevertheless, as shown in Appendix D, the findings of this article should be extendable to multi-value and fuzzy-set CCM analyses.
In addition, a possible disadvantage of relying on contrapositive consistency in high prevalence scenarios is that in such scenarios, only relatively few cases, namely those with a negative outcome ($y$), enter into the computation of contrapositive consistency, so its value rests on a small number of observations and may be sensitive to chance variation. However, the consistently high correctness of high contrapositive models in the experiment suggests that this potential limitation does not undermine the usefulness of the measure for the prevalence levels studied here.
6 Concluding remarks
The large increase in correctness and the preservation of the degree of completeness for high contrapositive models entail that, when high contrapositive models are returned, these must be the models of choice. Nevertheless, as shown in Figure 4, only a small number of trials produce any high contrapositive models, implying that there are also many correct models that do not meet the contrapositive thresholds. So, the results of the experiment should not be taken to imply that all low contrapositive models must be discarded. For models that do not meet the contrapositive thresholds, contrapositive consistency and contrapositive coverage should serve as extra model evaluation tools in addition to existing tools such as consistency and coverage, theoretical knowledge and case knowledge [4] (p. 172), and model robustness [40].
The explanations provided in Section 4 allow researchers to correctly interpret contrapositive consistency and contrapositive coverage and to appropriately use them in CCM model evaluation. Contrapositive consistency evaluates whether $\neg Y$ is sufficient for $\neg X$, which is logically equivalent to the sufficiency of $X$ for $Y$; contrapositive coverage evaluates whether $\neg X$ is sufficient for $\neg Y$, which is logically equivalent to the necessity of $X$ for $Y$. Both measures thereby register the evidence provided by cases of type $x*y$, which consistency and coverage neglect.
The findings of this article yield another recommendation for CCM practitioners: in order to obtain correct models for high-prevalence datasets, it is advisable to set the consistency threshold slightly above prevalence when conducting analyses using conventional consistency and coverage, as Figures 1 and 3 show that such consistency thresholds are most likely to produce correct models with a relatively high degree of completeness. Of course, this recommendation should be weighed against other considerations, such as the suspected proportion of noise in the dataset and the general requirement for reasonably high consistency. Furthermore, if one is willing to increase the risk of false causal relevance ascriptions in exchange for a possible gain in degree of completeness, as is customary in the so-called SI-approach to QCA [41] (p. 1874), increasing consistency further above prevalence may still be the preferred strategy.
Finally, the finding that contrapositive measures are able to select correct models but that only a small number of models produced by CCMs meet these thresholds suggests two approaches for further improving CCMs. The first approach is to develop more fine-grained ways of incorporating the evidence taken into account by contrapositive consistency and contrapositive coverage in model evaluation, as simply imposing contrapositive thresholds set equal to the conventional consistency and coverage thresholds was shown to exclude too many correct models. Lowering the contrapositive thresholds is not an optimal strategy for achieving this, because there is no principled method for deciding to what extent these thresholds should be lowered. A more promising way forward is to develop new measures that combine consistency with contrapositive consistency (for evaluating sufficiency) or that combine coverage with contrapositive coverage (for evaluating necessity).
For instance, the harmonic mean of consistency and contrapositive consistency takes into account more information than either of these measures alone. Moreover, a weighted harmonic mean of consistency and contrapositive consistency makes it possible to assign more or less weight to consistency relative to contrapositive consistency, depending on prevalence or $|X|/N$.
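As an illustration, a weighted harmonic mean of this kind could be computed as follows; this is a sketch with an illustrative weighting parameter, not an established CCM measure:

```r
## Sketch: a weighted harmonic mean of consistency and contrapositive
## consistency. `w` is the weight placed on consistency; the weighting
## scheme is illustrative only.
wh_mean <- function(con, contra_con, w = 0.5) {
  1 / (w / con + (1 - w) / contra_con)
}
wh_mean(0.9, 0.6)           # ordinary harmonic mean: 0.72
wh_mean(0.9, 0.6, w = 0.3)  # weight shifted toward contrapositive consistency
```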
A second approach for improving CCMs would benefit from the development of this first approach. A plausible explanation as to why only very few models in the experiment meet the contrapositive thresholds is that contrapositive measures were only used for model selection in this experiment and not for model building. But CCMs, as currently implemented, only use conventional consistency and coverage for model building and neglect some evidence that contrapositive measures are sensitive to. In order to obtain CCM models with higher contrapositive consistency and contrapositive coverage, this neglected evidence should be taken into account at the stage of model building already. This can be achieved by using a measure that combines consistency and contrapositive consistency in place of consistency and a measure that combines coverage and contrapositive coverage in place of coverage during model building.
Acknowledgements
I thank Michael Baumgartner for his invaluable feedback during the development of this article and the underlying research. In addition, I am grateful to Mathias Ambühl for providing R functions for applying the contrapositive measures. Furthermore, I benefited from discussions with Veli-Pekka Parkkinen, Martyna Swiatczak, and Torfinn Huvenes and from their feedback on earlier drafts. I am also indebted to three anonymous referees for their fruitful comments, and I thank the participants in the 2nd International Conference on Current Issues in Coincidence Analysis, the European PhD Network in Philosophy Seminar, and the University of Bergen Philosophy PhD Seminar for their constructive feedback.
Funding information: This work was funded by the Research Council of Norway (grant number 326215).

Conflict of interest: The authors state no conflict of interest.

Data availability statement: The datasets generated and analyzed during the current study can be reproduced with the R scripts available at https://github.com/Luna-De-Souter/Evaluating-Boolean-relationships-in-CCMs.
Appendix A $|X|/N$ close to $|Y|/N$
This appendix shows that, if consistency and coverage are reasonably high, then $|X|/N$ is close to $|Y|/N$.

We have $\text{consistency} = |X*Y|/|X|$ and $\text{coverage} = |X*Y|/|Y|$.

From this, it follows that $|X| \cdot \text{consistency} = |X*Y| = |Y| \cdot \text{coverage}$.

This makes it possible to determine the bounds of the ratio $|X|/|Y|$: $|X|/|Y| = \text{coverage}/\text{consistency}$.

As an example, suppose that we set the minimum thresholds for consistency and coverage both at 0.7. Consistency and coverage can never be higher than 1, so $0.7 \le \text{coverage}/\text{consistency} \le 1/0.7$, and hence $0.7 \le |X|/|Y| \le 1.43$ (approximately).

Because $|X|/|Y| = (|X|/N)/(|Y|/N)$, the same bounds hold for the ratio of $|X|/N$ to $|Y|/N$. Hence, whenever consistency and coverage are reasonably high, $|X|/N$ is close to $|Y|/N$.
B Goertz’s measure of nontrivialness from fuzzy to crisp sets
This appendix presents the fuzzy-set formulation of Goertz’s measure of nontrivialness and shows a derivation of the crisp-set variant of this measure. The fuzzy-set measure is defined as follows [9] (p. 95):
Note that the summation only includes cases with
As the summation only includes cases with
C Relevance of Necessity from fuzzy to crisp sets
This appendix presents the fuzzy-set formulation of Schneider and Wagemann's Relevance of Necessity measure and shows a derivation of the crisp-set variant of this measure. The fuzzy-set measure is defined as follows [10] (p. 236):

$$\text{RoN} = \frac{\sum_i (1 - X_i)}{\sum_i \big(1 - \min(X_i, Y_i)\big)}.$$

In the crisp-set case, $\sum_i (1 - X_i) = |x|$ and $\sum_i \big(1 - \min(X_i, Y_i)\big) = N - |X*Y| = |x| + |X*y|$, so the crisp-set variant of RoN is $|x| / (|x| + |X*y|)$.
D Extension to multi-value and fuzzy set
D.1 Multi-value
This appendix shows that contrapositive measures can be applied to multi-value CCM models in the same way as they are applied to crisp-set models. Whereas crisp-set variables can take one of only two values for each case in the dataset, multi-value variables can take one of more than two values. Suppose that CCMs are used on a dataset with multi-value variables, and consider a model whose antecedent is a conjunction of variable values, such as $A{=}2 * B{=}1$, and whose outcome is a variable value, such as $C{=}3$.
Consistency can be formulated in the same way for this multi-value CCM model and dataset as for the crisp-set case:

$$\text{consistency} = \frac{|X*Y|}{|X|}.$$

Here, $|X|$ is the number of cases satisfying the antecedent (e.g. the cases in which $A$ takes the value 2 and $B$ the value 1), and $|X*Y|$ is the number of cases satisfying both the antecedent and the outcome.

Likewise, contrapositive consistency can be formulated in the same way for multi-value CCM models as for crisp-set CCM models:

$$\text{contrapositive consistency} = \frac{|x*y|}{|y|}.$$

Here, $y$ comprises the cases in which the outcome variable does not take the value specified in the model, and $x$ the cases in which the antecedent is not satisfied; for multi-value variables, not taking a specified value means taking any of the other possible values.
The formulation of coverage and contrapositive coverage for multi-value CCM models is analogous to the formulation of consistency and contrapositive consistency. So, contrapositive consistency and contrapositive coverage can be extended straight-forwardly from the crisp-set to the multi-value case.
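A minimal R sketch of these multi-value formulations, with a hypothetical data frame and the example antecedent $A{=}2 * B{=}1$ and outcome $C{=}3$:

```r
## Sketch: multi-value consistency and contrapositive consistency for the
## model A=2 * B=1 -> C=3, computed on a hypothetical data frame.
d <- data.frame(A = c(2, 2, 1, 3, 2),
                B = c(1, 1, 2, 1, 3),
                C = c(3, 3, 1, 2, 2))
ant <- d$A == 2 & d$B == 1  # antecedent satisfied (X)
out <- d$C == 3             # outcome takes the specified value (Y)

consistency        <- sum(ant & out) / sum(ant)    # |X*Y| / |X|
contra_consistency <- sum(!ant & !out) / sum(!out) # |x*y| / |y|
```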
D.2 Fuzzy set
This appendix shows how contrapositive measures can be applied to fuzzy-set CCM models. Fuzzy-set variables can take any real number between and including 0 and 1 as their value in each case of the dataset. Therefore, the classical Boolean operators conjunction, disjunction, negation, and implication are not directly applicable to fuzzy-set variables, introducing the need for fuzzy-logic variants of these Boolean operators. The fuzzy-logic operators standardly used in CCMs are as follows: conjunction is the minimum, $X*Y = \min(X, Y)$; disjunction is the maximum, $X + Y = \max(X, Y)$; negation is $x = 1 - X$; and the implication $X \rightarrow Y$ is satisfied by a case to the degree that its membership in $X$ does not exceed its membership in $Y$, being fully satisfied when $X_i \le Y_i$.
As seen in Section 4.1, the justification for using contrapositive consistency in addition to consistency to evaluate the sufficiency of $X$ for $Y$ is the logical equivalence of $X \rightarrow Y$ and $\neg Y \rightarrow \neg X$. This equivalence carries over to the fuzzy-logic operators just introduced: $X_i \le Y_i$ holds for a case if and only if $1 - Y_i \le 1 - X_i$ holds for it.
Hence, the justification for crisp-set contrapositive consistency is also applicable to its fuzzy-set variant. Fuzzy-set contrapositive consistency can be formulated analogously to crisp-set contrapositive consistency, yielding the following additional measure for evaluating the sufficiency of $X$ for $Y$:

$$\text{contrapositive consistency} = \frac{\sum_i \min(1 - X_i, 1 - Y_i)}{\sum_i (1 - Y_i)}.$$
Note that the reformulations of crisp-set consistency and contrapositive consistency given at the end of Section 4.1 to show that both measures penalize cases violating $X \rightarrow Y$ carry over to the fuzzy-set case: fuzzy-set consistency, $\sum_i \min(X_i, Y_i) / \sum_i X_i$, can be rewritten as $1 - \sum_i \max(X_i - Y_i, 0) / \sum_i X_i$.
Proof
The proof of this reformulation relies on the following equality: $\min(X_i, Y_i) = X_i - \max(X_i - Y_i, 0)$.
By subtracting $\max(X_i - Y_i, 0)$ from $X_i$, every case is discounted exactly by the degree to which its membership in $X$ exceeds its membership in $Y$, i.e. by the degree to which it violates $X \rightarrow Y$.
This makes it possible to reformulate fuzzy-set consistency as follows:

$$\text{consistency} = \frac{\sum_i \big(X_i - \max(X_i - Y_i, 0)\big)}{\sum_i X_i} = 1 - \frac{\sum_i \max(X_i - Y_i, 0)}{\sum_i X_i}.$$

So, fuzzy-set consistency penalizes cases with $X_i > Y_i$ in proportion to the margin $X_i - Y_i$, just as crisp-set consistency penalizes cases of type $X*y$. An analogous reformulation holds for fuzzy-set contrapositive consistency, with $1 - Y_i$ and $1 - X_i$ in place of $X_i$ and $Y_i$.
Until now, most or all arguments against rewarding cases with both low membership in $X$ and low membership in $Y$ have concerned problems specific to fuzzy-set data and do not affect the rationale of the contrapositive measures.
Consistency penalizes these cases in proportion to $\max(X_i - Y_i, 0)$ but never rewards them; fuzzy-set contrapositive consistency, by contrast, rewards them in proportion to $\min(1 - X_i, 1 - Y_i)$, thereby registering the evidence they carry for $X \rightarrow Y$.
The formulation of coverage and contrapositive coverage for fuzzy-set CCM models is analogous to the fuzzy-set formulations of consistency and contrapositive consistency. So, contrapositive consistency and contrapositive coverage can be extended from the crisp-set to the fuzzy-set case.
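A minimal R sketch of the fuzzy-set measures, using hypothetical membership scores and verifying the reformulation from the proof above:

```r
## Sketch: fuzzy-set consistency and contrapositive consistency, using the
## standard fuzzy operators (pmin for conjunction, 1 - x for negation).
## Membership scores are hypothetical.
X <- c(0.8, 0.6, 0.3, 0.1, 0.9)  # antecedent membership scores
Y <- c(0.9, 0.5, 0.4, 0.2, 0.7)  # outcome membership scores

consistency        <- sum(pmin(X, Y)) / sum(X)
contra_consistency <- sum(pmin(1 - X, 1 - Y)) / sum(1 - Y)

## The reformulation from the proof above gives the same value:
all.equal(consistency, 1 - sum(pmax(X - Y, 0)) / sum(X))  # TRUE
```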
References
[1] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. 2nd ed. Cambridge: The MIT Press; 2000. doi:10.7551/mitpress/1754.001.0001.
[2] Zhang J, Spirtes P. Detection of unfaithfulness and robust causal inference. Minds and Machines. 2008;18(2):239–71. doi:10.1007/s11023-008-9096-4.
[3] Spirtes P, Zhang J. A uniformly consistent estimator of causal effects under the k-triangle-faithfulness assumption. Stat Sci. 2014;29(4):662–78. doi:10.1214/13-STS429.
[4] Oana IE, Schneider CQ, Thomann E. Qualitative Comparative Analysis using R: a beginner's guide. Methods for social inquiry. Cambridge: Cambridge University Press; 2021. doi:10.1017/9781009006781.
[5] Ragin CC. Fuzzy-set social science. Chicago: University of Chicago Press; 2000.
[6] Ragin CC. Set relations in social research: evaluating their consistency and coverage. Political Analysis. 2006;14(3):291–310. doi:10.1093/pan/mpj019.
[7] Tharwat A. Classification assessment methods. Appl Comput Informatics. 2021;17(1):168–92. doi:10.1016/j.aci.2018.08.003.
[8] Glass DH. Confirmation measures of association rule interestingness. Knowl Based Syst. 2013;44:65–77. doi:10.1016/j.knosys.2013.01.021.
[9] Goertz G. Assessing the trivialness, relevance, and relative importance of necessary or sufficient conditions in social science. Stud Comp Int Dev. 2006;41(2):88–109. doi:10.1007/BF02686312.
[10] Schneider CQ, Wagemann C. Set-theoretic methods for the social sciences: a guide to Qualitative Comparative Analysis. Strategies for social inquiry. Cambridge: Cambridge University Press; 2012. doi:10.1017/CBO9781139004244.
[11] Rothman KJ. Causes. Am J Epidemiol. 1976;104(6):587–92. doi:10.1093/oxfordjournals.aje.a112335.
[12] Gerring J. Causation: a unified framework for the social sciences. J Theor Polit. 2005;17(2):163–98. doi:10.1177/0951629805050859.
[13] Hart HLA, Honoré T. Causation in the law. Oxford: Oxford University Press; 1985. doi:10.1093/acprof:oso/9780198254744.001.0001.
[14] Cordell HJ. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009;10(6):392–404. doi:10.1038/nrg2579.
[15] Culverhouse R, Suarez BK, Lin J, Reich T. A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet. 2002;70(2):461–71. doi:10.1086/338759.
[16] Ambuehl M, Baumgartner M. cna: Causal modeling with Coincidence Analysis. 2023. R package version 3.5.2. https://CRAN.R-project.org/package=cna.
[17] Brambor T, Clark WR, Golder M. Understanding interaction models: improving empirical analyses. Political Analysis. 2006;14(1):63–82. doi:10.1093/pan/mpi014.
[18] Mackie JL. The cement of the universe: a study of causation. Oxford: Clarendon Press; 1974.
[19] Graßhoff G, May M. Causal regularities. In: Spohn W, Ledwig M, Esfeld M, editors. Current issues in causation. Paderborn: Mentis; 2001. p. 85–114.
[20] Baumgartner M, Falk C. Boolean difference-making: a modern regularity theory of causation. Br J Philos Sci. 2023;74(1):171–97. doi:10.1093/bjps/axz047.
[21] Baumgartner M, Ambühl M. Causal modeling with multi-value and fuzzy-set Coincidence Analysis. Political Sci Res Methods. 2020;8(3):526–42. doi:10.1017/psrm.2018.45.
[22] Baumgartner M, Thiem A. Often trusted but never (properly) tested: evaluating Qualitative Comparative Analysis. Sociol Methods Res. 2020;49(2):279–311. doi:10.1177/0049124117701487.
[23] Kuhn M, Johnson K. Measuring performance in classification models. In: Applied predictive modeling. New York: Springer; 2013. p. 247–73. doi:10.1007/978-1-4614-6849-3_11.
[24] Siblini W, Fréry J, He-Guelton L, Oblé F, Wang YQ. Master your metrics with calibration. In: Berthold MR, Feelders A, Krempl G, editors. Advances in intelligent data analysis XVIII. Cham: Springer International Publishing; 2020. p. 457–69. doi:10.1007/978-3-030-44584-3_36.
[25] Powers DMW. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Int J Mach Learn Technol. 2011;2(4):37–63.
[26] Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432.
[27] Flach P, Kull M. Precision-recall-gain curves: PR analysis done right. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, editors. Advances in neural information processing systems. Vol. 28. New York: Curran Associates, Inc.; 2015.
[28] Koyejo OO, Natarajan N, Ravikumar PK, Dhillon IS. Consistent binary classification with generalized performance metrics. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, editors. Advances in neural information processing systems. Vol. 27. New York: Curran Associates, Inc.; 2014.
[29] Hempel CG. Studies in the logic of confirmation (I.). Mind. 1945;54(213):1–26. doi:10.1093/mind/LIV.213.1.
[30] Swiatczak MD. Towards a neo-configurational theory of intrinsic motivation. Motiv Emot. 2021;45(6):769–89. doi:10.1007/s11031-021-09906-1.
[31] Csikszentmihalyi M. Beyond boredom and anxiety. San Francisco: Jossey-Bass Publishers; 1975.
[32] Seawright J. Testing for necessary and/or sufficient causation: which cases are relevant? Political Analysis. 2002;10(2):178–93. doi:10.1093/pan/10.2.178.
[33] Clarke KA. The reverend and the ravens: comment on Seawright. Political Analysis. 2002;10(2):194–7. doi:10.1093/pan/10.2.194.
[34] Braumoeller BF, Goertz G. Watching your posterior: comment on Seawright. Political Analysis. 2002;10(2):198–203. doi:10.1093/pan/10.2.198.
[35] Haesebrouck T. Pitfalls in QCA's consistency measure. J Comp Polit. 2015;8(2):65–80.
[36] Stoklasa J, Luukka P, Talášek T. Set-theoretic methodology using fuzzy sets in rule extraction and validation: consistency and coverage revisited. Inform Sci. 2017;412–413:154–73. doi:10.1016/j.ins.2017.05.042.
[37] Veri F. Coverage in fuzzy set Qualitative Comparative Analysis (fsQCA): a new fuzzy proposition for describing empirical relevance. Comp Sociol. 2018;17(2):133–58. doi:10.1163/15691330-12341457.
[38] Veri F. Aggregation bias and ambivalent cases: a new parameter of consistency to understand the significance of set-theoretic sufficiency in fsQCA. Comp Sociol. 2019;18(2):229–55. doi:10.1163/15691330-12341496.
[39] Mendel JM, Ragin CC. fsQCA: dialog between Jerry M. Mendel and Charles C. Ragin. USC-SIPI Report 411. 2nd ed. 2012. https://ssrn.com/abstract=2517966.
[40] Parkkinen VP, Baumgartner M. Robustness and model selection in configurational causal modeling. Sociol Methods Res. 2023;52(1):176–208. doi:10.1177/0049124120986200.
[41] Haesebrouck T, Thomann E. Introduction: causation, inferences, and solution types in Configurational Comparative Methods. Qual Quant. 2022;56:1867–88. doi:10.1007/s11135-021-01209-4.
© 2024 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.