Article, Open Access

Clinical vs. statistical significance: considerations for clinical laboratories

  • Hamit Hakan Alp, Mai Thi Chi Tran, Corey Markus, Chung Shun Ho, Tze Ping Loh, Rosita Zakaria, Brian R. Cooke, Elvar Theodorsson and Ronda F. Greaves
Published/Copyright: April 8, 2025

Abstract

Amongst the main perspectives when evaluating the results of medical studies are statistical significance (following formal statistical testing) and clinical significance. While statistical significance shows that a factor's observed effect on the study results is unlikely (for a given alpha) to be due to chance, effect size shows whether the factor's effect is substantial enough to be clinically useful. The essence of statistical significance is "negative" - that the effect of a factor under study probably did not happen by chance. In contrast, effect size and clinical significance evaluate whether a clinically "positive" effect of a factor is effective and cost-effective. Medical diagnoses and treatments should never be based on the results of a single study. Results from numerous well-designed studies performed in different circumstances are needed, focusing on the magnitude of the effects observed and their relevance to the medical matters being studied rather than on the p-values. This paper discusses statistical inference and its relevance to the clinical importance of quantitative testing in clinical laboratories. To achieve this, we first pose questions focusing on fundamental statistical concepts and their relationship to clinical significance. The paper also aims to provide examples of using the methodological approaches of superiority, equivalence, non-inferiority, and inferiority studies in clinical laboratories, which can be used in evidence-based decision-making processes by laboratory professionals.

Introduction

Statistical theory and analysis are firmly embedded in daily clinical laboratory practice. Examples include the evaluation of measuring system performance using reference materials and internal quality control materials by estimating central tendency (mean) and dispersion (standard deviation and/or coefficient of variation). Additionally, further statistical tests may be applied when changing measurement procedures or modifying method parameters, such as t-tests with associated p-values or confidence intervals (CIs) around the regression coefficients for slope and intercept from a comparison study. Moreover, laboratories may also undertake specific investigations with study designs that are purely hypothesis-driven. Each of these examples is grounded in probability theory, and the challenge of linking statistically significant findings to clinical significance remains for many clinical laboratories.

This paper discusses statistical inference and its relevance to the clinical importance of quantitative testing in clinical laboratories. To achieve this, we first pose questions focusing on fundamental statistical concepts and their relationship to clinical significance. The paper also aims to provide examples of using the methodological approaches of superiority, equivalence, non-inferiority, and inferiority studies in clinical laboratories, which can be used in evidence-based decision-making processes by laboratory professionals.

Theoretical and historical background of statistical inference

Question 1: What are the common factors and variables in clinical laboratories?

Variables in clinical laboratories consist of the results of examinations [1] (qualitative) or measurements [2] (quantitative). Multiple factors, including biological/biochemical factors and confounders, may influence the variability in outcomes. Outcome variables are used to examine or measure the effects of different factors. Factors may also be clinical, e.g., a decision to measure plasma troponin for diagnosis, prognosis, and treatment effects on myocardial infarction. In laboratory studies, scientists typically control one factor in isolation or several factors simultaneously in multifactorial designs to investigate whether statistically supported evidence indicates that the factor(s) under investigation influence patient or clinical results (the variables).

Discussion

The effect of a factor on a variable is statistically significant when the observed effect is unlikely to have occurred by random chance alone. In contrast, examination or measurement results are considered clinically significant when the average effect is substantial enough to be fit for the intended use, cost-effective, and/or favored by the patients – each of which is strongly influenced by evidence-based medicine [3].

Statistical methods assess the effects of factors on variables and estimate the probability that any observed effects are due to chance in hypothesis testing (e.g., t-tests, multivariate analysis). The primary hypothesis that is tested, termed the null hypothesis, assumes there is “no effect” or “no difference” beyond what is expected by chance. Importantly, statistical significance depends on three interrelated conditions:

  1. Sample size: Larger sample sizes reduce the standard error of the mean (which scales as σ/√n), enhancing the detection of statistically significant changes.

  2. Variability (imprecision) in the variable(s): The smaller the variability, the easier it is to demonstrate statistical significance.

  3. Effect size (the mean/median difference in variable values between the groups): Larger differences in variable values between groups (effect size) make statistical significance easier to demonstrate. The interplay of these three conditions is illustrated in the sketch below.
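To make this interplay concrete, the following minimal Python sketch (an illustration only; numpy and scipy are assumed to be available, and the function name and numbers are hypothetical rather than taken from any cited study) simulates two groups whose true means differ by a fixed amount and shows how the t-test p-value responds to sample size and imprecision while the underlying effect stays constant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def p_value_two_groups(n, sd, effect):
    """Simulate two groups whose true means differ by `effect`; return the t-test p-value."""
    a = rng.normal(loc=0.0, scale=sd, size=n)
    b = rng.normal(loc=effect, scale=sd, size=n)
    return stats.ttest_ind(a, b).pvalue

# The same modest effect becomes "statistically significant" once n grows
# or the imprecision shrinks, even though the effect itself is unchanged.
print(p_value_two_groups(n=10, sd=1.0, effect=0.2))    # typically p > 0.05
print(p_value_two_groups(n=1000, sd=1.0, effect=0.2))  # typically p < 0.05
print(p_value_two_groups(n=100, sd=0.2, effect=0.2))   # typically p < 0.05
```

The same 0.2-unit effect moves from non-significant to significant purely through a larger n or a smaller imprecision, which is why a p-value alone says nothing about clinical relevance.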

Question 2: Is the null hypothesis significance testing (NHST) paradigm relevant for method evaluation studies in clinical laboratories?

The Null Hypothesis Significance Testing (NHST) paradigm is a widely used statistical method in clinical studies to assess whether sufficient evidence supports a scientific claim or treatment effect [4]. In NHST, scientists start with a null hypothesis (H0), which typically states no effect or difference between groups. Conversely, an alternative hypothesis (Ha) proposes that there is an effect or difference between groups. When data from clinical studies are analyzed, a p-value is calculated to estimate the probability of obtaining results at least as extreme as those observed if the null hypothesis is true. A small p-value (typically below a threshold such as 0.05) suggests that the null hypothesis can be rejected in favor of the alternative. In clinical trials, NHST helps determine whether a new treatment is statistically more effective than a conventional one or whether a biomarker is significantly associated with a diagnosis/prognosis. However, NHST is not always used in method comparison studies because it focuses on finding statistically significant differences, which may not be clinically meaningful. Instead, approaches like Bland-Altman plots or regression analyses are preferred as they assess how well the two methods agree and whether any differences matter in practice [4], [5], [6].

Discussion

In the best of worlds, the philosophy of science and the methods of inferential statistics would provide tools to "positively" establish the probability that a particular scientific hypothesis is true. However, no such philosophy or corresponding statistical tools are available. The currently favored method of statistical inference is the NHST paradigm, which does the "negative" opposite: it poses a no-effect "null hypothesis" and tests whether the observed data conform. With complex origins, NHST was first introduced as a significance test by William Sealy Gosset [7] and Ronald Fisher [8], [9]. This was followed by the work of Jerzy Neyman [10], Egon Pearson, and Abraham Wald [11], who introduced tests of acceptance, incorporating concepts of alpha and beta error and decision functions. The NHST paradigm currently represents an unstandardized combination of these approaches [12], [13], [14], [15], [16]. Hence, whilst NHST is commonly encountered in clinical research, particularly in trials comparing interventions between two patient populations, it is not always considered part of method comparison studies in clinical laboratories.

The relevance of NHST in method evaluation studies within clinical laboratories depends on the study’s objectives. These studies (e.g., comparing a new diagnostic test to a reference method) prioritize agreement and bias assessment rather than testing for statistically significant differences. In other words, method evaluation focuses on agreement, not just differences, making NHST less relevant in many cases.

A p-value alone does not assess clinical acceptability. A p-value below the conventional threshold (e.g., p<0.05) may indicate a statistically significant difference that is not clinically significant. Instead, alternative statistical approaches are preferred. For example, Bland-Altman analysis assesses agreement by analyzing bias and limits of agreement rather than relying on p-values [17]. Also, regression analysis (e.g., Deming or Passing-Bablok regression) helps evaluate bias between methods [18], [19], [20]. Additionally, correlation study results may also be misleading in method comparison studies as a correlation does not demonstrate equivalence. NHST can be misleading as statistical testing relies on an a priori sample size estimation, calibrated to study power. A large sample size may detect tiny, clinically irrelevant differences. In contrast, a small sample size may fail to detect meaningful discrepancies, leading to a false assumption of agreement between methods. Similarly, relying on p-values has limitations, such as the potential for false-positive results and overlooking clinical relevance, highlighting the importance of combining NHST with CIs and effect size measures [4], [6]. However, NHST remains appropriate in comparative studies where explicit differences between methods are tested (e.g., testing mean differences using t-tests or analysis of variance) or when a formal hypothesis about method performance is required (e.g., when a new assay must demonstrate equivalence to an existing standard within predefined limits).

While power calculations are rarely emphasized in laboratory practice, sample size considerations are routinely discussed. Similarly, p-values and CIs derived from probability functions are routinely used to support statistical interpretation. These estimates help determine whether an observed effect is unlikely due to random chance alone.

There are variations in the criteria used to determine statistical significance. For example, when testing the null hypothesis, conventional thresholds such as p<0.05 or p<0.01, or CIs of 95 % or 99 %, are commonly used. Additionally, power calculations are used to determine the minimum sample size required to achieve 80 % power, minimizing the risk of failing to reject a false null hypothesis (a Type II error), and suggested sample sizes are referenced in several method evaluation guidelines. These concepts are directly relevant to method evaluation studies. Although the null hypothesis is not explicitly referred to in such studies, having a priori acceptance or rejection criteria is considered best practice for all evaluation studies.

Question 3: What was the original intention of the p-value?

Fisher’s original contribution to the NHST paradigm was the development of methods for computing the probability of observing a result at least as extreme as a given test statistic, assuming that a null hypothesis of no effect is true. He also introduced the thresholds of p<0.05 and p<0.01. However, Fisher emphasized the importance of interpreting larger or smaller p-values in the context of the research hypothesis and study design. He advocated using test statistics and p-values to decide whether repeated or additional experiments should be performed. He emphasized that no single experiment is sufficient to establish a scientific claim.

Discussion

Misunderstandings surrounding p-values are widespread, and even Fisher could not clearly explain the relation between p-values and valid scientific conclusions. He proposed an informal system in which the p-value serves as a rough guide to the strength of evidence against the null hypothesis. Fisher also introduced the term "significant" to describe small p-values as results worthy of attention. Still, he did not use concepts such as "rejection of hypotheses," "power," or "error rates." p-Values were initially intended as a flexible and evidential tool to be used within the specific context of a given problem.

Question 4: What are common misconceptions of p-values?

There are several common misinterpretations of p-values and statistical significance, which include but are not limited to the following points:

  1. Statistical significance is not a measure of clinical significance. The null hypothesis is the claim that the effect of the factor being studied does not exist. If the evidence indicates that the null hypothesis is true, any experimentally observed effect is due to chance alone. The rejection of the null hypothesis lends probabilistic support to the opposed “alternative hypothesis” but can never prove it.

  2. p-Values are not measures of effect size. It is commonly, and wrongly, assumed that the effect must be large if the p-value is small. This is not necessarily the case, since the p-value results from the combination of effect size, sample size (n), and variation/imprecision. Therefore, if the effect is small but n is large and the imprecision is slight, p will be small and statistically significant. Effect sizes are optimally reported with a measure of imprecision, preferably by CIs.

  3. A non-significant result does not make the null effect the most likely. A non-significant difference means that the null hypothesis of no effect is statistically consistent with the observed results and the interval of effects included in the CI. This subtle yet crucial point is often overlooked due to insufficient statistical power.

  4. A p-value is not a measure of replicability. A statistically significant result at, for example, p<0.01 does not mean that there is a 99 % chance the results would be replicated if the experiment were repeated, as has been evident in the current crisis of non-replicability of scientific studies [21], [22]. Using better-justified alpha levels could improve statistical inferences and increase the efficiency and informativeness of scientific research.

  5. Statistical inference is "only" a model of reality. Statistical inference aims to model the results of actual scientific studies. Research designs may be flawed, including, for example, bias, lack of control, and lack of randomization. Therefore, statistical models do not always reflect reality [23], [24].

  6. "If p<0.05, the null hypothesis has less than a 5 % chance of being true" is a false conclusion. When the p-value is calculated in the NHST paradigm, the null hypothesis is assumed to be true; it therefore cannot simultaneously be assigned a probability of being false [14]. What can be said is that, if the null hypothesis is true, a difference at least as large as the one observed would occur by chance alone less than 5 % of the time.

  7. "p>0.05 means that there is no difference between the groups" is a false conclusion. A non-significant difference means that the null hypothesis, together with all the other effect sizes contained in the confidence interval, remains compatible with the data. The effect observed in the study is the most probable one given the results, not the absence of a difference [14].

The most prevalent and severe misconception of the p-value is the belief that it alone can determine the probability of an erroneous conclusion from a single experiment without considering any supporting evidence or the plausibility of the underlying mechanisms. This misconception is equivalent to claiming that the magnitude of the effect is irrelevant and that the current experimental results represent the only relevant evidence for the scientific conclusion, taken directly from the statistical results. However, the diverse founders of NHST would likely all agree that the evidence from any study must be tested in other similar studies and combined with the results of prior studies and other relevant evidence to generate a conclusion. Unfortunately, this understanding seems to have been lost.

Question 5: What should clinical laboratories be aware of to ensure the appropriate use and interpretation of p-values?

Prompted by a growing concern about the misuse and misinterpretation of p-values, the American Statistical Association (ASA) issued the following six statements regarding statistical significance and p-values in 2016 [25]:

  1. “p-Value shows the extent of data incompatibility with the stated statistical model.

  2. p-Value is neither the measure of the probability of the studied hypothesis being true nor the representation of the probability that study data were produced by random chance alone.

  3. Any business model, policy decision, or conclusion related to a scientific study or experiment should not be based merely on whether the p-value passes a specific threshold.

  4. It is the moral duty of the authors and scientists to report the research or experimental findings to their full extent and with transparency.

  5. A p-value neither represents the importance of research results nor is it the representation of the effect size of the study.

  6. p-Value does not give a sufficient measure of evidence regarding a model or “hypothesis”.”

Discussion

While the p-value remains relevant, particularly for initial assessments, it should not be the sole basis for decision-making in method evaluation studies. This principle is often under-appreciated in clinical laboratories, where method evaluations are sometimes regarded as “once-and-done” exercises. In reality, method evaluation is an iterative process that provides an early snapshot of performance rather than a definitive conclusion.

The most reliable evidence for method evaluation emerges from numerous well-designed studies performed under different circumstances. Combining these studies through a well-conducted meta-analysis offers the highest level of evidence. Furthermore, the magnitude and clinical relevance of the observed effects should take precedence over p-values. When evaluating clinical laboratory methods, clinical context and practical implications are far more meaningful than strictly adhering to statistical thresholds. Methods should be assessed based on Analytical Performance Specifications (APS) that reflect real-world patient needs, using biologically meaningful thresholds rather than arbitrary statistical cutoffs. Sensitivity, specificity, and agreement metrics, such as Bland-Altman plots or Passing-Bablok regression, provide valuable insights into a method's diagnostic accuracy [17], [19], [20]. Furthermore, methods should add value to patient care without being unnecessarily expensive or difficult to implement. Cost-effectiveness should also be evaluated to ensure methods improve patient outcomes efficiently. Method validation should be performed across diverse patient populations, and clinical experts should be involved in interpreting results. Continuous evaluation of methods post-implementation ensures they remain relevant and effective in clinical settings.

Importantly, the laboratory must assess the normality of the data distribution to determine the appropriate statistical method for estimating a p-value. Often, laboratory data from method comparison studies are skewed, and the use of appropriate non-parametric tests is required.

Question 6: What is the confidence interval (CI), and how is it used for method evaluation studies?

A CI is an interval around a statistical estimate that quantifies the uncertainty associated with that estimate. It represents the interval within which the true value of the parameter is expected to fall, given a specified level of confidence (commonly 95 %, denoted as CI 0.95). In method evaluation studies, CIs assess whether calculated statistical results suggest no difference or indicate a significant difference. Common applications include:

  1. Bland-Altman analysis: The CI of the mean difference is examined to determine whether it includes zero. If it does, this supports no systematic bias between the two methods. The CI of the limits of agreement helps assess the imprecision and potential range of disagreement between methods [17].

  2. Regression analysis (e.g., Passing-Bablok regression): For the slope, if the CI includes 1, it suggests proportional equivalence between methods. If the CI for the intercept includes 0, this supports no constant bias between methods. A statistical difference is inferred if the CI does not include zero (for intercept) or one (for slope). It then becomes critical to evaluate whether this difference is clinically significant, as statistical significance does not necessarily imply clinical relevance [19], [20]. A minimal computational sketch of both checks follows this list.
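As a hedged illustration of the two checks above (entirely synthetic data; ordinary least squares stands in for Passing-Bablok or Deming regression, which require dedicated implementations), the following Python sketch computes the CI of the Bland-Altman mean difference and the CIs of the regression slope and intercept. It assumes numpy and scipy ≥ 1.6 (for `intercept_stderr`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_vals = rng.uniform(4.0, 10.0, size=40)             # hypothetical analyte levels
method_a = true_vals + rng.normal(0.0, 0.15, size=40)   # existing method
method_b = true_vals + rng.normal(0.05, 0.15, size=40)  # candidate method

# (1) Bland-Altman: does the 95 % CI of the mean difference include zero?
d = method_b - method_a
t_ba = stats.t.ppf(0.975, df=d.size - 1)
half = t_ba * d.std(ddof=1) / np.sqrt(d.size)
ci = (d.mean() - half, d.mean() + half)
print("bias CI:", ci, "-> includes 0:", ci[0] <= 0 <= ci[1])

# (2) Regression: do the slope and intercept CIs include 1 and 0?
res = stats.linregress(method_a, method_b)
t_reg = stats.t.ppf(0.975, df=method_a.size - 2)
slope_ci = (res.slope - t_reg * res.stderr, res.slope + t_reg * res.stderr)
icept_ci = (res.intercept - t_reg * res.intercept_stderr,
            res.intercept + t_reg * res.intercept_stderr)
print("slope CI:", slope_ci, "-> includes 1:", slope_ci[0] <= 1 <= slope_ci[1])
print("intercept CI:", icept_ci, "-> includes 0:", icept_ci[0] <= 0 <= icept_ci[1])
```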

Discussion

A CI is a statistical tool that quantifies the uncertainty associated with an estimate, offering a more nuanced interpretation than a single-point estimate. In method evaluation studies, CIs are critical for determining whether observed differences between methods are statistically significant. However, even when statistical differences are detected (e.g., CIs excluding 1 for slope or 0 for intercept), it is crucial to assess their clinical significance. Statistical significance alone does not imply clinical relevance, underscoring the importance of contextual interpretation in the clinical laboratory.

Inferences made in research studies often rely on statistics such as p-values and CIs to assess whether an observed effect or difference is likely due to random chance alone and to indicate the range of possible values within which the true parameter may fall. While they are methodologically related, they remain distinct concepts providing different insights. The p-value does not measure effect size; it only estimates the probability of obtaining results as extreme as those observed, assuming that the null hypothesis is correct. By convention, statistical significance is based on a priori, arbitrarily chosen thresholds such as 0.05. The use of 0.05 traces back to Fisher's 1925 publication, in which observations greater than two standard deviations are formally regarded as significant, a criterion under which a chance result would prompt unnecessary follow-up about once in every 22 trials/studies [26]. In contrast, a CI provides an estimated range within which the true population parameter (e.g., mean difference, odds ratio) is expected to fall based on the sampled data. CIs provide guidance on the magnitude of an effect and its precision: a narrow CI suggests greater precision, while a wider CI indicates higher uncertainty. Clinical interpretation is better assessed using CIs since they help determine whether an observed difference is meaningful in practice. As general guidance on the theoretical relationship between the two: where a 95 % CI excludes zero (for differences) or 1 (for ratios), the p-value will likely be <0.05, indicating statistical significance; conversely, where the 95 % CI includes the null hypothesis value (either 0 or 1, depending on the application), the p-value will be greater than 0.05 and the result not statistically significant. The best practice when reporting p-values and CIs is to include both values and avoid over-reliance on arbitrary thresholds, ensuring a more comprehensive and meaningful interpretation of results. Achieving statistical significance should not be the sole focus of laboratory scientists; the size of the effect, ratio, or difference should be critically examined through the lens of clinical significance.
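The p-value/CI correspondence described above can be demonstrated directly. The following minimal Python sketch (a hypothetical example; numpy and a recent scipy, ≥ 1.10 for `confidence_interval`, are assumed) tests simulated paired differences against zero and shows that the 95 % CI excludes zero exactly when p<0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
diffs = rng.normal(0.1, 0.3, size=30)   # hypothetical paired differences

res = stats.ttest_1samp(diffs, popmean=0.0)
lo, hi = res.confidence_interval(confidence_level=0.95)
print(f"p = {res.pvalue:.4f}; 95 % CI = ({lo:.3f}, {hi:.3f})")
print("CI excludes 0:", not (lo <= 0 <= hi), "| p < 0.05:", res.pvalue < 0.05)
```

The two printed booleans always agree for this two-sided test, while only the CI conveys the magnitude and precision of the difference.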

Question 7: How is clinical significance tested?

(a) No criteria can “positively” establish clinical significance

Hume [27] and later, for example, Popper [28] showed that no criteria, not even criteria of excellent research quality, can "positively" establish scientific causation, including clinical significance. Therefore, reliance on "negative" probabilistic inductive inference using statistical tests [29] is necessary to avoid false conclusions in studies of clinical significance, as elucidated above. However, scientific studies in general, and clinical studies in particular, are not equal regarding the quality of scientific evidence and in contributing to the foundations of establishing causation. Early attempts to improve the design and evaluation of clinical studies included the "nine viewpoints" published by Hill, later called the "Hill criteria for causation" [30]. The importance of rational study design, evaluation of results, and decision-making by field experts working with statistical expertise has been emphasized in medicine and other fields of knowledge crucial for society [31].

Clinical studies ultimately evaluate the clinical significance of measurement results, studying the effects of the examination or measurement results on patient outcomes. The term "clinically significant" refers to the results of studies that assess the medical effects, cost-effectiveness, and patient value of measurement results in patient populations [32]. Reporting guidelines for controlled clinical studies and meta-analyses have been developed to encourage authors and journals to include the required elements of high-quality clinical studies, e.g., the CONSORT guidelines for reporting randomized controlled trial data [33], [34], [35], [36], the QUOROM guidelines for reporting systematic reviews and meta-analyses [37], and the STROBE guidelines for reporting observational studies in epidemiology [38], [39]. In the clinical laboratory, the STARD Initiative [40], [41], [42] and the REMARK recommendations [43] are two important frameworks developed to improve the quality and transparency of scientific reporting in specific research areas.

(b) The evaluation of biomarkers

International and national regulators are decisive in permitting the marketing of measuring systems. Clinical research performed by laboratory professionals is the cornerstone of evaluating the fitness of biomarkers for their intended use, a focus since the dawn of clinical laboratories [44], [45]. This evaluation framework, known as analytical performance specifications (APS), translates patient-related quality measures into clinically meaningful criteria [46]. In their current form – the Milan criteria [47] – APS primarily include measures of biological variation and the effect of measurement results on clinical outcomes [48], [49]. Several influential publications by Bossuyt, Lijmer, and others at the beginning of the 21st century detailed how clinical studies of biomarkers should be performed [50], [51], [52], [53], [54], [55], [56], drawing on experience in epidemiology and general evidence-based medicine, which substantially influenced APS development.

(c) Level of evidence when establishing clinical significance

Prospective studies that use systematic randomization and include appropriate controls carry more evidential weight than studies that do not apply these principles (Table 1). The following ranking system was initially proposed to assess the strength of evidence in clinical studies [57], [58].

Several publications [59], [60], [61] and books [62], [63] have subsequently elucidated this field of evidence-based medicine, including the GRADE system [64], [65].

The CONSORT guidelines emphasize the use of CIs over p-values. Since CIs depict the imprecision of the estimates on which inferences are based, they also reveal differences that do not meet conventional statistical significance levels; even such differences may be clinically relevant [66].

(d) The Fryback and Thornbury model (FT-model)

In 1991, Fryback and Thornbury published a six-level hierarchical model that described the medical efficacy of diagnostic imaging [67] (Table 2). It is also well-suited for other examinations and measurements, including in the clinical laboratory. The model combines metrological characteristics and diagnostic properties with the effects on patients and society.

It is a tall order for measurement and examination results to fulfill all six Fryback and Thornbury model levels. When new pharmaceuticals are tested, they are expected to fulfill levels four and five, and sometimes level six, but not all six. Measurement results are commonly used for several indications, making their evaluation especially demanding [68].

Question 8: What are superiority, equivalence, non-inferiority, and inferiority study designs?

As mentioned above, NHST is commonly used in clinical research, particularly in trials comparing interventions between two patient populations. However, in clinical laboratories, it is not always considered standard practice. As a result, superiority, equivalence, inferiority, and non-inferiority estimations are not routinely applied in all clinical laboratory evaluations. Nevertheless, regulatory agencies such as the Food and Drug Administration (FDA) provide detailed guidance on comparing new treatment methods against placebos or standard therapeutic methods. A similar structured approach may also be employed for evaluating laboratory test methods, ensuring robust and standardized validation procedures [69]. In this context, the new treatment in randomized controlled trials (RCTs) can be conceptually translated into laboratory medicine as the new method. In contrast, the placebo or standard treatment can be considered analogous to the existing or standard method in clinical laboratories.

Discussion

The theoretical basis of such designs is a further development of the tests of acceptance introduced by Jerzy Neyman, Egon Pearson, and Abraham Wald, as mentioned above [70], [71]. Essentially, the four types of study designs have different a priori premises [72]:

  1. Superiority studies – designed to establish that a new method or measurement procedure is statistically and clinically better than an existing one

  2. Equivalence studies – designed to verify that a new method performs similarly to an existing method within certain predefined limits

  3. Inferiority studies – designed to determine if the effect of a new method is statistically significantly worse than a reference method or an existing method

  4. Non-inferiority studies – designed to determine if the performance of a new method is not significantly worse than that of a reference method or an existing method within a predefined acceptable margin

In method comparison studies, it is challenging to state definitively that one method is statistically superior to another. Often, due to the nature of classical statistical analyses used in these studies, the focus is on demonstrating that a new method is equivalent to an established one rather than proving it superior. This type of investigation is known as an equivalence study. Therefore, a different analytical approach, such as a superiority study, is required to determine if one method is better.

Superiority, equivalence, inferiority, and non-inferiority estimations can be performed from a statistical or clinical perspective. Clinical evaluation of new measurement procedures against established or reference measurement procedures is essential. Clinical studies must be performed accurately, at low cost, and with the minimum number of samples or subjects. For this reason, power analyses, including sample size estimation, are essential to avoid both Type I errors (false positive results) [73] and Type II errors (false negative results) [74]. Notably, a statistically significant result in a single study does not necessarily mean that the results are clinically significant. Generalization of studies is also an essential factor, and studies should be extended and performed in different environments and contexts to determine whether one treatment is clinically significantly different from another [72].

Question 9: How do the hypotheses differ between superiority, equivalence, inferiority, and non-inferiority estimations?

Table 3 compares typical hypothesis structures for superiority, equivalence, inferiority, and non-inferiority estimations applied to the clinical laboratory. The type of study to choose will depend on the methodological question that the clinical laboratory is trying to answer.

Table 1:

The hierarchy of evidence from clinical studies as published by Greenhalgh [57].

Level of evidence Evidence from
I Systematic reviews and meta-analyses
II Randomized controlled trials with definitive results (confidence intervals that do not overlap the threshold clinically significant effect)
III Randomized controlled trials with non-definitive results (a point estimate that suggests a clinically significant effect but with confidence intervals overlapping the threshold for this effect)
IV Cohort studies
V Case-control studies
VI Cross-sectional surveys
VII Case reports
Table 2:

The Fryback and Thornbury hierarchical model of test efficacy expressed for the clinical laboratory [75].

Level General characteristics of the diagnostic test Properties of the test
1 Technical efficacy
  1. The selectivity of the measurement or examination results

2 Diagnostic accuracy efficacy
  1. The sensitivity and specificity of the measurement or examination results

  2. The area under the ROC curve

  3. The positive and negative predictive values of the measurement or examination results

3 Diagnostic thinking efficacy
  1. Do the measurement or examination results aid in diagnosing and monitoring treatment effects?

  2. Do the measurement or examination results influence the pretest estimate of the probability of a specific disease or the evaluation of disease recurrence?

4 Therapeutic efficacy
  1. Do the measurement or examination results aid treatments or treatment plans?

5 Patient outcome efficacy
  1. Are the measurement or examination results of subjective or objective benefit to the patients?

6 Societal efficacy
  1. The cost-effectiveness of the measurement or examination results for the patient and/or society

Table 3:

Comparison of typical hypotheses for superiority, equivalence, inferiority, or non-inferiority estimations in clinical laboratories.

Study type Null hypothesis (H0) Alternative hypothesis (H1)
Superiority The new method is not better than the existing method. This is usually expressed as the effectiveness of the new method being equal to or worse than the existing method, i.e., H0: µ1 − µ0 ≤ δ The new method is better than the existing method. This means that the new method is statistically significantly superior to the existing method in terms of accuracy, efficacy, or another relevant measure, i.e., H1: µ1 − µ0 > δ
Equivalence The difference between the new and existing methods exceeds or equals the acceptable margin, δ. This can be expressed as the new method being significantly worse or significantly better than the existing method beyond the acceptable margin of equivalence, i.e., H0: |µ1 − µ0| ≥ δ The difference between the new method and the existing method is less than the predefined margin, δ; this shows that the performance of the new method is equivalent to the existing method, i.e., H1: |µ1 − µ0| < δ
Inferiority The new method is not worse than the existing method. This hypothesis states that the effect of the new method is equal to or better than the existing method within a predetermined margin, i.e., H0: µ1 ≥ µ0 The new method is worse than the existing one and is beyond the acceptable margin. This implies that the new method has significantly worse outcomes compared to the control, thus confirming its inferiority, i.e., H1: µ1 < µ0
Non-inferiority The new method is worse than the existing method by more than the predefined non-inferiority margin. This hypothesis assumes that the difference between the methods exceeds the acceptable threshold, i.e., H0: µ1 ≤ µ0 − δ The new method is not worse than the existing method within the predetermined non-inferiority margin. This indicates that the new method has comparable or acceptable outcomes relative to the existing method, i.e., H1: µ1 > µ0 − δ

To choose between these study designs, it is worth considering the theoretical basis of each.

Superiority studies can be used when introducing a new method to improve accuracy, efficiency, or diagnostic performance in laboratory medicine. In clinical trials, superiority studies focus on proving that a new treatment is statistically and clinically superior to an existing standard [72], [76]. One of the key issues in this type of study design for clinical trials is determining the appropriate sample size. To do this effectively, a minimum clinically significant difference should be determined before the study begins, and the sample size should be calculated based on this difference. However, in clinical laboratory settings, this approach must be adapted. In laboratory method comparison studies (especially for determining the required sample size), the acceptable clinical distance delta (δ) (i.e., the true difference in means between two groups) may be used instead of the minimum clinically significant difference. For superiority studies to be used in method comparison studies, bias must be determined by one of the following methods: (a) analysis of reference materials, (b) recovery experiments using spiked samples, or (c) comparison with results obtained with another method [77]. The bias of each method is calculated separately.

For example, a new method (Method A) and an existing method (Method B) can be compared by calculating Bias A and B (Figure 1). The obtained data are evaluated with an appropriate hypothesis test, for example, the paired t-test. If a statistically significant difference is shown, Bias A and B are significantly different. If Bias A is closer to the true value than Bias B, Method A is considered superior (Figure 1A). Thus, the new method not only shows a statistical difference but also significantly improves trueness.

Figure 1: Graphical representation of the four study designs. (A) Graphical representation of mean differences with confidence intervals for superiority studies. The mean difference represents the average difference between bias A and bias B. Thick gray line = allowable bias indicating acceptable measurement error. The dashed line = δ (delta) represents the predefined superiority margin. Thin gray line = zero point, indicating no difference between groups. (B) Graphical representation of equivalence, non-inferiority, and inferiority studies. The dashed line δ (delta) represents the predefined equivalence margin. Thin gray line = zero point, indicating no difference between groups.

Equivalence studies can be designed to verify that a new method performs similarly to an existing method within certain predefined limits. These studies are beneficial when the new method offers advantages such as lower cost, improved safety, faster turnaround time, or greater availability without compromising accuracy. In equivalence studies, scientists must specify a predefined margin (denoted as δ) within which a new method must fall to be considered equivalent [78]. This margin must be carefully chosen to ensure that any clinically meaningful difference between the two methods can be detected. The objective is to show that the difference between the new and existing methods lies entirely within the range of −δ to +δ, indicating that the methods are statistically equivalent [79] (Figure 1B). In the clinical laboratory, this type of study can determine whether the difference between the diagnostic information provided by the new method and that provided by the reference or existing method lies within a predetermined statistical margin, given the uncertainty of the results. If the difference between the two methods is within the predetermined margins, the new method can be used interchangeably with the reference or existing method without compromising diagnostic accuracy.

Inferiority studies can be designed to determine if the effect of a new method is statistically poorer than a reference or an existing method (Figure 1B). Such studies are less common but may be necessary when evidence of inferiority is essential, such as demonstrating the inadequacy of a widely used but potentially inaccurate analytical method or confirming the limitations or errors associated with a specific measurement procedure [72]. Hypothesis formulation in inferiority studies is specifically designed to determine if there is a significant difference (usually with a defined margin) in the new method's performance compared to the existing method. Inferiority is shown when performance falls outside the desired margin in the negative direction. Inferiority studies require careful consideration of their ethical implications, particularly in clinical settings. The decision to conduct such a study must be supported by strong justification, particularly considering the possible consequences of demonstrating that a new method is of inferior quality; this may lead to discontinuing a potentially harmful or less effective method. Rejecting the null hypothesis indicates that the new method is inferior, meaning it does not meet the performance standards of the reference method. These studies are essential when evaluating whether a new method should be adopted or rejected based on its diagnostic performance or whether an existing method should be retired from clinical use.

Non-inferiority studies can be designed to determine if the performance of a new method is not significantly worse than that of a reference method or an existing method within a predefined acceptable margin (Figure 1B). These studies are commonly used in clinical laboratories to evaluate whether a new analytical method can provide results comparable to those of an established method while offering potential advantages such as reduced costs, improved efficiency, or easier implementation. Hypothesis formulation in non-inferiority studies is specifically structured to determine if the new method falls within the acceptable range of performance when compared to the existing method. Rejecting the null hypothesis in a non-inferiority study confirms that the new method is not significantly worse than the control method within the acceptable margin, thereby supporting its adoption. These studies are critical in clinical and diagnostic research, especially when a new method offers additional practical benefits beyond efficacy, for example, a lower cost.

Non-inferiority studies aim to show that the new test is no worse than the current or gold standard test, whereas equivalence studies aim to show that the new test is neither better nor worse. In their application to the clinical laboratory, the two designs are closely related, but a method that is not equivalent can still be non-inferior. For example, when two methods are compared and a statistically significant difference is found (not equivalent), the difference may still fall on the acceptable side of the clinical margin. Suppose the mean difference is +0.4 % with a 95 % CI of 0.1 to 0.7 % against a margin of ±0.5 %: the CI excludes zero and extends beyond +0.5 % (not equivalent), yet its lower bound lies well above −0.5 % (non-inferior). In this case, the two methods are not equivalent, but the new method is non-inferior.

Examples of implementation of superiority, equivalence, inferiority, and non-inferiority studies in clinical laboratories

Comparing the analytical performance of two methods in clinical laboratories involves assessing key characteristics such as accuracy, precision, linearity, sensitivity, specificity, and overall agreement. This process includes designing studies with comparable samples and conditions and evaluating parameters like bias, coefficient of variation, regression analysis, and total analytical error (TAE). Statistical methods such as t-tests, Bland-Altman plots, Passing-Bablok regression, and Deming regression are employed to compare results [7], [17], [18], [19], [20]. The methods are then evaluated against APS based on biological variation or regulatory guidelines. Practical considerations, including clinical relevance, cost, and operational feasibility, further guide decision-making. For example, comparing glucose measurement methods using metrics such as bias, precision, and agreement can help determine the more suitable method for clinical use, provided both meet predefined standards. This systematic approach ensures reliable and clinically meaningful results. Superiority, equivalence, inferiority, and non-inferiority studies can each be implemented in an analytical performance comparison of two methods.

This section provides examples of the implementation of each of the four study types with a step-by-step guide to compare the trueness (accuracy) of the two methods.

(a) Scenario for a superiority study

To determine whether a new glucose measurement method (Method A) yields superior results compared to an established method (Method B), the bias of Method A can be compared to that of Method B. This comparison will evaluate whether the bias of the new method is statistically closer to the true value, representing a novel approach to assessing method superiority (Figure 2A).

Figure 2: Graphical representation of superiority, equivalence, non-inferiority, and inferiority approaches in method comparison studies. (A) Superiority. The normal distribution curves represent the distribution of measured values for each method (Method A and Method B). The solid black lines indicate the means, and the grey areas represent the 95 % confidence intervals (CIs). The figure also shows the difference between Method A and the true value (bias A) and between Method B and the true value (bias B). (B) Equivalence. The figure illustrates the concept of an equivalence study using the 95 % CI of the mean difference between the two methods, compared to predefined equivalence margins (±0.5). Here, the normal distribution curve represents the distribution of the mean difference. (C) Non-inferiority. The normal distribution curve reflects the distribution of the mean difference between two HbA1c measurement methods. The predefined non-inferiority margin is set at −0.5 %. For Method B to be considered non-inferior to Method A, the lower limit of the 95 % CI must lie above this margin. In this example, the CI remains above −0.5 %, indicating that Method B is not significantly worse than Method A and can be considered non-inferior. (D) Inferiority. The normal distribution curve represents the distribution of the mean difference between the two measurement methods. The 95 % CI lies entirely below the non-inferiority margin set at −0.5, indicating that the new method is statistically worse than the established method, as the CI does not reach zero or stay within an acceptable range.

Before commencing the study, it is essential to calculate the required sample size. The following data are needed to perform this calculation:

True Value: Reference materials are essential to detect bias, but optimal reference materials are not always available, especially when the measured quantity cannot be precisely defined. Comparison with a reference method can also estimate bias [80]. This example assumes a true glucose concentration of 5.6 mmol/L.

Important Allowable Difference (Margin): The maximum clinically acceptable difference between the two methods is 2.3 % [81]. The allowable margin is estimated using the following formula:

Absolute bias = (Percent bias/100) × True value = (2.3/100) × 5.6 = ±0.13 mmol/L

Statistical Power: The probability of detecting a true difference between the methods when it exists, typically set at 80 % or 90 % to minimize Type II errors.

Significance Level (Alpha): The threshold for determining statistical significance, commonly set at 0.05, representing a 5 % chance of a Type I error.

Standard Deviation: From historical data or preliminary studies, an estimate of the variability in plasma glucose measurements across the population can be derived, e.g., a standard deviation of 0.15 mmol/L (σ) from a previous study [82].

The Z-values for the standard normal distribution are:

Zα/2=1.96 for a two-tailed test at α=0.05

Zβ = 0.84 for β = 0.2

The required sample size is calculated as follows:

N = 2σ²(Zα/2 + Zβ)²/δ²

N = (2 × 0.15² × (1.96 + 0.84)²)/0.13²

N = (2 × 0.0225 × 7.84)/0.0169

N ≈ 21
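The calculation above can also be scripted. The following Python sketch (assuming scipy is available for the normal quantiles; the function name is ours, not from any guideline) implements the same two-sided sample-size formula and reproduces the worked result:

```python
import math
from scipy.stats import norm

def sample_size_two_sided(sigma, delta, alpha=0.05, power=0.80):
    """N = 2 * sigma^2 * (Z_alpha/2 + Z_beta)^2 / delta^2, rounded up."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80 % power
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta)**2 / delta**2)

# Worked glucose example from the text: sigma = 0.15 mmol/L, delta = 0.13 mmol/L
print(sample_size_two_sided(sigma=0.15, delta=0.13))  # -> 21
```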

Hypothesis: Method A is superior to Method B, meaning that Method A provides results closer to the true value.

Null Hypothesis (H0): The bias of Method A (difference from the true value) is greater than or equal to the bias of Method B.

H0: BiasA ≥ BiasB

Alternative Hypothesis (H1): The bias of Method A is less than that of Method B.

H1: BiasA < BiasB

Data analysis

After the reference material is divided into 21 sample aliquots and measured for glucose levels with both methods, the mean bias is calculated as follows:

Mean BiasA = Σ(Method A glucose − True value)/Number of samples

Mean BiasB = Σ(Method B glucose − True value)/Number of samples

A t-test or other appropriate statistical test (e.g., observation of overlapping 95 % CI) is used to compare the mean bias. The goal is to determine whether the bias of Method A is statistically smaller than that of Method B. If the 95 % CI for the difference in biases falls below zero, indicating that Method A has a significantly smaller bias than Method B, Method A is considered superior. If the 95 % CI includes zero or the bias of Method A is not significantly smaller, Method A cannot be considered superior (Figure 1A).
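A hedged sketch of this analysis in Python is shown below (entirely synthetic data; the deviations, biases, and seed are hypothetical, and numpy with scipy ≥ 1.10 is assumed for `confidence_interval`). It applies a paired t-test to the per-aliquot absolute deviations from the true value and inspects the CI of their difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
TRUE_VALUE = 5.6                                         # mmol/L, from the scenario
method_a = TRUE_VALUE + rng.normal(0.02, 0.15, size=21)  # hypothetical results
method_b = TRUE_VALUE + rng.normal(0.12, 0.15, size=21)

dev_a = np.abs(method_a - TRUE_VALUE)   # per-aliquot deviation of Method A
dev_b = np.abs(method_b - TRUE_VALUE)   # per-aliquot deviation of Method B

res = stats.ttest_rel(dev_a, dev_b)     # paired t-test on the deviations
lo, hi = res.confidence_interval(0.95)  # CI for mean(dev_a - dev_b)
print(f"p = {res.pvalue:.4f}; 95 % CI = ({lo:.3f}, {hi:.3f})")
if hi < 0:
    print("Method A deviates less from the true value: superiority supported")
```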

(b) Scenario for an equivalence study

A comparison study will be conducted between two glycated hemoglobin (HbA1c) measurement methods. Method A represents the currently established HbA1c measurement method in the laboratory, while Method B is the new HbA1c measurement method under evaluation. Before starting the study, it is essential to determine the sample size. The necessary data for this calculation includes:

Clinically Important Difference (Margin): The maximum clinically acceptable difference between the two methods is ±0.5 % [83].

Statistical Power: The probability of detecting a true difference between the methods when it exists, typically set at 80 % or 90 % to minimize Type II errors.

Significance Level (Alpha): The threshold for determining statistical significance, commonly set at 0.05, representing a 5 % chance of a Type I error.

Standard Deviation: An estimate of the variability in HbA1c measurements within the population is derived from historical data or preliminary studies. A standard deviation of 0.9 % from a previous study [31] can be used.

The Z-values for the standard normal distribution are:

Zα/2=1.96 for a two-tailed test at α=0.05

Zβ = 0.84 for β = 0.2

The required sample size is calculated as follows:

N = 2σ²(Zα/2 + Zβ)²/δ²

N = (2 × 0.9² × (1.96 + 0.84)²)/0.5²

N = (2 × 0.81 × 7.84)/0.25

N ≈ 51

Hypothesis: Method A is equivalent to Method B within the predefined margin of ±0.5 %.

Null Hypothesis (H0): The mean difference in HbA1c measurements between Method A and Method B is outside the equivalence margin of ±0.5 %.

H0: |MeanA − MeanB| ≥ 0.5 %

Alternative Hypothesis (H1): The mean difference in HbA1c measurements between Method A and Method B is within the equivalence margin of ±0.5 %.

H1: |MeanA − MeanB| < 0.5 %

Data analysis

HbA1c measurement is performed on 51 independent blood samples using both methods. The mean difference is calculated from the results obtained, and the 95 % CI of this mean difference is determined.

The two methods are equivalent if the 95 % CI falls entirely within ±0.5 % (Figure 2B). Equivalence cannot be confirmed if the 95 % CI extends beyond ±0.5 %.
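The equivalence decision rule can be expressed in a few lines of Python (synthetic per-sample differences; numpy and a recent scipy assumed): equivalence is accepted only if the entire 95 % CI of the mean difference lies within ±0.5 %:

```python
import numpy as np
from scipy import stats

MARGIN = 0.5                          # predefined equivalence margin, % HbA1c
rng = np.random.default_rng(21)
diff = rng.normal(0.1, 0.9, size=51)  # hypothetical per-sample (B - A) differences

res = stats.ttest_1samp(diff, popmean=0.0)
lo, hi = res.confidence_interval(0.95)
equivalent = (lo > -MARGIN) and (hi < MARGIN)
print(f"95 % CI = ({lo:.2f}, {hi:.2f}) -> equivalent: {equivalent}")
```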

(c) Scenario for a non-inferiority study

In this example, the comparison involves two methods for measuring HbA1c, similar to the previous equivalence study. The goal is to demonstrate that any statistical difference between Method B and Method A falls within a clinically acceptable margin. Specifically, if the difference between the two methods is statistically significant but remains within the predefined clinical margin, Method B can be considered non-inferior to Method A. Before starting the study, it is crucial to determine the required sample size. The necessary information for this calculation includes:

Clinically Important Difference (Margin): The maximum clinically acceptable difference between the two methods is 0.5 % [83].

Statistical Power: The probability of detecting a true difference between the methods when it exists, typically set at 80 % or 90 % to minimize Type II errors.

Significance Level (Alpha): The threshold for determining statistical significance, commonly set at 0.05, representing a 5 % chance of a Type I error.

Standard Deviation: An estimate of the variability in HbA1c measurements within the population is derived from historical data or preliminary studies. A standard deviation of 0.9 % from a previous study [31] can be used.

Non-inferiority studies aim to show that a new method is not worse than an existing method within a predefined one-sided margin. In this case:

The Z-values for the standard normal distribution are:

Zα=1.65 for a one-tailed test at α=0.05.

Zβ = 0.84 for β = 0.2

The required sample size is calculated as follows:

N = σ²(Zα + Zβ)²/δ²

N = (0.9² × (1.65 + 0.84)²)/0.5²

N = (0.81 × 6.2)/0.25

N ≈ 20

Hypothesis: Method B is non-inferior to Method A.

Null Hypothesis (H0): The mean difference (Method B − Method A) in HbA1c measurements is less than or equal to −0.5 %, indicating that Method B is inferior.

H0: MeanB − MeanA ≤ −0.5 %

Alternative Hypothesis (H1): The mean difference (Method B − Method A) in HbA1c measurements is greater than −0.5 %, indicating that Method B is not inferior to Method A.

H1: MeanB − MeanA > −0.5 %

Data analysis

After measuring HbA1c levels in 20 independent blood samples using both methods, the mean difference between the two methods and its 95 % CI are calculated. To conclude non-inferiority, the lower limit of the 95 % CI must be greater than the non-inferiority margin of −0.5 %. If the lower bound of the 95 % CI is greater than −0.5 %, Method B is considered non-inferior to Method A (Figure 2C); if it is less than −0.5 %, non-inferiority cannot be concluded.
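Once the CI is available, the non-inferiority check reduces to a single comparison against the margin; a sketch with simulated paired data (all values hypothetical):

```python
# Non-inferiority check: the lower limit of the 95 % CI of (B - A)
# must exceed the -0.5 % margin. Simulated data for illustration only.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
a = rng.normal(7.0, 0.9, 20)      # Method A results (simulated)
b = a + rng.normal(0.1, 0.2, 20)  # Method B results (simulated)

d = b - a
se = d.std(ddof=1) / np.sqrt(len(d))
lower = d.mean() - t.ppf(0.975, len(d) - 1) * se  # lower 95 % CI limit

print(f"lower limit: {lower:.2f}; non-inferior: {lower > -0.5}")
```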

(d) Scenario for an inferiority study

The primary purpose of inferiority studies is to show that a method – which may be a new method being considered for the laboratory – is inferior to an established method in terms of performance or cost-effectiveness. From a practical perspective, inferiority studies are rarely conducted in clinical laboratories, as method comparison studies typically aim to show that a new method performs at least as well as an existing one. However, showing that one method is inferior to another may be important when the new method is suspected of producing incorrect results; proving its inferiority is then crucial to prevent potentially harmful consequences for patient care. If a non-inferiority or equivalence study has already been conducted and shows that the new method is equivalent or non-inferior to the existing method, inferiority is automatically ruled out. Conversely, if the observed difference between the two methods exceeds the predefined clinical margin and the entire 95 % CI lies outside the region of equivalence or non-inferiority, the new method can be considered inferior (Figure 2D).
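Taken together, the four designs amount to reading where the 95 % CI of the difference (Method B − Method A) sits relative to the clinical margin. The small helper below maps an interval onto the conclusions sketched in Figure 2; the function and example values are illustrative assumptions, not part of the paper:

```python
# Map a 95 % CI for (Method B - Method A) onto a study conclusion,
# assuming a symmetric clinical margin. Illustrative helper only.
def classify(lo: float, hi: float, margin: float = 0.5) -> str:
    if -margin < lo and hi < margin:
        return "equivalent"     # CI entirely within the margin (Figure 2B)
    if lo > -margin:
        return "non-inferior"   # lower bound clears -margin (Figure 2C)
    if hi < -margin:
        return "inferior"       # CI entirely below -margin (Figure 2D)
    return "inconclusive"       # CI straddles the margin

print(classify(-0.2, 0.3))   # equivalent
print(classify(-0.9, -0.6))  # inferior
```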

Take home messages

  1. Understand the distinction between statistical and clinical significance and be aware of common misconceptions regarding p-values and their interpretation. Consider using confidence intervals when interpreting study results.

  2. Familiarize yourself with different types of studies (superiority, equivalence, inferiority, and non-inferiority) and recognize the importance of a priori power calculations and sample size determination in study design.

  3. Understand the hierarchy of evidence in clinical studies and its relevance to clinical laboratories. Consider the Fryback and Thornbury model when evaluating the efficacy of diagnostic tests.

  4. Be cautious when interpreting results from a single study and recognize the value of meta-analyses and systematic reviews. Consider analytical performance specifications and clinical outcomes when evaluating new methods or tests.

  5. Appreciate the complexity of establishing clinical significance and the need for multiple well-designed studies to support conclusions.

  6. When reporting or interpreting study results, follow established guidelines such as CONSORT, STARD, or REMARK as appropriate, and remember that statistical tools are models of reality, not reality itself.

Summary

Method evaluation in the clinical laboratory is a complex procedure involving assessment of both analytical and clinical aspects. During this process, statistical tools determine whether a measurement procedure’s analytical and clinical performance meets acceptable standards. When evaluating a laboratory assay, it is essential to distinguish between statistical and clinical significance, as they address different aspects of the assay’s utility. Specifically, in a clinical context, it is crucial to differentiate between statistical significance – whether an observed effect is unlikely to have arisen by chance – and clinical significance – whether the results are meaningful for patient care.

NHST is not widely used in clinical laboratories, but its four study designs – superiority, equivalence, inferiority, and non-inferiority – are increasingly applied by regulatory agencies. These studies can be evaluated from statistical or clinical perspectives, each with distinct advantages and limitations. Each of these study designs serves a distinct purpose, and their selection depends on the clinical laboratory’s objectives and clinical requirements. The focus and challenge for the clinical laboratory is to move from the traditional concepts of measurement technology to considering clinical design in the development and implementation of method evaluation studies.


Corresponding author: Ronda F. Greaves, Associate Professor, Murdoch Children’s Research Institute, Melbourne, VIC, Australia; and Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia, E-mail:
Hamit Hakan Alp and Mai Thi Chi Tran share first authorship.
  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: All authors are members of the IFCC WG-MEP. The paper was initially developed as part of the writing group of three members, ET, MT and HA, led by ET where the literature was reviewed and initial content ideas were explored. The other authors then developed the structure, wrote sections of the manuscript, edited the final versions and reviewed the entire content. The authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interests: The authors state no conflict of interest.

  6. Research funding: None declared.

  7. Data availability: Not applicable.

  8. Disclaimer: Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the IFCC.

References

1. Nordin, G, Dybkaer, R, Forsum, U, Fuentes-Arderiu, X, Pontet, F. Vocabulary on nominal property, examination, and related concepts for clinical laboratory sciences (IFCC-IUPAC Recommendations 2017). Pure Appl Chem 2018;90:913–35. doi:10.1515/pac-2011-0613.

2. Mari, L, Maul, A, Torres Irribarra, D, Wilson, M. Quantities, quantification, and the necessary and sufficient conditions for measurement. Measurement 2017;100:115–21. doi:10.1016/j.measurement.2016.12.050.

3. Jenicek, M. Foundations of evidence-based medicine: clinical epidemiology and beyond, 2nd ed. Boca Raton: CRC Press; 2021:420 p.

4. Szucs, D, Ioannidis, JPA. When null hypothesis significance testing is unsuitable for research: a reassessment. Front Hum Neurosci 2017;11:390. doi:10.3389/fnhum.2017.00390.

5. Jensen, AL, Kjelgaard-Hansen, M. Method comparison in the clinical laboratory. Vet Clin Pathol 2006;35:276–86. doi:10.1111/j.1939-165X.2006.tb00131.x.

6. Sterne, JA, Davey Smith, G. Sifting the evidence – what’s wrong with significance tests? BMJ 2001;322:226–31. doi:10.1136/bmj.322.7280.226.

7. Student. The probable error of a mean. Biometrika 1908;6:1–25. doi:10.2307/2331554.

8. Fisher, RA. Statistical tests of agreement between observation and hypothesis. Economica 1923:139–47. doi:10.2307/2548482.

9. Fisher, RA. Statistical methods for research workers, 11th rev. ed. Edinburgh: Oliver & Boyd; 1925.

10. Neyman, J, Pearson, ES. On the use and interpretation of certain test criteria for purposes of statistical inference, part I. Biometrika 1928;20A:175–240. doi:10.1093/biomet/20A.1-2.175.

11. Wald, A. Statistical decision functions. Oxford, England: Wiley; 1950:179 p.

12. Cumming, G, Calin-Jageman, R. Introduction to the new statistics: estimation, open science, and beyond. New York: Routledge, Taylor and Francis Group; 2017:560 p. doi:10.4324/9781315708607.

13. Dienes, Z. Understanding psychology as a science: an introduction to scientific and statistical inference. Basingstoke: Palgrave Macmillan; 2008.

14. Goodman, S. A dirty dozen: twelve p-value misconceptions. Semin Hematol 2008;45:135–40. doi:10.1053/j.seminhematol.2008.04.003.

15. Harlow, LL, Mulaik, SA, Steiger, J. What if there were no significance tests? New York: Routledge, Taylor and Francis Group; 2016:395 p. doi:10.4324/9781315629049.

16. Ziliak, ST, McCloskey, DN. The cult of statistical significance: how the standard error costs us jobs, justice, and lives. Ann Arbor, MI: University of Michigan Press; 2008. doi:10.3998/mpub.186351.

17. Bland, JM, Altman, DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999;8:135–60. doi:10.1191/096228099673819272.

18. Deming, WE. Statistical adjustment of data. Dover books on mathematics. Mineola, New York: Dover Publications; 2011:288 p.

19. Passing, H, Bablok, W. A new biometrical procedure for testing the equality of measurements from two different analytical methods. Application of linear regression procedures for method comparison studies in clinical chemistry, part I. J Clin Chem Clin Biochem 1983;21:709–20. doi:10.1515/cclm.1983.21.11.709.

20. Passing, H, Bablok, W. Comparison of several regression procedures for method comparison studies and determination of sample sizes. Application of linear regression procedures for method comparison studies in clinical chemistry, part II. J Clin Chem Clin Biochem 1984;22:431–45. doi:10.1515/cclm.1984.22.6.431.

21. Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 2016;533:452–4. doi:10.1038/533452a.

22. Ioannidis, JP. Why most published research findings are false. PLoS Med 2005;2:e124. doi:10.1371/journal.pmed.0020124.

23. Amrhein, V, Trafimow, D, Greenland, S. Inferential statistics as descriptive statistics: there is no replication crisis if we don’t expect replication. Am Statistician 2019;73:262–70. doi:10.1080/00031305.2018.1543137.

24. Hasan, U. Statistical and practical significance of articles at sports biomechanics conferences. Annals of Applied Sport Science 2021;9. doi:10.52547/aassjournal.947.

25. Greenland, S, Senn, SJ, Rothman, KJ, Carlin, JB, Poole, C, Goodman, SN, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 2016;31:337–50. doi:10.1007/s10654-016-0149-3.

26. Kennedy-Shaffer, L. Before p < 0.05 to beyond p < 0.05: using history to contextualize p-values and significance testing. Am Stat 2019;73:82–90. doi:10.1080/00031305.2018.1537891.

27. Hume, D. A treatise of human nature. Oxford: Clarendon Press; 1896:435 p.

28. Popper, KR. The logic of scientific discovery. London and New York: Routledge, Taylor and Francis e-Library; 2005. doi:10.4324/9780203994627.

29. Acree, MC. The myth of statistical inference. Switzerland: Springer Nature; 2021. doi:10.1007/978-3-030-73257-8.

30. Hill, AB. The environment and disease: association or causation? Proc Roy Soc Med 1965;58:295–300. doi:10.1177/003591576505800503.

31. Boumans, M. Battle in the planning office: field experts versus normative statisticians. Soc Epistemol 2008;22:389–404. doi:10.1080/02691720802559453.

32. Armijo-Olivo, S. The importance of determining the clinical significance of research results in physical therapy clinical research. Braz J Phys Ther 2018;22:175–6. doi:10.1016/j.bjpt.2018.02.001.

33. Moher, D, Hopewell, S, Schulz, KF, Montori, V, Gøtzsche, PC, Devereaux, PJ, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. Int J Surg 2012;10:28–55. doi:10.1016/j.ijsu.2011.10.001.

34. Schulz, KF, Altman, DG, Moher, D. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. Br Med J 2010;340:c332. doi:10.1136/bmj.c332.

35. Moher, D, Schulz, KF, Altman, DG. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 2001;357:1191–4. doi:10.1016/S0140-6736(00)04337-3.

36. Moher, D, Hopewell, S, Schulz, KF, Montori, V, Gøtzsche, PC, Devereaux, PJ, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. Br Med J 2010;340:c869. doi:10.1136/bmj.c869.

37. Moher, D, Cook, DJ, Eastwood, S, Olkin, I, Rennie, D, Stroup, DF. Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement. Lancet 1999;354:1896–900. doi:10.1016/S0140-6736(99)04149-5.

38. Vandenbroucke, JP. The making of STROBE. Epidemiology 2007;18:797–9. doi:10.1097/EDE.0b013e318157725d.

39. von Elm, E, Altman, DG, Egger, M, Pocock, SJ, Gøtzsche, PC, Vandenbroucke, JP. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Br Med J 2007;335:806–8. doi:10.1136/bmj.39335.541782.AD.

40. Bossuyt, PM, Reitsma, JB, Bruns, DE, Gatsonis, CA, Glasziou, PP, Irwig, LM, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem 2003;49:7–18. doi:10.1373/49.1.7.

41. McQueen, MJ. The STARD initiative: a possible link to diagnostic accuracy and reduction in medical error. Ann Clin Biochem 2003;40:307–8. doi:10.1258/000456303766476940.

42. Bossuyt, PM, Reitsma, JB, Bruns, DE, Gatsonis, CA, Glasziou, PP, Irwig, LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Br Med J 2003;326:41–4. doi:10.1136/bmj.326.7379.41.

43. McShane, LM, Altman, DG, Sauerbrei, W, Taube, SE, Gion, M, Clark, GM. REporting recommendations for tumour MARKer prognostic studies (REMARK). Br J Cancer 2005;93:387–91. doi:10.1038/sj.bjc.6602678.

44. Büttner, J. History of clinical chemistry. Berlin and New York: Walter de Gruyter; 1983.

45. Tonks, DB. A study of the accuracy and precision of clinical chemistry determinations in 170 Canadian laboratories. Clin Chem 1963;9:217–33. doi:10.1093/clinchem/9.2.217.

46. Laessig, RH. Medical need for quality specifications within laboratory medicine. Upsala J Med Sci 1990;95:233–44. doi:10.3109/03009739009178595.

47. Sandberg, S, Fraser, CG, Horvath, AR, Jansen, R, Jones, G, Oosterhuis, W, et al. Defining analytical performance specifications: consensus statement from the 1st strategic conference of the European Federation of Clinical Chemistry and Laboratory Medicine. Clin Chem Lab Med 2015;53:833–5. doi:10.1515/cclm-2015-0067.

48. Horvath, AR, Bell, KJL, Ceriotti, F, Jones, GRD, Loh, TP, Lord, S, et al. Outcome-based analytical performance specifications: current status and future challenges. Clin Chem Lab Med 2024;62:1474–82. doi:10.1515/cclm-2024-0125.

49. Jones, GRD, Bell, KJL, Ceriotti, F, Loh, TP, Lord, S, Sandberg, S, et al. Applying the Milan models to setting analytical performance specifications – considering all the information. Clin Chem Lab Med 2024;62:1531–7. doi:10.1515/cclm-2024-0104.

50. Bossuyt, PM. Interpreting diagnostic test accuracy studies. Semin Hematol 2008;45:189–95. doi:10.1053/j.seminhematol.2008.04.001.

51. Bossuyt, PM. The quality of reporting in diagnostic test research: getting better, still not optimal. Clin Chem 2004;50:465–6. doi:10.1373/clinchem.2003.029736.

52. Bossuyt, PM, Deeks, JJ, Leeflang, MM, Takwoingi, Y, Flemyng, E. Evaluating medical tests: introducing the Cochrane handbook for systematic reviews of diagnostic test accuracy. Cochrane Database Syst Rev 2023;7:ED000163. doi:10.1002/14651858.ED000163.

53. Bossuyt, PM, Reitsma, JB, Bruns, DE, Gatsonis, CA, Glasziou, PP, Irwig, L, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. Br Med J 2015;351:h5527. doi:10.1136/bmj.h5527.

54. Yang, B, Mustafa, RA, Bossuyt, PM, Brozek, J, Hultcrantz, M, Leeflang, MMG, et al. GRADE guidance: 31. Assessing the certainty across a body of evidence for comparative test accuracy. J Clin Epidemiol 2021;136:146–56. doi:10.1016/j.jclinepi.2021.04.001.

55. Lijmer, JG, Bossuyt, PM. Various randomized designs can be used to evaluate medical tests. J Clin Epidemiol 2009;62:364–73. doi:10.1016/j.jclinepi.2008.06.017.

56. Lijmer, JG, Leeflang, M, Bossuyt, PM. Proposals for a phased evaluation of medical tests. Med Decis Mak 2009;29:E13–21. doi:10.1177/0272989X09336144.

57. Greenhalgh, T. How to read a paper. Getting your bearings (deciding what the paper is about). Br Med J 1997;315:243–6. doi:10.1136/bmj.315.7102.243.

58. Guyatt, GH, Sackett, DL, Sinclair, JC, Hayward, R, Cook, DJ, Cook, RJ. Users’ guides to the medical literature. IX. A method for grading health care recommendations. Evidence-Based Medicine Working Group. JAMA 1995;274:1800–4. doi:10.1001/jama.1995.03530220066035.

59. Atkins, D, Eccles, M, Flottorp, S, Guyatt, GH, Henry, D, Hill, S, et al. Systems for grading the quality of evidence and the strength of recommendations I: critical appraisal of existing approaches. The GRADE Working Group. BMC Health Serv Res 2004;4:38. doi:10.1186/1472-6963-4-38.

60. McCall, W. Levels of evidence, causality, and clinical significance. J ECT 2007;23:69–70. doi:10.1097/YCT.0b013e31805c0871.

61. Melnyk, BM. Integrating levels of evidence into clinical decision making. Pediatr Nurs 2004;30:323–5.

62. Athanasiou, T, Darzi, A. Evidence synthesis in healthcare: a practical handbook for clinicians. London: Springer; 2011.

63. Guyatt, G, Rennie, D, Meade, M, Cook, D. Users’ guides to the medical literature: a manual for evidence-based clinical practice. New York, NY: McGraw-Hill Education; 2015.

64. Guyatt, GH, Oxman, AD, Vist, GE, Kunz, R, Falck-Ytter, Y, Alonso-Coello, P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. Br Med J 2008;336:924–6. doi:10.1136/bmj.39489.470347.AD.

65. Kavanagh, BP. The GRADE system for rating clinical guidelines. PLoS Med 2009;6:e1000094. doi:10.1371/journal.pmed.1000094.

66. Allan, GM, Finley, CR, McCormack, J, Kumar, V, Kwong, S, Braschi, E, et al. Are potentially clinically meaningful benefits misinterpreted in cardiovascular randomized trials? A systematic examination of statistical significance, clinical significance, and authors’ conclusions. BMC Med 2017;15:58. doi:10.1186/s12916-017-0821-9.

67. Fryback, DG, Thornbury, JR. The efficacy of diagnostic imaging. Med Decis Mak 1991;11:88–94. doi:10.1177/0272989X9101100203.

68. Duarte, PS. Give to Fryback what is Fryback’s, and to new PET technologies what is new PET technologies. Eur J Nucl Med Mol Imag 2021;48:2676–7. doi:10.1007/s00259-021-05454-5.

69. Food and Drug Administration (FDA). Non-inferiority clinical trials to establish effectiveness: guidance for industry. Silver Spring, MD: Food and Drug Administration; 2016:56 p.

70. Mauri, L, D’Agostino, RB. Challenges in the design and interpretation of noninferiority trials. N Engl J Med 2017;377:1357–67. doi:10.1056/NEJMra1510063.

71. Wellek, S. Testing statistical hypotheses of equivalence and noninferiority, 2nd ed. New York: Chapman and Hall/CRC; 2010. doi:10.1201/EBK1439808184.

72. Stefanos, R, Graziella, D, Giovanni, T. Methodological aspects of superiority, equivalence, and non-inferiority trials. Intern Emerg Med 2020;15:1085–91. doi:10.1007/s11739-020-02450-9.

73. Serdar, CC, Cihan, M, Yücel, D, Serdar, MA. Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies. Biochem Med (Zagreb) 2021;31:010502. doi:10.11613/BM.2021.010502.

74. Banerjee, A, Chitnis, UB, Jadhav, SL, Bhawalkar, JS, Chaudhury, S. Hypothesis testing, type I and type II errors. Ind Psychiatry J 2009;18:127–31. doi:10.4103/0972-6748.62274.

75. Sun, F, Bruening, W, Erinoff, E, Schoelles, KM. AHRQ methods for effective health care. Addressing challenges in genetic test evaluation: evaluation frameworks and assessment of analytic validity. Rockville, MD: Agency for Healthcare Research and Quality (US); 2011.

76. Wang, B, Wang, H, Tu, XM, Feng, C. Comparisons of superiority, non-inferiority, and equivalence trials. Shanghai Arch Psychiatry 2017;29:385–8.

77. Cantwell, H. Eurachem guide: the fitness for purpose of analytical methods – a laboratory guide to method validation and related topics. Bucharest, Romania: Eurachem; 2025:86 p. www.eurachem.org.

78. Christensen, E. Methodology of superiority vs. equivalence trials and non-inferiority trials. J Hepatol 2007;46:947–54. doi:10.1016/j.jhep.2007.02.015.

79. Walker, J. Non-inferiority statistics and equivalence studies. BJA Educ 2019;19:267–71. doi:10.1016/j.bjae.2019.03.004.

80. Oosterhuis, WP, Bayat, H, Armbruster, D, Coskun, A, Freeman, KP, Kallner, A, et al. The use of error and uncertainty methods in the medical laboratory. Clin Chem Lab Med 2018;56:209–19. doi:10.1515/cclm-2017-0341.

81. Aarsand, AK, Díaz-Garzón, J, Fernandez-Calle, P, Guerra, E, Locatelli, M, Bartlett, WA, et al. The EuBIVAS: within- and between-subject biological variation data for electrolytes, lipids, urea, uric acid, total protein, total bilirubin, direct bilirubin, and glucose. Clin Chem 2018;64:1380–93. doi:10.1373/clinchem.2018.288415.

82. Baygutalp, NK, Bakan, E, Bayraktutan, Z, Umudum, FZ. The comparison of two glucose measurement systems: POCT devices versus central laboratory. Turk J Biochem 2018;43:510–19. doi:10.1515/tjb-2017-0196.

83. Kaiafa, G, Veneti, S, Polychronopoulos, G, Pilalas, D, Daios, S, Kanellos, I, et al. Is HbA1c an ideal biomarker of well-controlled diabetes? Postgrad Med J 2021;97:380–3. doi:10.1136/postgradmedj-2020-138756.

Received: 2025-02-24
Accepted: 2025-03-17
Published Online: 2025-04-08
Published in Print: 2025-07-28

© 2025 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
