EFLM checklist for the assessment of AI/ML studies in laboratory medicine: enhancing general medical AI frameworks for laboratory-specific applications
Anna Carobene, Janne Cadamuro, Glynis Frans, Hanoch Goldshmidt, Zeljiko Debeljak, Habib Özdemir, Salomon Martin Perez and Helena Lame, on behalf of the European Federation of Clinical Chemistry and Laboratory Medicine Committee on Digitalisation and Artificial Intelligence
Abstract
The integration of artificial intelligence (AI) and machine learning (ML) into laboratory medicine shows promise for advancing diagnostic, prognostic, and decision-support tools; however, routine clinical implementation remains limited and heterogeneous. Laboratory data present unique methodological and semantic complexities – method dependency, analyte-specific variation, and contextual sensitivity – not adequately addressed by general-purpose AI reporting guidelines. To bridge this gap, the EFLM Committee on Digitalisation and Artificial Intelligence (C-AI) proposes an expanded checklist to support assessment of requirements and recommendations for the development of AI/ML models based on laboratory data. Building upon the widely adopted ChAMAI checklist (Checklist for assessment of medical AI), our proposal introduces six additional items, each grounded in the CRoss Industry Standard Process for Data Mining (CRISP-DM) framework and tailored to the specificities of laboratory workflows. These extensions address: (1) explicit documentation of laboratory data characteristics; (2) consideration of biological and analytical variability; (3) the role of metadata and peridata in contextualizing results; (4) analyte harmonization and standardization practices; (5) rigorous external validation with attention to dataset similarity; and (6) the implementation of FAIR data principles for transparency and reproducibility. Together, these recommendations aim to foster robust, interpretable, and generalizable AI systems that are fit for deployment in clinical laboratory settings. By incorporating these laboratory-aware considerations into model development pipelines, researchers and practitioners can enhance both the scientific rigor and practical applicability of AI tools. We advocate for the adoption of this extended checklist by developers, reviewers, and regulators to promote trustworthy and reproducible AI in laboratory medicine.
Introduction
Artificial intelligence (AI) and machine learning (ML) hold considerable promise for laboratory medicine, with applications spanning diagnostic reasoning, risk prediction, and decision support [1], [2]. To date, however, routine clinical implementation remains limited; most published systems are research prototypes rather than tools embedded in day-to-day workflows. For instance, a recent European survey found that only ∼25 % of laboratories reported ongoing AI projects, and a Clinical Chemistry mini review underscored that few published approaches have progressed to clinical deployment [3], [4]. These approaches typically leverage the growing availability of digitally recorded, structured, high-volume laboratory datasets to train predictive models and decision-support systems capable of deriving clinically relevant insights from real-world patient information. At the same time, laboratory data present domain-specific challenges: multiple variables measured repeatedly within individuals, non-random missingness, context dependence (e.g., pre-analytical conditions and measurement methods), high dimensionality, and intrinsic biological variability [5]. These characteristics can complicate model development and transportability, motivating domain-aware methods beyond generic AI frameworks [6].
In this context, a widely adopted reference is the checklist for the (self-)assessment of medical AI (ChAMAI) studies proposed by Cabitza et al., published in the International Journal of Medical Informatics in 2021 (Table 1) [7]. This checklist, grounded in the CRoss Industry Standard Process for Data Mining (CRISP-DM) methodology [8], provides a structured framework for evaluating the methodological rigor, transparency, and reproducibility of AI applications in medicine. Nonetheless, it was conceived primarily with general medical informatics studies in mind and does not explicitly cover the technical and methodological complexities inherent to laboratory-derived data, such as analyte-specific harmonization, meta- and peridata structures, or traceability requirements, all of which are necessary for correct result interpretation (for instance, the interpretation of blood glucose concentrations requires knowledge of the patient’s fasting status at the time of sample collection; similarly, accurate classification of aspartate aminotransferase (AST) results necessitates information on the degree of hemolysis).
Table 1: Laboratory-aware reporting checklist proposed by the EFLM to complement the ChAMAI.a
| # | CRISP-DM phase | ChAMAI item description | # | EFLM checklist item | Details/criteria | Justification |
|---|---|---|---|---|---|---|
| 1. | Business understanding | Clinical objective and problem definition. | | | | |
| | | Intended use and scope of the AI system. | | | | |
| | | Expected clinical impact and stakeholders. | | | | |
| 2. | Data understanding | Description of data sources (e.g., EHRs, LIS, registries). | **2.1** | Specification of laboratory data | Document analytical method, measurement units, instrumentation (manufacturer/model). | Enables reproducibility, method comparability, and meaningful interpretation of laboratory variables. |
| | | Definition of input features and target variable. | **2.2** | Biological and analytical variability | At minimum: quantify uncertainty and report analyte-/method-specific CVA, CVI, CVG (sources: IQC/validation; BV literature); summarize how variability informed modeling and interpretation. Additionally: if data augmentation is used, state which variables were perturbed, distributional assumptions and parameters, parameter provenance (instrument/lot/timeframe), the proportion augmented, and safeguards against leakage (no augmented samples in external validation). Report uncertainty (e.g., CIs, prediction/credible intervals) [9], [10]. | Accounts for robustness issues due to within-subject and measurement-related uncertainty. |
| | | Temporal coverage and data collection period. | **2.3** | Inclusion of metadata and peridata | At minimum: report LOINC code(s) and version for each laboratory variable; specify method-specific vs. method-agnostic LOINC; provide UCUM units, instrument/analyzer identifiers, specimen/system, collection time, and key pre-analytical factors (e.g., hemolysis index, fasting status). Additionally: document the mapping procedure (manual/algorithmic), coverage (% mapped), unresolved/ambiguous mappings, and retain the local-to-standard crosswalk [11]. | Improves semantic richness, auditability, and contextual interpretability; supports FAIR principles. |
| | | Population characteristics and inclusion/exclusion criteria. | | | | |
| | | Class distribution and prevalence of outcome. | | | | |
| | | Handling of missing data. | | | | |
| 3. | Data preparation | Data cleaning and preprocessing steps. | **3.1** | Harmonization and standardization | At minimum: describe standardized/harmonized methods and provide traceability chains and reference materials for calibration. Additionally: use LOINC for test/observation identity, UCUM for units, and SNOMED CT for clinical concepts; report mapping coverage and any unresolved codes. | Reduces risk of bias from method-dependent measurements and ensures cross-site comparability. |
| | | Feature engineering and transformation. | | | | |
| | | Feature selection rationale. | | | | |
| | | Description of training, validation, and test splits. | | | | |
| 4. | Modeling | Algorithms and architectures considered. | | | | |
| | | Model selection and justification. | | | | |
| | | Hyperparameter tuning methods. | | | | |
| | | Cross-validation strategy. | | | | |
| | | Model performance metrics. | | | | |
| | | Calibration of predicted probabilities. | | | | |
| | | Strategies to prevent overfitting. | | | | |
| 5. | Evaluation | Internal validation approach. | **5.1** | External validation and dataset similarity | At minimum: report whether external validation was performed; characterize dataset similarity explicitly by (i) semantic proximity (analyte definition/coding, units, reference intervals, relevant context), (ii) procedural proximity (pre-analytical conditions, methods, instruments/reagents, calibration/traceability), and (iii) cardinality (subjects/episodes/measurements per subject, class prevalence, missingness). Additionally: provide concise similarity/shift summaries (e.g., method comparability, distributional overlap). Ensure no augmented data are used in external validation [12]. | Enhances generalizability and provides insight into model portability and replicability. |
| | | External validation with independent datasets. | | | | |
| | | Robustness analysis and generalizability. | | | | |
| | | Uncertainty quantification. | | | | |
| | | Comparison with standard clinical practice or benchmarks. | | | | |
| 6. | Deployment | Clinical interpretability and explainability. | **6.1** | FAIR data sharing and feature traceability | At minimum: make available preprocessing code, data dictionaries, feature derivations, and terminology mappings (LOINC/UCUM/SNOMED CT) with versions, plus traceability/calibration documentation and local-to-standard crosswalks. Additionally: where data cannot be shared, provide synthetic data or mock schemas that preserve metadata/peridata structure and variable naming. | Ensures transparency, regulatory readiness, and alignment with open science and data stewardship best practices. |
| | | Ethical, legal, and social implications. | | | | |
| | | Availability of code, model, and data. | | | | |
| | | Compliance with reporting standards (e.g., TRIPOD-AI). | | | | |
| | | Plan for monitoring and updating the model post-deployment. | | | | |
aChAMAI refers to the checklist for the (self-)assessment of medical AI studies originally published in the International Journal of Medical Informatics by Cabitza and Campagner [7]. The proposed additions to the different sections are to be considered a complementary extension to the ChAMAI checklist, not a replacement, for AI/ML studies using clinical laboratory data. Note on proportionality: selected entries distinguish “At minimum” (essential) from “Additionally” (optional/desirable) reporting elements. CVA, analytical variation; CVI, within-subject biological variation; CVG, between-subject biological variation; IQC, internal quality control; LOINC, logical observation identifiers names and codes; SNOMED CT, systematized nomenclature of medicine – clinical terms; UCUM, unified code for units of measure. Bold numerals = EFLM extensions; plain numerals = original ChAMAI items.
Therefore, our objective is to deliver a harmonized, laboratory-aware extension to the existing checklist, with the aim of fostering the development of AI systems that are technically robust, clinically interpretable, methodologically transparent, and, where supported by evidence and governance, potentially deployable across diverse laboratory environments.
This paper first outlines the rationale and scope of a laboratory-aware reporting checklist, then details the checklist items with domain-specific guidance and finally discusses implementation considerations and limitations.
Rationale for a laboratory-aware checklist
The need for a laboratory-aware checklist stems from recurrent quality gaps in AI/ML studies that use laboratory data. Typical shortcomings include incomplete specification of data provenance and context; limited capture and use of critical metadata (e.g., measurement procedures, units, and pre-analytical conditions); insufficient consideration of analytical and biological variability and their implications for uncertainty; underreporting of traceability chains and calibration hierarchies; and inadequate external validation across sites, populations, and measuring systems. These gaps hinder reproducibility, transportability, and fair comparison between models. The checklist we propose is designed to address these deficits explicitly, providing pragmatic reporting items that align AI methodology with laboratory standards and good measurement practice.
Checklist expansion: rationale and recommendations
The use of laboratory data in AI model development introduces specific methodological considerations that are often underrepresented in general-purpose AI checklists and datasets. Laboratory medicine operates within a well-defined total testing process (TTP), with structured workflows, traceable data generation, and standardized measurement systems [13]. Yet, real-world laboratory data are also subject to biological and analytical variability, site-specific practices, and complex metadata structures that affect both model training and generalizability.
Building on this objective, the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Committee on Digitalisation and Artificial Intelligence (C-AI) has identified a set of laboratory-specific additions designed to complement the existing ChAMAI checklist, thereby addressing methodological gaps related to the unique nature of laboratory data (Figure 1, Table 1) [7]. These additions aim to capture the specificity of laboratory data, including data complexity, the role of pre-analytical [11], analytical [14] and biological variation [15], the necessity of test harmonization [16], and the importance of metadata and peridata in ensuring reproducibility and interoperability [17]. The proposal builds on recent evidence from methodological and applied studies in laboratory AI and reflects a consensus among laboratory medicine and data science experts. These additions are structured according to the same CRISP-DM methodology used in the original ChAMAI tool and focus on aspects that are critical when laboratory data are used as input features for AI algorithms. This proposal is intended to function as a complementary extension to the ChAMAI checklist, enhancing its applicability to laboratory medicine without replacing its original structure or intent. We adopt a principle of proportionality: selected items identify an “At minimum” core that most projects can provide, with “Additionally” elements recommended when feasible; where full reporting is not possible, limitations should be stated explicitly.

Figure 1: Graphical representation of the ChAMAI checklist (checklist for the assessment of medical AI). The six additional items introduced by the EFLM Committee on Digitalisation and Artificial Intelligence are highlighted to indicate their integration into the existing framework. These additions aim to account for the specific characteristics of laboratory data and support the development of robust, transparent, and generalizable AI/ML models in laboratory medicine.
Each item is accompanied by a justification grounded in recent scientific literature, including methodological reviews, multicenter applications, and position papers published by experts in laboratory medicine and biomedical data science. The following paragraphs provide a detailed rationale for each proposed addition.
Although the Business Understanding phase from the original ChAMAI checklist remains fully applicable, its relevance can be further contextualized within laboratory medicine. In this domain, AI/ML models may be designed to support specific clinical and operational objectives, such as support of the pre-analytical phase (e.g., flagging urgent or inappropriate samples), detection of analytical anomalies (e.g., quality control failures or instrument drift), or post-analytical interpretation support (e.g., identifying atypical biomarker patterns). Clearly defining these goals at the outset is critical to ensure that AI applications are aligned with the workflows, priorities, and quality requirements of laboratory practice.
1 Type of laboratory data described
Laboratory test results are not mere numerical outputs, but rather the final products of complex pre-analytical, analytical, and post-analytical processes that may vary widely across institutions, methods, and devices, ultimately reflecting a patient’s (patho-)physiological levels of proteins, metabolites, or other biomolecules. Differences in these pre-analytical, analytical, and post-analytical processes have significant implications for the development and generalizability of AI models. Many laboratory analytes – particularly in the fields of clinical chemistry and immunochemistry – exhibit method dependency, meaning that their measured values can differ substantially depending on the analytical principle, the reagent system, the calibrator used, and the instrumentation.
For example, consider the lack of specificity associated with Jaffe methods compared to enzymatic assays in the measurement of serum creatinine, and the substantial impact that the standardization process has had on improving the quality and comparability of its measurement [18]. Similar challenges affect other measurands, particularly those measured using immunoassays, such as hormones, cardiac biomarkers, drugs, and tumor markers. Even when analytes are nominally the same (e.g., troponin I, procalcitonin or vitamin D), assays from different manufacturers may not be interchangeable due to lack of standardization and variability in antibody specificity [19], [20], [21]. In addition, analytes within the fields of toxicology or therapeutic drug monitoring may be measured with the gold-standard method of mass spectrometry, or with the less costly and more automatable method of immunochemistry. Apart from the mentioned methodological differences, the measurement uncertainty of these two methods may differ substantially [22]. Similarly, enzymatic activities, such as alanine and aspartate aminotransferase (ALT, AST), alkaline phosphatase (ALP), or lactate dehydrogenase (LDH), are sensitive to assay conditions including temperature, pH, and buffer composition [23], [24], [25]. Consequently, values obtained from different platforms or under different analytical conditions may not be analytically equivalent, even if expressed in the same units. This is particularly relevant for immunoassay-based measurements, where results from different manufacturers (e.g., Abbott, Roche, Siemens) can diverge substantially due to differences in antibody specificity, calibration processes, and detection technologies. Therefore, indicating the method category alone (e.g., “immunoassay”) is insufficient. For meaningful interpretation and potential reproducibility, it is essential to provide detailed information about the analytical method, platform, reagent manufacturer, and, where relevant, specific assay configurations.
Equally important is the explicit reporting of the unit of measurement for each laboratory variable, as it directly influences the numerical scale and interpretability of the data. Notably, a recent initiative recommending the broad adoption of the International System of Units (SI) has been promoted, as such standardization is an essential tool for ISO 15189:2022 laboratory accreditation [26]. In some cases, values expressed in different units are not interconvertible due to fundamental differences in measurement principles. For instance, D-dimer may be reported in Fibrinogen Equivalent Units (FEU) or D-dimer Units (DDU), two non-equivalent systems with approximately a two-fold difference between them [27]; without knowing which unit is used and the specific assay calibration, no reliable conversion is possible. Another example is HbA1c, which can be reported either as a percentage (%), following the NGSP/DCCT standard, or in mmol/mol, according to IFCC recommendations [28]. While a mathematical relationship exists between the two, it depends on population-level regression equations; applying it blindly, especially in the absence of assay standardization, may introduce conversion errors. Similarly, concentrations of hormones such as insulin or parathyroid hormone (PTH) are variably reported in mass units (e.g., µg/L) or international units (e.g., mIU/L), but no universal conversion factor exists due to assay-dependent differences in molecular structure recognition and calibration [29], [30]. In such cases, unit inconsistency may lead ML models to misinterpret identical biological values as distinct or to fail to generalize across datasets. Furthermore, inconsistencies in measurement units can be particularly troublesome in AI applications, as they are often difficult to detect even during data curation, despite rigorous consistency checks conducted by experts or data scientists. Therefore, documenting the unit of measurement is not merely a formal detail, but a prerequisite for reliable data integration, model training, and cross-site reproducibility.
Despite their obvious importance, these aspects are rarely reported in ML studies using laboratory data. Recent reviews have highlighted that laboratory data used in AI models are often insufficiently characterized, with missing information on measurement methods, units, traceability, and analytical platforms [5], [9]. Such omissions hinder reproducibility, impair cross-site validation, and may lead to models that are overfitted to the idiosyncrasies of a particular laboratory environment.
In order to ensure meaningful interpretability and comparability of laboratory-derived data across studies and clinical settings, it is not sufficient to mention the analytical method category alone (e.g., “immunoassay”), as considerable variability exists across manufacturers and platforms even within the same methodological class. Therefore, the following details should be systematically documented:
The analytical method used, specifying the exact measurement principle, assay type, and manufacturer (e.g., electrochemiluminescence immunoassay, Roche Elecsys® platform);
The specific platform or analyzer including instrument model, software version, and, where possible, reagent lot;
The units of measurement (e.g., mmol/L, µg/L), preferably expressed using the Unified Code for Units of Measure (UCUM) to ensure interoperability, along with conversion factors where applicable [12].
This level of transparency is particularly critical when models are trained in multicenter settings or intended for deployment across different healthcare systems. In such cases, method-related differences can introduce systematic shifts that compromise model accuracy and reliability if not properly accounted for. Therefore, we recommend that this information be explicitly required when laboratory data are used as input features in AI model development.
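As an illustration of the level of documentation intended here, the sketch below shows how such descriptors could be captured in a machine-readable data dictionary accompanying the training dataset. The field names, the analyzer name, and the example LOINC/UCUM values are illustrative assumptions rather than a prescribed schema, and codes should be verified against the current terminology releases.

```python
# Minimal sketch (assumed field names): one data-dictionary entry describing a
# laboratory input feature at the level of detail recommended in point 1.
from dataclasses import dataclass


@dataclass
class LabFeatureDescriptor:
    name: str                  # feature name as used in the model matrix
    loinc_code: str            # test identity (verify against the current LOINC release)
    loinc_version: str
    ucum_unit: str             # unit of measure in UCUM syntax
    method: str                # measurement principle / assay type
    platform: str              # analyzer model and software version
    manufacturer: str
    reagent_lot: str | None = None
    notes: str = ""


# Example entry for serum creatinine (all values illustrative, not authoritative)
creatinine = LabFeatureDescriptor(
    name="creatinine_serum",
    loinc_code="2160-0",        # Creatinine [Mass/volume] in Serum or Plasma
    loinc_version="2.76",
    ucum_unit="mg/dL",
    method="enzymatic",         # vs. Jaffe: results are not interchangeable
    platform="ExampleAnalyzer 9000, SW 3.1",   # hypothetical instrument name
    manufacturer="ExampleCorp",
    notes="Traceable to IDMS reference method per manufacturer claim.",
)
```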
2 Variability addressed (biological and analytical)
One of the fundamental challenges in applying AI to laboratory medicine lies in the inherent imprecision of the data, which stems from both biological and analytical sources [31]. Laboratory results are subject to biological variation – including within-subject biological variation (CVI) and between-subject biological variation (CVG) components – as well as analytical variation (CVA), which arises from measurement procedures conducted under variable conditions. These sources of variation not only affect the interpretability of laboratory tests in the clinical context but also compromise the reproducibility and generalizability of AI models trained on such data.
In clinical practice, laboratory data are intrinsically “noisy”, unavoidably affected by stochastic fluctuation and procedural inconsistencies due to biological and analytical variations. Indeed, even under standardized conditions, repeated measurements in different samples from the same individual can vary due to physiological rhythms, transient metabolic states, or pathological fluctuations. Analytical variation (namely imprecision and bias), in turn, may be introduced through differences in instrumentation, reagents, calibration materials, environmental conditions, or operator technique. As Coskun highlights [31], such variability, particularly if not explicitly modeled during algorithm development, can introduce significant uncertainty and reduce model performance, especially in external validation or deployment across diverse laboratory settings. Despite its critical impact, imprecision is rarely incorporated explicitly into the design or validation of AI workflows, limiting their clinical robustness and applicability.
Recent literature has emphasized the importance of accounting for analytical and biological variation, termed “instantial variability” in mathematical terms and defined as a form of uncertainty arising from the case-dependent, context-specific nature of each data point [32]. Unlike classical notions of variation associated with population-level heterogeneity or measurement error, instantial variation encompasses the intrinsic fluctuation observed within a single subject over time or under differing conditions, reflecting both biological and analytical sources of variability. Ignoring analytical and biological variability may result in ML models that are overfitted to the training distribution and fail to generalize to new instances, even when these instances belong to the same individual measured at a different time or under slightly altered conditions. Notably, Campagner et al. have developed a mathematical model for dealing with analytical and biological variation in AI models [32].
Moreover, laboratory data often exhibit ‘soft heterogeneity’, context- and method-dependent shifts in meaning that structured terminologies (e.g., Logical Observation Identifiers Names and Codes (LOINC) for test identity/method and the Unified Code for Units of Measure (UCUM) for units) help make explicit [10], [32]. Two values referring to the same analyte may carry different clinical significance depending on pre-analytical contextual factors such as the sampling conditions (e.g., fasting vs. postprandial state, resting vs. post-exercise). These contextual elements, also referred to as peridata [17], are not always explicitly encoded in the data, yet they influence the true meaning and downstream interpretability of measurements. As demonstrated in recent studies [10], [33], standard AI models may be ill-equipped to handle such fine-grained variability, which can substantially compromise robustness, particularly in external validation or deployment scenarios. Explicitly representing this variability is therefore essential for the development of clinically trustworthy, context-aware AI systems.
To mitigate the risks associated with model miscalibration, lack of robustness, and limited generalizability in the context of laboratory medicine data, developers of AI-based clinical decision support systems should adhere to a series of best practices grounded in biostatistical principles and uncertainty-aware modeling [34]. We mention three:
Report variability metrics: Developers should quantify and disclose key sources of variability, including CVA, CVI, and CVG. The CVA can be derived from laboratory quality control or validation data; CVI and CVG can be derived from databases and the literature, e.g., the EFLM Database on Biological Variation [35], or from the laboratory’s own experiments. These variation sources should be included in model cards or data sheets to contextualize the model’s reliability and limitations. In the interpretation of serum biomarkers, significant intra-individual variation over time may lead to divergent classification outcomes if such variability is not explicitly acknowledged or controlled for.
Account for individual variation: Temporal fluctuations within individuals, including both predictable patterns (e.g., circadian rhythms) and non-predictable noise, a phenomenon also referred to as individual variation (IV), represent important sources of variability. These may reflect a combination of systematic (predictable) and random (non-predictable) components. Strategies such as repeated measurements, temporal smoothing, or longitudinal designs can improve model robustness. Failure to account for such variability may lead to misclassification, even when values fall within the biologically normal range [10], [32].
Model uncertainty explicitly: Techniques such as data augmentation reflecting known CVA/CVI, probabilistic models, or ensemble methods can capture and propagate uncertainty. These approaches have demonstrated improved resilience to IV-related perturbations [36].
Uncertainty sources and principled augmentation
In laboratory datasets, uncertainty arises from distinct components: CVA, CVI, and CVG. These have different implications for modeling. CVA can inform measurement-error–aware augmentation by perturbing inputs with zero-mean noise proportional to the measured value, parameterized by analyte- and method-specific CVA estimated from Internal Quality Control (IQC) or validation studies [31]. This probes sensitivity to realistic instrument/reagent imprecision. CVI is more appropriate for simulating within-subject longitudinal variability (e.g., trajectories across time), not for generic single-timepoint noise; CVG reflects population heterogeneity and should be addressed via sampling design and external validation. Any augmentation should (i) be parameterized per analyte (and method/procedure where relevant), (ii) document parameter provenance (e.g., IQC period, lot, instrument) and distributional assumptions, (iii) be kept strictly separate from external validation to avoid leakage, and (iv) be accompanied by uncertainty quantification for key outputs (e.g., bootstrap CIs, prediction intervals, or Bayesian credible intervals) [33], [34]. Augmentation does not remove uncertainty; it supports robustness analysis and calibration that must be transparently reported.
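As a concrete illustration of CVA-based augmentation and uncertainty reporting, the following sketch perturbs each analyte with zero-mean Gaussian noise proportional to the measured value, using analyte-specific CVA values assumed to come from IQC data for the relevant instrument and time frame. The CVA figures, function names, and parameters are illustrative; augmented rows are intended for training and robustness analysis only, never for external validation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Analyte-specific analytical CVs (fractions), assumed to be estimated from IQC
# data for the relevant instrument, lot, and time frame (illustrative values).
CVA = {"glucose": 0.023, "creatinine": 0.031, "alt": 0.048}


def augment_with_cva(X: np.ndarray, analytes: list[str], n_copies: int = 5) -> np.ndarray:
    """Return measurement-error-aware augmented copies of X (rows = samples).

    Each value x is perturbed with zero-mean Gaussian noise whose standard
    deviation is CVA * x, i.e. imprecision proportional to the measured value.
    `analytes` must list the column order of X.
    """
    sigmas = np.array([CVA[a] for a in analytes])            # per-column CVA
    copies = []
    for _ in range(n_copies):
        noise = rng.normal(loc=0.0, scale=np.abs(X) * sigmas, size=X.shape)
        copies.append(X + noise)
    return np.vstack(copies)


def bootstrap_ci(metric_fn, y_true, y_score, n_boot: int = 1000, alpha: float = 0.05):
    """Bootstrap confidence interval for a performance metric (numpy arrays expected)."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(metric_fn(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```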
Collectively, these practices promote a principled integration of measurement uncertainty into the design, evaluation, and deployment of clinical AI systems, thereby increasing their safety, interpretability, and alignment with real-world laboratory conditions.
3 Use of metadata and peridata
As described above, the clinical and analytical validity of laboratory data used in AI models cannot be properly assessed without consideration of the contextual information that accompanies the test result. In this regard, two categories of supporting data are particularly relevant: metadata and peridata. According to the recent EFLM position paper [17], metadata is defined as:
“Data derived from the testing process that describe the characteristics and the requirements that are relevant for assessing the quality and the validity of laboratory test results.”
Peridata, on the other hand, is:
“Data derived from the testing process that are relevant for the interpretation of the results within the clinical context, making that data actionable for the patients’ care.”
We recommend encoding laboratory observations with LOINC and representing measurement units with the UCUM. LOINC’s six-axis model (Component, Property, Time, System, Scale, Method) makes method and context dependence explicit [37], while UCUM provides machine-readable, unambiguous units [12]. Contextual factors (e.g., fasting status) should be represented as separate LOINC-coded observations or controlled peridata fields linked to the primary result. For related clinical concepts (e.g., disorders, procedures, findings) linked to laboratory results, we recommend using Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) [38], while continuing to use LOINC for test/observation identifiers and UCUM for units.
These elements are often available in laboratory information systems but are rarely exported, structured, or reused in the context of AI development. Metadata includes descriptors such as test name and code, analytical method, instrument identifier, units of measurement, reference intervals, calibration traceability, and timestamps. Peridata encompass variables such as sample type and quality (e.g., hemolysis, lipemia), pre-analytical conditions (e.g., time from collection to analysis), clinical setting (e.g., emergency vs. outpatient), and the circumstances under which the test was ordered.
In the absence of this structured information, laboratory results are treated as decontextualized numerical values, which risks compromising the interpretability, generalizability, and clinical usability of AI models. For instance, a troponin result obtained from a point-of-care device in the emergency department may not be directly comparable to one measured in a central laboratory using a high-sensitivity assay. Similarly, glucose values may differ depending on the delay between sampling and centrifugation or on the additives used in sample containers [39], an issue that peridata could help detect and correct for.
To enhance the semantic depth and interoperability of laboratory-derived features in AI workflows, it is essential to:
Extract, structure, and integrate key metadata and peridata fields as part of the AI development pipeline
Include them in the data curation and during the model development phases
Use them to support interpretability, auditability, and harmonization across datasets
For example, for a serum glucose result, include the LOINC code and version, the UCUM unit, the measurement method (if using a method-specific LOINC), instrument/reagent identifiers, and a linked LOINC-coded observation for fasting status.
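A minimal sketch of such a record is shown below, assuming a simple dictionary-based representation; the LOINC codes, field names, and example values are illustrative and should be checked against the current LOINC release and local conventions before use.

```python
# Illustrative metadata/peridata bundle for a single serum glucose result.
# Codes and field names are assumptions for this sketch, not a prescribed schema.
glucose_observation = {
    "result": {
        "loinc_code": "2345-7",          # Glucose [Mass/volume] in Serum or Plasma
        "loinc_version": "2.76",
        "value": 98.0,
        "ucum_unit": "mg/dL",
    },
    "metadata": {
        "method": "hexokinase",                          # analytical principle
        "analyzer_id": "CHEM-01",                        # hypothetical local identifier
        "reference_interval": {"low": 70, "high": 99, "ucum_unit": "mg/dL"},
        "collected_at": "2025-03-02T07:45:00Z",
        "analyzed_at": "2025-03-02T09:10:00Z",
    },
    "peridata": {
        "specimen": "serum",
        "fasting_status": True,          # ideally a linked, coded observation
        "hemolysis_index": 8,            # instrument-reported index (illustrative)
        "clinical_setting": "outpatient",
    },
}
```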
The inclusion of metadata and peridata also aligns with FAIR (Findable, Accessible, Interoperable, Reusable) data principles, supporting responsible data stewardship in clinical AI applications [17]; see section 6, “FAIR data sharing and feature reproducibility”, for practical reporting expectations. Their systematic incorporation can substantially enhance both scientific reproducibility and clinical relevance, thereby strengthening the translational value of AI models in laboratory medicine.
4 Harmonization and standardization of laboratory measurements
In the context of AI model development based on laboratory data, it is not sufficient to describe how each test result is generated (method, unit, instrument); it is also necessary to understand whether those results are analytically comparable across different laboratory settings [40]. One example of a strategy for improving data harmonization is the international normalized ratio (INR), which reduces between-laboratory variation regardless of the type of assay used [41].
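The INR illustrates how a simple mathematical harmonization can absorb method-dependent differences: the patient's prothrombin time is divided by the laboratory's mean normal prothrombin time and raised to the assay-specific International Sensitivity Index (ISI). The sketch below implements this standard definition with illustrative parameter values.

```python
def inr(pt_patient_s: float, mean_normal_pt_s: float, isi: float) -> float:
    """International normalized ratio: (PT_patient / MNPT) ** ISI.

    The ISI is assigned by the reagent manufacturer for a given thromboplastin
    and instrument combination, which is what makes results comparable across
    laboratories despite different assays.
    """
    return (pt_patient_s / mean_normal_pt_s) ** isi


# Illustrative example: PT 18 s, mean normal PT 12 s, ISI 1.1
print(round(inr(18.0, 12.0, 1.1), 2))   # ≈ 1.56
```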
Laboratory medicine, unlike many other domains of clinical data, has invested considerable effort in establishing standardization and harmonization protocols to reduce inter-method variability and ensure the traceability of results to reference systems [42]. However, these efforts are analyte-specific, unevenly implemented, and incomplete for many clinically relevant tests [43], [44].
Standardization refers to the establishment of metrological traceability to recognized reference methods or materials, enabling full result comparability regardless of the testing system used [45]. Harmonization, in contrast, aims to align results across different methods in the absence of a formal reference [46]. Numerous widely used analytes, such as glucose, creatinine, HbA1c, and cholesterol, have established metrological traceability and are generally suitable for multicenter AI model training [47], [48]. In contrast, many immunoassay-based measurements (e.g., TSH, troponins, BNP, vitamin D) still suffer from significant method-dependent variability due to differences in reagent composition, antibody specificity, and calibration practices [49], [50].
Another example is the measurement of Lp(a). While current evidence demonstrates that increased Lp(a) is a causal risk factor for cardiovascular outcomes, care must be taken in the interpretation of Lp(a) results because Lp(a) assays are not yet internationally standardized [51], [52]. In addition, two different measurement units are currently in use: one expresses Lp(a) particle number in nmol/L, the other Lp(a) mass in mg/dL. Since the mass of Lp(a) particles is variable, a direct conversion between the two can only be an approximation [53]. The recent EuBIVAS (European Biological Variation Study) provided robust evidence confirming the substantial biological and methodological variability in Lp(a) measurement and reinforced the need for updated, standardized analytical performance specifications to ensure clinical utility [52].
This variability may go unnoticed in single-center studies but becomes critical in multicenter research, federated learning – a privacy-preserving approach that trains models across sites by exchanging model updates rather than raw data – or real-world implementation, where both semantic and procedural proximity between datasets can materially affect observed performance [54], [55]. A model trained on non-harmonized data may inadvertently capture local analytical biases instead of biologically meaningful patterns, limiting its clinical reliability and portability. From a clinical perspective, the lack of harmonization forces patients to rely on a single laboratory over time to ensure result reproducibility, which may compromise continuity of care, limit diagnostic flexibility, and exacerbate healthcare inequities. Ensuring inter-laboratory comparability is therefore not only a methodological necessity, but also a clinical and ethical imperative.
Only by assessing whether a laboratory feature can be meaningfully compared across environments can we ensure the safe deployment and equitable performance of AI systems in laboratory medicine.
In addressing the complexity of laboratory data for AI/ML applications, it is essential to recognize that metrological traceability alone may be insufficient to ensure meaningful analytical comparability across settings. In fact, even when traceability is declared by the manufacturer, results may remain non-interchangeable due to method-dependent differences in performance, including matrix effects and variable uncertainty propagation in routine clinical use [56]. This perspective aligns with long-standing concerns in laboratory medicine that high-level harmonization efforts, while conceptually rigorous, may not fully capture the analytical heterogeneity observed in real-world conditions, particularly in multi-analyzer or multicenter environments. Thus, the inclusion of explicit documentation on analytical methods, uncertainty estimation, and harmonization strategies is not merely a best practice but a prerequisite for building trustworthy, generalizable AI models in laboratory medicine.
To mitigate these risks, we recommend that studies developing AI models using laboratory data:
Explicitly state whether the analytes used are standardized or harmonized, and describe the level of traceability to reference systems;
Provide information on calibration hierarchies, traceability chains, or use of certified reference materials;
Identify analytes lacking harmonization as method-dependent and validate associated models accordingly;
Document terminology mapping strategies (e.g., local-to-standard code conversions), while taking into account ongoing initiatives by EFLM Committees on the Exchange of Laboratory Data [57], aimed at advancing semantic interoperability through frameworks such as LOINC and SNOMED CT.
This information builds upon the descriptive requirements outlined in point 1 but adds an essential dimension of inter-site comparability.
5 External validation with attention to dataset similarity and cardinality
External validation is widely recognized as a critical step in the evaluation of AI models in healthcare [3], [58], [59]. It provides evidence that a model performs well not only on the data it was trained on but also on independent datasets that reflect real-world variability. In laboratory medicine, this step is especially important due to the heterogeneity of testing practices, instrumentation, population characteristics, pre-analytical conditions, and all other above-mentioned variables across institutions [33], [50].
Despite its importance, external validation is often overlooked or inadequately reported in studies using laboratory data for AI development. Moreover, when validation is performed, little attention is typically paid to the semantic and procedural similarity between training and validation datasets, which is a key factor in interpreting validation results. Here, we use semantic proximity to denote the degree to which variables have the same meaning across datasets – encompassing analyte definition/coding (e.g., LOINC), units (e.g., UCUM), reference intervals, and relevant contextual factors (e.g., fasting status) – and procedural proximity to denote the similarity of data-generating procedures, including pre-analytical conditions, measurement methods, instruments/reagents, and calibration/traceability practices; by cardinality we mean the size and structure of the datasets (e.g., numbers of subjects, episodes, and measurements per subject, class prevalence, and missingness). As highlighted in recent methodological work, including the concept of meta-validation proposed by Cabitza et al., both the cardinality (sample size and structure) and the semantic and procedural proximity between datasets must be considered to assess whether a validation is genuinely informative [10].
Laboratory data are influenced by local, institution-specific practices that affect feature distributions in non-trivial ways. Even standardized tests may differ in reference intervals, calibration, or measurement traceability. Without detailed documentation and comparison of these parameters, performance metrics derived from external validation may be misleading, either by inflating model robustness when datasets are overly similar, or by underestimating generalizability when measurement systems are poorly harmonized.
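A lightweight way to make such comparisons explicit is to report per-analyte distributional overlap between the development and external validation cohorts. The sketch below computes the two-sample Kolmogorov–Smirnov statistic and the difference in missingness per shared feature, assuming both cohorts are available as pandas DataFrames with harmonized column names (an assumption that itself needs to be documented).

```python
import pandas as pd
from scipy.stats import ks_2samp


def shift_report(dev: pd.DataFrame, ext: pd.DataFrame) -> pd.DataFrame:
    """Per-feature summary of distributional shift between two cohorts.

    Reports the two-sample Kolmogorov-Smirnov statistic (0 = identical
    empirical distributions, 1 = disjoint) and the proportion of missing
    values in each cohort, for every shared numeric column.
    """
    rows = []
    for col in dev.columns.intersection(ext.columns):
        if not pd.api.types.is_numeric_dtype(dev[col]):
            continue
        a, b = dev[col].dropna(), ext[col].dropna()
        if a.empty or b.empty:
            continue
        ks_stat, p_value = ks_2samp(a, b)
        rows.append({
            "feature": col,
            "ks_statistic": ks_stat,
            "p_value": p_value,
            "missing_dev": dev[col].isna().mean(),
            "missing_ext": ext[col].isna().mean(),
        })
    return pd.DataFrame(rows).sort_values("ks_statistic", ascending=False)
```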
In addition to the items already included in the original validation checklist [7], we recommend that studies using laboratory data:
Clearly specify whether external validation was performed and describe the selection criteria and characteristics of the validation dataset
Report on analytical comparability between datasets, including methods, units, and instruments used
Characterize population overlap and distributional shifts between development and validation cohorts
Consider adopting meta-validation strategies to assess the relevance and representativeness of the validation cohort relative to the development data
These practices are essential for developing AI systems that are not only accurate under controlled conditions but also resilient to the variability introduced by real-world clinical laboratory environments. In line with EFLM priorities, attention to external validation is not only a scientific imperative, but a practical necessity for ensuring the clinical applicability of AI tools in laboratory medicine.
6 FAIR data sharing and feature reproducibility
Transparency and reproducibility are foundational principles in scientific research and are particularly critical in the development of AI models intended for clinical use. While much attention is given to model architecture and performance metrics, the data processing workflows – including how laboratory features are curated, transformed, and documented – often remain insufficiently described. This lack of transparency poses a significant barrier to model replication, regulatory review, and clinical adoption.
The application of the FAIR principles – Findable, Accessible, Interoperable, Reusable – to laboratory data is still emerging but is increasingly recognized as essential for robust and trustworthy AI development, although practical adoption remains variable [60]. Although this field is still in its early stages, a growing number of frameworks are starting to support FAIR-aligned data practices, including the use of cloud-based platforms for data management, integration, and sharing [61], [62]. FAIRness goes beyond simply sharing raw data: it encompasses the ability to trace how variables were included in the model, how outliers were handled, how missing data were imputed, and whether domain-specific transformations (e.g., normalization to reference intervals, log transformations for skewed distributions) were applied. In the laboratory context, where each variable may carry implicit assumptions related to measurement units, calibration, or biological variation, detailed documentation of the data pipeline is essential for interpretability and reproducibility.
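One concrete way to make these steps traceable is to record every transformation together with its parameters as the feature matrix is built; the sketch below shows this pattern under assumed function and field names, using the reference-interval normalization and log transform mentioned above as examples.

```python
import numpy as np
import pandas as pd

provenance: list[dict] = []   # machine-readable log shipped alongside the model


def log_step(feature: str, step: str, **params) -> None:
    """Append one documented transformation to the provenance log."""
    provenance.append({"feature": feature, "step": step, "params": params})


def normalize_to_reference_interval(x: pd.Series, low: float, high: float) -> pd.Series:
    """Scale a result so that the reference interval maps to [0, 1]."""
    log_step(str(x.name), "reference_interval_normalization", low=low, high=high)
    return (x - low) / (high - low)


def log_transform(x: pd.Series) -> pd.Series:
    """Natural log for right-skewed analytes (e.g., CRP, triglycerides)."""
    log_step(str(x.name), "log_transform", base="e")
    return np.log(x)

# After building the feature matrix, persist `provenance` (e.g., as JSON)
# next to the model artifacts so every derived feature remains auditable.
```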
Importantly, this point does not duplicate the requirement to describe laboratory data characteristics (as in points 1 and 4), nor does it address inter-laboratory comparability (as in point 5). Instead, it focuses on ensuring that the entire data processing and modeling pipeline can be audited and reused, whether by external researchers, regulators, or clinical collaborators.
To meet this objective, we recommend the following:
Provide comprehensive documentation of all preprocessing steps, transformations, and feature derivations used during model development
Make available, where ethically and legally permissible, code repositories, data dictionaries, and mapping files (e.g., local-to-standard code conversions)
Ensure that the provenance of each variable – from raw measurement to model input – is clearly traceable
Where direct data sharing is not feasible, supply synthetic datasets, anonymized test cases, or detailed mock schemas to illustrate data structure and preprocessing logic
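Where the data themselves cannot be shared, a mock schema such as the following sketch (column names, codes, and example rows are entirely synthetic) can still convey variable naming, coding, and the metadata/peridata structure to external reviewers.

```python
import pandas as pd

# Synthetic mock schema: no real patient data, only structure and naming.
mock_rows = [
    {
        "patient_pseudo_id": "P0001",
        "collected_at": "2025-01-15T08:02:00Z",
        "glucose_serum__2345_7__mg_dL": 97.0,    # LOINC and UCUM encoded in the name
        "creatinine_serum__2160_0__mg_dL": 0.9,
        "fasting_status": True,                  # peridata field
        "hemolysis_index": 5,                    # peridata field
        "analyzer_id": "CHEM-01",                # metadata field
    },
]
mock_schema = pd.DataFrame(mock_rows)
print(mock_schema.dtypes)                        # share structure, not data
```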
By adhering to these practices, developers align with current expectations for open science and responsible AI [60]. Moreover, such transparency fosters collaboration, facilitates regulatory review, and accelerates scientific progress by enabling others to build upon validated laboratory-AI pipelines.
Conclusions
As interest in artificial intelligence grows within laboratory medicine, the development of methodologically sound studies and clinically reliable models must be guided by standards that reflect the specificities and constraints of laboratory data generation and interpretation. While existing AI checklists offer valuable general guidance, they often overlook the distinctive features of the total testing process, the analytical infrastructure, and the semantic richness of laboratory-derived information.
In this Position Paper, the EFLM C-AI proposes a set of six domain-specific additions to an established AI/ML checklist, with the goal of addressing core challenges related to test variability, method dependency, metadata integration, inter-laboratory harmonization, external validation, and FAIR data practices. These items are grounded in recent methodological literature and reflect expert consensus at the intersection of laboratory medicine and biomedical data science.
The proposed extensions to the original checklist from Cabitza & Campagner [7] are intended to improve the reproducibility, interoperability, and clinical applicability of AI models trained on laboratory data, ensuring that derived tools are both scientifically rigorous and fit for use in real-world healthcare systems. By embracing these recommendations, developers can support the development of AI solutions that are not only technically accurate but also context-aware, ethically grounded, and resilient to variation across time, space, and clinical settings. Ultimately, the goal is to facilitate careful evaluation and, where appropriate, implementation of ML models in laboratory medicine.
We recognize that richer reporting imposes additional effort, particularly for teams new to laboratory standards. However, the up-front cost improves reproducibility, accelerates cross-site validation and harmonization, and reduces downstream rework – benefits that repay the initial investment.
We invite researchers, journal editors, peer reviewers, and regulatory authorities to consider incorporating this checklist into their evaluation criteria, thereby promoting a shared framework for trustworthy and transparent AI in laboratory medicine.
Acknowledgments
We acknowledge the use of ChatGPT (OpenAI, San Francisco, CA, USA) to support the preparation of this manuscript. The tool was employed for assistance in language refinements and structural editing. All scientific content and interpretations remain the sole responsibility of the authors.
- Research ethics: Not applicable.
- Informed consent: Not applicable.
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission. Anna Carobene: conceived the manuscript; coordinated the development of the extended checklist; drafted the initial version of the manuscript; integrated feedback from all co-authors. Andrea Padoan: provided substantial critical revisions and conceptual guidance; contributed to the methodological refinement of checklist items specific to laboratory medicine. Janne Cadamuro: critically revised the manuscript for intellectual content; contributed to the alignment of checklist elements with laboratory workflows and clinical relevance. Federico Cabitza: contributed significantly to the theoretical framework and conceptual foundations of the manuscript; offered critical revisions and ensured methodological coherence with the original ChAMAI checklist. Glynis Frans, Hanoch Goldshmidt, Zeljiko Debeljak, Sander De Bruyne, William van Doorn, Johannes Elias, Habib Özdemir, Salomon Martin Perez, Helena Lame, Alexander Tolios: participated in the original brainstorming sessions and working group discussions within the EFLM Committee on Digitalisation and Artificial Intelligence; contributed expert feedback and final approval of the manuscript content. All authors: reviewed and approved the final version of the manuscript and agreed to be accountable for all aspects of the work.
- Use of Large Language Models, AI and Machine Learning Tools: ChatGPT (OpenAI, San Francisco, CA, USA) was used to support the preparation of this manuscript.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: None declared.
- Data availability: Not applicable.
References
1. Pillay, T, Topcu, D, Yenice, S. Harnessing AI for enhanced evidence-based laboratory medicine. Clin Chim Acta 2025;569:120181. https://doi.org/10.1016/j.cca.2025.120181.Suche in Google Scholar PubMed
2. Çubukçu, H, Topcu, D, Yenice, S. Machine learning-based clinical decision support using laboratory data. Clin Chem Lab Med 2023;62:793–823. https://doi.org/10.1515/cclm-2023-1037.Suche in Google Scholar PubMed
3. Spies, N, Farnsworth, C, Wheeler, S, McCudden, C. Validating, implementing, and monitoring machine learning solutions in the clinical laboratory safely and effectively. Clin Chem 2024;70:1334–43. https://doi.org/10.1093/clinchem/hvae126.Suche in Google Scholar PubMed
4. Cadamuro, J, Carobene, A, Cabitza, F, Debeljak, Z, De Bruyne, S, van Doorn, W, et al.. A comprehensive survey of artificial intelligence adoption in European laboratory medicine: current utilization and prospects. Clin Chem Lab Med 2024;63:692–703. https://doi.org/10.1515/cclm-2024-1016.Suche in Google Scholar PubMed
5. Carobene, A, Milella, F, Famiglini, L, Cabitza, F. How is test laboratory data used and characterised by machine learning models? A systematic review of diagnostic and prognostic models developed for COVID-19 patients using only laboratory data. Clin Chem Lab Med 2022;60:1887–901. https://doi.org/10.1515/cclm-2022-0182.Suche in Google Scholar PubMed
6. Yang, H, Rhoads, D, Sepulveda, J, Zang, C, Chadburn, A, Wang, F. Building the model. Arch Pathol Lab Med 2023;147:826–36. https://doi.org/10.5858/arpa.2021-0635-RA.Suche in Google Scholar PubMed PubMed Central
7. Cabitza, F, Campagner, A. The need to separate the wheat from the chaff in medical informatics: introducing a comprehensive checklist for the -assessment of medical AI studies. Int J Med Inf 2021;153:104510. https://doi.org/10.1016/j.ijmedinf.2021.104510.Suche in Google Scholar PubMed
8. Wirth, R, Hipp, J. CRISP-DM: towards a standard process model for data mining. In: Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining. London, UK: Springer-Verlag; 2000, vol 1.Suche in Google Scholar
9. Agnello, L, Vidali, M, Padoan, A, Lucis, R, Mancini, A, Guerranti, R, et al.. Machine learning algorithms in sepsis. Clin Chim Acta 2024;553:117738. https://doi.org/10.1016/j.cca.2023.117738.Suche in Google Scholar PubMed
10. Cabitza, F, Campagner, A, Soares, F, García, D, Challa, F, Sulejmani, A, et al.. The importance of being external. Methodological insights for the external validation of machine learning models in medicine. Comput Methods Progr Biomed 2021;208:106288. https://doi.org/10.1016/j.cmpb.2021.106288.Suche in Google Scholar PubMed
11. Banfi, G, Lippi, G. The impact of preanalytical variability in clinical trials: are we underestimating the issue? Ann Transl Med 2016;4:59. https://doi.org/10.3978/j.issn.2305-5839.2016.01.22.Suche in Google Scholar PubMed PubMed Central
12. Schadow & McDonald. UCUM v2.2 2024; 2017. https://ucum.org/ucum [Accessed 21 June 2025].Suche in Google Scholar
13. Plebani, M, Laposata, M, Lundberg, G. The brain-to-brain loop concept for laboratory testing 40 years after its introduction. Am J Clin Pathol 2011;136:829–33. https://doi.org/10.1309/AJCPR28HWHSSDNON.Suche in Google Scholar PubMed
14. Reed, G, Lynn, F, Meade, B. Use of coefficient of variation in assessing variability of quantitative assays. Clin Diagn Lab Immunol 2002;9:1235–9. https://doi.org/10.1128/cdli.9.6.1235-1239.2002. Erratum in: Clin Diagn Lab Immunol 2003;10:1162.Suche in Google Scholar PubMed PubMed Central
15. Sandberg, S, Carobene, A, Bartlett, B, Coskun, A, Fernandez-Calle, P, Jonker, N, et al.. Biological variation: recent development and future challenges. Clin Chem Lab Med 2022;61:741–50. https://doi.org/10.1515/cclm-2022-1255.Suche in Google Scholar PubMed
16. Miller, W, Greenberg, N. Harmonization and standardization: where are we now? J Appl Lab Med 2021;6:510–21. https://doi.org/10.1093/jalm/jfaa189.Suche in Google Scholar PubMed
17. Padoan, A, Cadamuro, J, Frans, G, Cabitza, F, Tolios, A, De, B, et al.. Data flow in clinical laboratories: could metadata and peridata bridge the gap to new AI-based applications? Clin Chem Lab Med 2024;63:684–91. https://doi.org/10.1515/cclm-2024-0971.Suche in Google Scholar PubMed
18. Carobene, A, Ceriotti, F, Infusino, I, Frusciante, E, Panteghini, M. Evaluation of the impact of standardization process on the quality of serum creatinine determination in Italian laboratories. Clin Chim Acta 2014;427:100–6. https://doi.org/10.1016/j.cca.2013.10.001.
19. Tate, J, Bunk, D, Christenson, R, Barth, J, Katrukha, A, Noble, J, et al. Evaluation of standardization capability of current cardiac troponin I assays by a correlation study: results of an IFCC pilot project. Clin Chem Lab Med 2015;53:677–90. https://doi.org/10.1515/cclm-2014-1197.
20. Atef, S. Vitamin D assays in clinical laboratory: past, present and future challenges. J Steroid Biochem Mol Biol 2018;175:136–7. https://doi.org/10.1016/j.jsbmb.2017.02.011.
21. Ceriotti, F, Marino, I, Motta, A, Carobene, A. Analytical evaluation of the performances of Diazyme and BRAHMS procalcitonin applied to Roche Cobas in comparison with BRAHMS PCT-sensitive Kryptor. Clin Chem Lab Med 2017;56:162–9. https://doi.org/10.1515/cclm-2017-0159.
22. Cattaneo, D, Panteghini, M. Analytical performance specifications for measurement uncertainty in therapeutic monitoring of immunosuppressive drugs. Clin Chem Lab Med 2023;62:e81–3. https://doi.org/10.1515/cclm-2023-1063.
23. Talli, I, Marchioro, L, Zaninotto, M, Cosma, C, Pangrazzi, E, Artusi, C, et al. Measurement of AST and ALT with Pyridoxal-5’-Phosphate according to IFCC: a decades-long gap seems to be filled. Clin Chim Acta 2025;569:120158. https://doi.org/10.1016/j.cca.2025.120158.
24. Cattozzo, G, Albeni, C, Franzini, C. Harmonization of values for serum alkaline phosphatase catalytic activity concentration employing commutable calibration materials. Clin Chim Acta 2010;411:882–5. https://doi.org/10.1016/j.cca.2010.03.008.
25. Cattozzo, G, Guerra, E, Ceriotti, F, Franzini, C. Commutable calibrator with value assigned by the IFCC reference procedure to harmonize serum lactate dehydrogenase activity results measured by 2 different methods. Clin Chem 2008;54:1349–55. https://doi.org/10.1373/clinchem.2007.100081.
26. Zaninotto, M, Agnello, L, Dukic, L, Šálek, T, Linko-Parvinen, A, Kalaria, T, et al. Is it feasible for European laboratories to use SI units in reporting results? Clin Chem Lab Med 2025;63:1279–85. https://doi.org/10.1515/cclm-2025-0113.
27. Favaloro, E, Thachil, J. Reporting of D-dimer data in COVID-19: some confusion and potential for misinformation. Clin Chem Lab Med 2020;58:1191–9. https://doi.org/10.1515/cclm-2020-0573.
28. Panteghini, M, John, W. Implementation of haemoglobin A1c results traceable to the IFCC reference system: the way forward. Clin Chem Lab Med 2007;45:942–4. https://doi.org/10.1515/CCLM.2007.198.
29. Rodríguez-Cabaleiro, D, Van Uytfanghe, K, Stove, V, Fiers, T, Thienpont, L. Pilot study for the standardization of insulin immunoassays with isotope dilution liquid chromatography/tandem mass spectrometry. Clin Chem 2007;53:1462–9. https://doi.org/10.1373/clinchem.2007.088393.
30. Cavalier, E. Parathyroid hormone results interpretation in the background of variable analytical performance. J Lab Precis Med 2019;4:1. https://doi.org/10.21037/jlpm.2018.12.03.
31. Coskun, A. Are your laboratory data reproducible? The critical role of imprecision from replicate measurements to clinical decision-making. Ann Lab Med 2025;45:259–71. https://doi.org/10.3343/alm.2024.0569.
32. Campagner, A, Famiglini, L, Carobene, A, Cabitza, F. Everything is varied: the surprising impact of instantial variation on ML reliability. Appl Soft Comput 2023;146:110644. https://doi.org/10.1016/j.asoc.2023.110644.
33. Campagner, A, Agnello, L, Carobene, A, Padoan, A, Del Ben, F, Locatelli, M, et al. Complete blood count and monocyte distribution width-based machine learning algorithms for sepsis detection: multicentric development and external validation study. J Med Internet Res 2025;27:e55492. https://doi.org/10.2196/55492.
34. Marconi, L, Cabitza, F. Show and tell: a critical review on robustness and uncertainty for a more responsible medical AI. Int J Med Inf 2025;202:105970. https://doi.org/10.1016/j.ijmedinf.2025.105970.
35. Aarsand, AK, Webster, C, Fernandez-Calle, P, Jonker, N, Diaz-Garzon, J, Coskun, A, et al. The EFLM biological variation database. https://biologicalvariation.eu/ [Accessed 21 June 2025].
36. Campagner, A, Biganzoli, EM, Balsano, C, Cereda, C, Cabitza, F. Modeling unknowns: a vision for uncertainty-aware machine learning in healthcare. Int J Med Inf 2025;203:106014. [Epub ahead of print] https://doi.org/10.1016/j.ijmedinf.2025.106014.
37. Regenstrief Institute. LOINC users’ guide. Indianapolis: Regenstrief Institute. https://www.regenstrief.org/real-world-solutions/loinc/ [Accessed 11 August 2025].
38. SNOMED International. SNOMED CT starter guide. London: SNOMED International. https://docs.snomed.org/snomed-ct-practical-guides [Accessed 11 August 2025].
39. Lippi, G, Nybo, M, Cadamuro, J, Guimaraes, J, van Dongen-Lases, E, Simundic, A. Blood glucose determination: effect of tube additives. Adv Clin Chem 2018;84:101–23. https://doi.org/10.1016/bs.acc.2017.12.003.
40. Miller, WG, Myers, G, Gantzer, ML, Kahn, S, Schönbrunner, E, Thienpont, L, et al. Roadmap for harmonization of clinical laboratory measurement procedures. Clin Chem 2011;57:1108–17. https://doi.org/10.1373/clinchem.2011.164012.
41. Meijer, P, Kynde, K, van den Besselaar, A, Van Blerk, M, Woods, T. International normalized ratio (INR) testing in Europe: between-laboratory comparability of test results obtained by Quick and Owren reagents. Clin Chem Lab Med 2018;56:1698–703. https://doi.org/10.1515/cclm-2017-0976.
42. Vesper, H, Myers, G, Miller, W. Current practices and challenges in the standardization and harmonization of clinical laboratory tests. Am J Clin Nutr 2016;104:907S–12S. https://doi.org/10.3945/ajcn.115.110387.
43. Panteghini, M, Krintus, M. Establishing, evaluating and monitoring analytical quality in the traceability era. Crit Rev Clin Lab Sci 2025;62:148–81. https://doi.org/10.1080/10408363.2024.2434562.
44. Plebani, M, Lippi, G. Standardization and harmonization in laboratory medicine: not only for clinical chemistry measurands. Clin Chem Lab Med 2022;61:185–7. https://doi.org/10.1515/cclm-2022-1122.
45. Joint Committee for Guides in Metrology (JCGM). International vocabulary of metrology – basic and general concepts and associated terms (VIM), 3rd ed. JCGM 200:2012; 2012.
46. Plebani, M. Harmonization in laboratory medicine: the complete picture. Clin Chem Lab Med 2013;51:741–51. https://doi.org/10.1515/cclm-2013-0075.
47. Panteghini, M. Traceability as a unique tool to improve standardization in laboratory medicine. Clin Biochem 2009;42:236–40. https://doi.org/10.1016/j.clinbiochem.2008.09.098.
48. Little, R, Rohlfing, C, Wiedmeyer, H, Myers, G, Sacks, D, Goldstein, D, et al. The national glycohemoglobin standardization program: a five-year progress report. Clin Chem 2001;47:1985–92. PMID: 11673367.
49. Clerico, A, Zaninotto, M, Padoan, A, Ndreu, R, Musetti, V, Masotti, S, et al. Harmonization of two hs-cTnI methods based on recalibration of measured quality control and clinical samples. Clin Chim Acta 2020;510:150–6. https://doi.org/10.1016/j.cca.2020.07.009.
50. Zaninotto, M, Graziani, M, Plebani, M. The harmonization issue in laboratory medicine: the commitment of CCLM. Clin Chem Lab Med 2022;61:721–31. https://doi.org/10.1515/cclm-2022-1111.
51. Kronenberg, F, Mora, S, Stroes, ESG, Ference, BA, Arsenault, BJ, Berglund, L, et al. Lipoprotein(a) in atherosclerotic cardiovascular disease and aortic stenosis: a European Atherosclerosis Society consensus statement. Eur Heart J 2022;43:3925–46. https://doi.org/10.1093/eurheartj/ehac361.
52. Clouet-Foraison, N, Marcovina, SM, Guerra, E, Aarsand, AK, Coşkun, A, Díaz-Garzón, J, et al. Analytical performance specifications for Lipoprotein(a), Apolipoprotein B-100, and Apolipoprotein A-I using the biological variation model in the EuBIVAS population. Clin Chem 2020;66:727–36. https://doi.org/10.1093/clinchem/hvaa054.
53. Kronenberg, F, Mora, S, Stroes, ESG, Ference, BA, Arsenault, BJ, Berglund, L, et al. Frequent questions and responses on the 2022 lipoprotein(a) consensus statement of the European Atherosclerosis Society. Atherosclerosis 2023;374:107–20. https://doi.org/10.1016/j.atherosclerosis.2023.04.012.
54. McMahan, HB, Moore, E, Ramage, D, Hampson, S, Arcas, BA. Communication-efficient learning of deep networks from decentralized data. Proc 20th Int Conf Artif Intell Statistics (AISTATS), PMLR 2017;54:1273–82. https://doi.org/10.48550/arXiv.1602.05629.
55. Kairouz, P, McMahan, HB, Avent, B, Bellet, A, Bennis, M, Bhagoji, AN, et al. Advances and open problems in federated learning. Found Trends Mach Learn 2021;14:1–210. https://doi.org/10.1561/2200000083.
56. Çubukçu, HC, Thelen, M, Plebani, M, Panteghini, M. IFCC recommendations for internal quality control practice: a missed opportunity. Clin Chem Lab Med 2025;63:1693–17. https://doi.org/10.1515/cclm-2025-0486.
57. EFLM Committee. Exchange of laboratory data. https://www.eflm.eu/site/who-we-are/divisions/science-division/fu/tc-exchange-laboratory-data [Accessed 24 June 2025].
58. Master, S, Badrick, T, Bietenbeck, A, Haymond, S. Machine learning in laboratory medicine: recommendations of the IFCC working group. Clin Chem 2023;69:690–8. https://doi.org/10.1093/clinchem/hvad055.
59. Campagner, A, Carobene, A, Cabitza, F. External validation of machine learning models for COVID-19 detection based on complete blood count. Health Inf Sci Syst 2021;9:37. https://doi.org/10.1007/s13755-021-00167-3.
60. Blatter, TU, Witte, H, Nakas, CT, Leichtle, AB. Big data in laboratory medicine – FAIR quality for AI? Diagnostics 2022;12:1923. https://doi.org/10.3390/diagnostics12081923.
61. Inau, ET, Dedié, A, Anastasova, I, Schick, R, Zdravomyslov, Y, Fröhlich, B, et al. The journey to a FAIR CORE DATA SET for diabetes research in Germany. Sci Data 2024;11:1159. https://doi.org/10.1038/s41597-024-03882-0.
62. Abaza, H, Shutsko, A, Klopfenstein, SAI, Vorisek, CN, Schmidt, CO, Brünings-Kuppe, C, et al. From raw data to FAIR data: the FAIRification workflow for health research. Methods Inf Med 2020;59:e21–32. https://doi.org/10.1055/s-0040-1713684.