Abstract
Objectives
Accurate medical laboratory reports are essential for delivering high-quality healthcare. Recently, advanced artificial intelligence models, such as those in the ChatGPT series, have shown considerable promise in this domain. This study assessed the performance of three GPT models (GPT-4o, GPT-o1, and GPT-o1 mini) in identifying errors within medical laboratory reports and in providing treatment recommendations.
Methods
In this retrospective study, 86 nucleic acid test reports for seven upper respiratory tract pathogens were compiled. A total of 285 errors from four common error categories were intentionally and randomly introduced into these reports, generating 86 incorrect reports. GPT models were tasked with detecting these errors, with three senior medical laboratory scientists (SMLS) and three medical laboratory interns (MLI) serving as control groups. Additionally, GPT models were tasked with generating accurate and reliable treatment recommendations for positive test results based on the 86 corrected reports. χ² tests, Kruskal-Wallis tests, and Wilcoxon tests were used for statistical analysis where appropriate.
Results
Compared with SMLS and MLI, GPT models accurately detected three error types; the average detection rates of the three GPT models were 88.9 % (omission), 91.6 % (time sequence), and 91.7 % (the same individual acting as both inspector and reviewer). However, the average detection rate of the three GPT models for errors in the result input format was only 51.9 %, indicating relatively poor performance on this error type. GPT models exhibited substantial to almost perfect agreement with SMLS in detecting total errors (kappa [min, max]: 0.778, 0.837), whereas their agreement with MLI was somewhat lower (kappa [min, max]: 0.632, 0.696). Across all 86 reports, GPT models required significantly less reading time than SMLS or MLI (all p<0.001). Notably, GPT-o1 mini identified errors more consistently than GPT-o1, which in turn was more consistent than GPT-4o. Pairwise comparisons of each GPT model's outputs across three repeated runs showed almost perfect agreement (kappa [min, max]: 0.912, 0.996). GPT-o1 mini required significantly less reading time than GPT-4o or GPT-o1 (all p<0.001). Additionally, GPT-o1 significantly outperformed GPT-4o and GPT-o1 mini in providing accurate and reliable treatment recommendations (all p<0.0001).
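The kappa values reported above measure agreement beyond chance between two raters. As a minimal sketch only, and not the authors' analysis code, the following shows how Cohen's kappa could be computed for two raters who each mark the same set of items as error (1) or no error (0). The function names `cohen_kappa` and `landis_koch_label` and the toy data are hypothetical. Mapping kappa to descriptive labels with the Landis and Koch bands is an assumption, although the wording "substantial" and "almost perfect" in the abstract is consistent with those bands.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Descriptive label for kappa using the Landis and Koch bands (assumed here)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy data (hypothetical, not from the study): 1 = error flagged, 0 = not flagged.
model_flags = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
expert_flags = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa(model_flags, expert_flags)
print(f"kappa = {kappa:.3f} ({landis_koch_label(kappa)})")
```

In the toy data the raters agree on 8 of 10 items (observed agreement 0.80) while chance agreement is 0.52, so kappa is well below the raw agreement rate; this correction for chance is why kappa rather than simple percent agreement is typically reported.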
Conclusions
GPT models were competent at detecting some types of medical laboratory report errors and at providing accurate and reliable treatment recommendations, and could potentially reduce work hours and enhance clinical decision-making.
Funding source: National College Students’ Innovation and Entrepreneurship Training Program
Award Identifier / Grant number: 202210368023
Funding source: Health Research Foundation of Anhui Province
Award Identifier / Grant number: AHWJ2023A20546
Funding source: Natural Science Foundation of Universities in Anhui Province
Award Identifier / Grant number: 2023AH051742
Award Identifier / Grant number: 2024AH051936
Award Identifier / Grant number: KJ2021A0835
Award Identifier / Grant number: KJ2021ZD0102
Funding source: College Students’ Innovation and Entrepreneurship Training Program in Anhui Province
Award Identifier / Grant number: S202210368042
Award Identifier / Grant number: S202310368052
Award Identifier / Grant number: S202310368121
Funding source: Natural Science Foundation of China
Award Identifier / Grant number: 82201195
Funding source: Natural Science Foundation of Wannan Medical College
Award Identifier / Grant number: WK2023ZZD20
Acknowledgments
The data analysis, article-editing and revising process, and article submission process received careful and kind guidance from Prof. Jianhua Wang, Fudan University Shanghai Cancer Center.
- Research ethics: The study was approved by the Ethics Review Committee of the First Affiliated Hospital of Wannan Medical College (Approval No. 202327).
- Informed consent: Not applicable for the retrospective study.
- Author contributions: KJ and QC designed the study. WZ, YY, RS, WH, GF, and XL were designated as senior medical laboratory scientists and medical laboratory interns to identify errors. CW collected questions and ChatGPT responses. XX, GC, and JY assessed the responses produced by the ChatGPT models. QC analyzed and interpreted the data statistically as well as wrote the manuscript. KJ critically reviewed and improved the manuscript. All authors have accepted responsibility for the entire content of the manuscript and approved its submission.
- Use of Large Language Models, AI and Machine Learning Tools: Not applicable.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: The present study was supported by the Natural Science Foundation of Universities in Anhui Province (grant no. KJ2021A0835, 2023AH051742, KJ2021ZD0102 and 2024AH051936), the Health Research Foundation of Anhui Province (grant no. AHWJ2023A20546), the National College Students’ Innovation and Entrepreneurship Training Program (grant no. 202210368023), the College Students’ Innovation and Entrepreneurship Training Program in Anhui Province (grant no. S202310368052, S202310368121, and S202210368042), Natural Science Foundation of China (grant no. 82201195), and Natural Science Foundation of Wannan Medical College (grant no. WK2024ZQNZ56).
- Data availability: The raw data can be obtained on request from the corresponding author.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/cclm-2025-0089).
© 2025 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Frontmatter
- Editorial
- Macroprolactinaemia – some progress but still an ongoing problem
- Review
- Understanding the circulating forms of cardiac troponin: insights for clinical practice
- Opinion Papers
- New insights in preanalytical quality
- IFCC recommendations for internal quality control practice: a missed opportunity
- Genetics and Molecular Diagnostics
- Evaluation of error detection and treatment recommendations in nucleic acid test reports using ChatGPT models
- General Clinical Chemistry and Laboratory Medicine
- Pre-analytical phase errors constitute the vast majority of errors in clinical laboratory testing
- Improving the efficiency of quality control in clinical laboratory with an integrated PBRTQC system based on patient risk
- IgA-type macroprolactin among 130 patients with macroprolactinemia
- Prevalence and re-evaluation of macroprolactinemia in hyperprolactinemic patients: a retrospective study in the Turkish population
- Defining dried blood spot diameter: implications for measurement and specimen rejection rates
- Screening primary aldosteronism by plasma aldosterone-to-angiotensin II ratio
- Assessment of serum free light chain measurements in a large Chinese chronic kidney disease cohort: a multicenter real-world study
- Beyond the Hydrashift assay: the utility of isoelectric focusing for therapeutic antibody and paraprotein detection
- Direct screening and quantification of monoclonal immunoglobulins in serum using MALDI-TOF mass spectrometry without antibody enrichment
- Effect of long-term frozen storage on stability of kappa free light chain index
- Impact of renal function impairment on kappa free light chain index
- Standardization challenges in antipsychotic drug monitoring: insights from a national survey in Chinese TDM practices
- Potential coeliac disease in children: a single-center experience
- Vitamin D metabolome in preterm infants: insights into postnatal metabolism
- Candidate Reference Measurement Procedures and Materials
- Development of commutable candidate certified reference materials from protein solutions: concept and application to human insulin
- Reference Values and Biological Variations
- Biological variation of serum cholinesterase activity in healthy subjects
- Hematology and Coagulation
- Diagnostic performance of morphological analysis and red blood cell parameter-based algorithms in the routine laboratory screening of heterozygous haemoglobinopathies
- Cancer Diagnostics
- Promising protein biomarkers for early gastric cancer: clinical performance of combined detection
- Infectious Diseases
- The accuracy of presepsin in diagnosing neonatal late-onset sepsis in critically ill neonates: a prospective study
- Corrigendum
- The Unholy Grail of cancer screening: or is it just about the Benjamins?
- Letters to the Editor
- Analytical validation of hemolysis detection on GEM Premier 7000
- Reconciling reference ranges and clinical decision limits: the case of thyroid stimulating hormone
- Contradictory definitions give rise to demands for a right to unambiguous definitions
- Biomarkers to measure the need and the effectiveness of therapeutic supplementation: a critical issue