
Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician?

  • Kazuya Mizuta, Takanobu Hirosawa, Yukinori Harada and Taro Shimizu
Published/Copyright: March 12, 2024
From the journal Diagnosis, Volume 11, Issue 3

Abstract

Objectives

The potential of artificial intelligence (AI) chatbots, particularly the fourth-generation chat generative pretrained transformer (ChatGPT-4), in assisting with medical diagnosis is an emerging research area. While significant emphasis has been placed on generating differential diagnosis lists, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in these lists. This short communication aimed to assess the accuracy of ChatGPT-4 in evaluating differential diagnosis lists compared with medical professionals' assessments.

Methods

We used ChatGPT-4 to evaluate whether the final diagnosis was included in the top 10 differential diagnosis lists created by physicians, ChatGPT-3, and ChatGPT-4, using clinical vignettes. Eighty-two clinical vignettes were used, comprising 52 complex case reports published by the authors from the department and 30 mock cases of common diseases created by physicians from the same department. We compared the agreement between ChatGPT-4 and the physicians on whether the final diagnosis was included in the top 10 differential diagnosis lists using the kappa coefficient.

Results

Three sets of differential diagnoses were evaluated for each of the 82 cases, resulting in a total of 246 lists. The agreement rate between ChatGPT-4 and physicians was 236 out of 246 (95.9 %), with a kappa coefficient of 0.86, indicating very good agreement.
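The kappa coefficient reported above corrects the raw agreement rate for agreement expected by chance. As a minimal sketch, Cohen's kappa can be computed from a 2×2 agreement table; note that the cell counts below are hypothetical, chosen only to be consistent with the reported totals (236/246 agreements across 246 lists), since the abstract does not give the full cross-tabulation.

```python
def cohen_kappa(table):
    """Cohen's kappa for a 2x2 agreement table.

    table[i][j] = number of lists where the physician gave verdict i
    and ChatGPT-4 gave verdict j (0 = final diagnosis not included,
    1 = included).
    """
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of lists on the diagonal
    p_o = sum(table[i][i] for i in range(2)) / n
    # Chance agreement from the row and column marginals
    row = [sum(table[i]) for i in range(2)]
    col = [sum(table[i][j] for i in range(2)) for j in range(2)]
    p_e = sum(row[i] * col[i] for i in range(2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical split of the 246 lists: 236 agreements (38 "not
# included" + 198 "included") and 10 disagreements.
table = [[38, 5], [5, 198]]
print(round(cohen_kappa(table), 2))  # → 0.86
```

A kappa of 0.86 falls in the "very good agreement" band (above 0.80) of the conventional interpretation scale, which is why the raw 95.9 % agreement and the chance-corrected statistic support the same conclusion.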

Conclusions

ChatGPT-4 demonstrated very good agreement with physicians in evaluating whether the final diagnosis should be included in the differential diagnosis lists.


Corresponding author: Takanobu Hirosawa, MD, PhD, Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, 880 Kitakobayashi, Mibu-cho, Simotsuga-gun, Tochigi, 321-0293, Japan, Phone: +81 282 86 1111, Fax: +81 282 86 4775, E-mail:

Acknowledgments

This study was made possible using the resources from the Department of Diagnostic and Generalist Medicine, Dokkyo Medical University.

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Competing interests: The authors state no conflict of interest.

  5. Research funding: None declared.

  6. Data availability: Not applicable.



Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/dx-2024-0027).


Received: 2024-02-09
Accepted: 2024-02-22
Published Online: 2024-03-12

© 2024 Walter de Gruyter GmbH, Berlin/Boston
