
Computerized diagnostic decision support systems – a comparative performance study of Isabel Pro vs. ChatGPT4

  • Joe M. Bridges
Published/Copyright: May 7, 2024

Abstract

Objectives

Validate the diagnostic accuracy of the Artificial Intelligence Large Language Model ChatGPT4 by comparing diagnosis lists produced by ChatGPT4 to Isabel Pro.

Methods

This study compared ChatGPT4 to Isabel Pro on 201 cases, with identical inputs to both systems. Mean Reciprocal Rank (MRR) compares the rank of the correct diagnosis between systems. Isabel Pro ranks diagnoses by the frequency with which the presenting symptoms appear in its reference dataset; the mechanism ChatGPT4 uses to rank its diagnoses is unknown. A Wilcoxon Signed Rank Sum test assessed whether the two systems’ rankings differed significantly.

Results

Both systems produced comprehensive differential diagnosis lists. Isabel Pro’s list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced 175 correct diagnoses (87.1 %) and ChatGPT4 165 (82.1 %). The MRR for ChatGPT4 was 0.428 (average rank 2.31) and for Isabel Pro 0.389 (average rank 2.57); each system placed the correct diagnosis at an average rank of three or better. ChatGPT4 outperformed on Recall at Rank 1, 5, and 10, with Isabel Pro outperforming at Rank 20, 30, and 40. The Wilcoxon Signed Rank Sum Test found no statistically significant difference between the systems, although the sample size was inadequate to establish equivalence. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9 %) but only 52 correct DOIs (31.5 %).

Conclusions

This study validates the promise of Clinical Diagnostic Decision Support Systems, including the Large Language Model form of artificial intelligence (AI). Until the issue of hallucination of references and, perhaps, diagnoses is resolved in favor of absolute accuracy, clinicians will make cautious use of Large Language Model systems in diagnosis, if at all.

Introduction

Generative Pre-Trained Large Language Models (LLMs), such as ChatGPT4, became available in the spring of 2023. These systems take artificial intelligence from natural language processing and pattern recognition to a new level of human-like performance. The growth in the use of these models has been remarkable [1]. Enthusiasm for these models has spread to many facets of medicine, including diagnosis [2], [3], [4], [5]. However, validation studies of these systems have been limited to a few articles with small datasets comparing performance to clinicians or medical journal readers [2], [3], [4]. Reference [2] analyzed two cases with ChatGPT responses evaluated by senior clinicians. Reference [3] evaluated 100 cases from the Journal of Neuroradiology, each with a clinical consensus diagnosis. Reference [4] used clinical vignettes designed to test medical students and residents. Each article suggested the need for additional validation before clinical use. Before the arrival of ChatGPT4, several studies had encouraged the use of computerized diagnostic decision support systems (CDDSSs) to improve diagnostic accuracy [5], [6], [7], [8]. The use of CDDSSs, particularly Isabel Pro (itself a system employing artificial intelligence in the form of natural language processing), has been shown to improve clinicians’ diagnostic accuracy in several studies [9], [10], [11, 15]. This study aims to validate the LLM ChatGPT4 as a CDDSS by comparing its performance to Isabel Pro’s on a set of 201 cases from the NEJM and the library of Dr. Charles Friedman, each case with a confirmed clinical consensus diagnosis. This dataset is roughly three to four times larger than those of previous studies and covers numerous medical specialties and patient ages. The study also analyzes a longer differential diagnosis listing than previous studies, many of which were limited to the top 10 diagnoses presented.

Research question

Given that studies have shown a statistically significant improvement in clinicians’ diagnostic accuracy using Isabel Pro [14, 15], does the LLM system ChatGPT4 produce a greater number of accurate diagnoses, ranked higher in the presented differential, than Isabel Pro?

Methods

Software systems

The study employs a commercially available diagnostic decision support system, Isabel Pro, and an LLM artificial intelligence system, ChatGPT4, to produce a differential diagnosis listing.

Isabel Pro

Founded in 1999 by Jason Maude, Isabel Healthcare, Ltd. produces medical diagnosis decision support systems for physicians and patients to improve diagnostic accuracy. One product is Isabel Pro, a CDDSS that takes minimal input: the patient’s age, sex at birth, pregnancy status if female, travel history if outside North America, and presenting symptoms selected from a drop-down list or entered manually. A proprietary search algorithm matches the symptoms against disease conditions in a proprietary, high-quality database of medical references, including the Merck Manual Professional, Cochrane Reports, and similar reference texts. The program lists the diagnoses in order of the frequency with which the symptoms appear in the reference dataset for a particular disease condition. Isabel Pro’s reference dataset is updated monthly. The study by Riches et al. [12] documented the performance of Isabel Pro against several CDDSSs, and the study by Sibbald et al. [14] documented the improvement (7–8 %) in clinicians’ diagnostic accuracy when using Isabel Pro.
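Isabel Pro’s matching algorithm and reference dataset are proprietary, so the following is only a minimal toy sketch of the general idea of frequency-based ranking; the conditions, reference snippets, and scoring are illustrative inventions, not Isabel Pro’s actual data or logic.

```python
# Toy illustration of frequency-based ranking (NOT Isabel Pro's algorithm):
# each candidate condition is scored by how often the presenting symptoms
# appear in a small, invented snippet of reference text for that condition.
from collections import Counter

reference_texts = {  # hypothetical mini "reference dataset"
    "Renal cell carcinoma": "kidney mass weight loss varicocele fatigue hematuria",
    "Pneumonia": "cough fever fatigue shortness of breath",
    "Iron deficiency anemia": "fatigue weight loss pallor",
}

def rank_by_symptom_frequency(symptoms, references):
    """Return conditions ordered by symptom-term frequency in their text."""
    scores = Counter()
    for condition, text in references.items():
        words = text.lower().split()
        scores[condition] = sum(words.count(s.lower()) for s in symptoms)
    return [condition for condition, _ in scores.most_common()]

print(rank_by_symptom_frequency(["mass", "fatigue", "varicocele"], reference_texts))
```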

ChatGPT4

ChatGPT4 is the most recent version of a Generative Pre-trained Transformer (GPT) developed by OpenAI, released on March 14, 2023. The company reports that the system can read, analyze, and generate up to 25,000 words and write code in all major programming languages. An LLM artificial intelligence system relies on a training set from which the system extracts information to perform various tasks; for this study, the task was producing a differential diagnosis. ChatGPT4 is trained on Common Crawl, a publicly available dataset and one of the most extensive text datasets. While not trained explicitly for medical diagnosis, ChatGPT4’s training material includes medical textbooks, medical websites, medical papers, and other reference material up to its most recent training date, January 2022.

Datasets

This study used 201 cases, each with a confirmed clinical consensus diagnosis. Thirty-six cases came from Dr. Charles P. Friedman [13]; these were used in the Sibbald et al. study [14] and, with Dr. Friedman’s consent, in the author’s Translational Project paper [15]. The remaining cases came from the NEJM: 50 cases used in a recent paper by Fritz et al. [16], 10 cases used in a paper co-authored by Dr. Mark Graber [10], and 105 cases previously published on the Isabel Healthcare website. This dataset is three to four times larger than those of previous studies and covers a wide range of patients by sex and age and a wide variety of disease conditions. Table 1 shows the study dataset demographics, with a male/female ratio of 1.25; 26 % of patients were aged 50–64, 20 % 65 and over, 18 % 30–39, 13 % 40–49, and 12 % 17–29. Medical specialties comprising 65 % of the cases are infectious diseases, neoplasms, rheumatology, respiratory, cardiovascular, hematology, endocrinology, and gastrointestinal.

Table 1:

Study dataset demographics. Sex at birth and medical specialty by age range.

Age range of patients, years
Total 29 days–1 year 1–5 6–12 13–16 17–29 30–39 40–49 50–64 65-Over

Sex

Female 89 0 0 4 3 13 15 15 18 21
Male 112 5 3 2 1 12 22 12 35 20
Total 201 5 3 6 4 25 37 27 53 41

Medical specialty

Infectious diseases 41 1 1 0 0 7 9 2 10 11
Neoplasms 26 0 2 1 0 0 6 3 8 6
Rheumatology 13 0 0 0 1 1 2 4 3 2
Respiratory 11 1 0 1 0 1 2 2 2 2
Cardiovascular 11 0 0 0 0 0 1 0 8 2
Hematology 10 0 0 0 1 2 1 2 3 1
Endocrinology 9 0 0 0 1 1 1 1 5 0
Gastrointestinal 9 0 0 2 0 2 2 1 1 1
Neurology 8 0 0 0 0 3 3 1 0 1
Neuromuscular 6 0 0 0 1 0 1 1 2 1
Orthopedics 6 0 0 0 0 1 1 2 2 0
Immunology 6 0 0 1 0 0 0 1 1 3
Metabolic diseases 6 0 0 0 0 0 2 1 1 2
Vascular diseases 5 0 0 0 0 0 0 1 1 3
Gynecology 4 1 0 0 0 0 1 1 1 0
Psychology 4 2 0 0 0 0 0 1 1 0
Toxicology 4 0 0 0 0 0 1 2 1 0
Ear, nose and throat 3 0 0 0 0 1 1 0 0 1
Environmental 2 0 0 0 0 0 0 0 0 2
Obstetrics 2 0 0 0 0 1 0 1 0 0
Ophthalmology 2 0 0 0 0 1 0 0 1 0
Urology 2 0 0 0 0 1 0 0 0 1
Allergies 1 0 0 0 0 0 0 0 0 1
Neoplasm/rheumatology 1 0 0 0 0 0 0 0 0 1
Neurology/respiratory 1 0 0 0 0 0 0 0 1 0
Totals 201 5 3 6 4 25 37 27 53 41

Since the NEJM cases are in the public domain, an unfair advantage might accrue to ChatGPT4. However, in none of the 201 cases was the NEJM listed as a source for the diagnosis presented, and when asked to find a specific NEJM case, ChatGPT4’s response was completely inaccurate as to both case information and diagnosis (see Appendix 1).

Input and output

Each system was given identical information on each case and asked to produce a differential diagnosis listing in order of likelihood or match of symptoms. Isabel Pro’s input is structured, and the output is a similarly structured presentation, with the diagnoses ordered by the frequency with which the presenting symptoms appear in the reference dataset. The input information was the patient’s age or age range, sex assigned at birth, pregnancy status (if female), travel history if outside North America, and presenting symptoms.

ChatGPT4 employs an inquiry format applicable to unstructured requests and the possibility of recurrent exchanges on the same query topic. This study used a structured request format for ChatGPT4, mirroring the Isabel Pro input format. Each request to ChatGPT4 ended with a requirement to list the diagnoses in order of likelihood of the condition, produce at least 40 diagnoses, and cite all reference material consulted with a complete citation, DOI, and the basis for the ranking.
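As an illustration only, each structured request could be assembled programmatically along the following lines; the helper function and field names are hypothetical, but the closing requirements echo the prompt wording reproduced in Appendix 2.

```python
# Hypothetical helper for assembling the structured ChatGPT4 request used in
# this study; the wording of the closing requirements follows Appendix 2.
def build_prompt(age_desc: str, sex: str, travel: str, symptoms: list[str]) -> str:
    symptom_list = ", ".join(symptoms)
    return (
        f"Prepare a differential diagnosis listing for a {age_desc} {sex} patient, "
        f"{travel}, presenting with {symptom_list}. "
        "List all likely diagnoses, but at least 40, in order of likelihood. "
        "List all references used in preparing the differential and ensure that "
        "all references are the most current in your dataset."
    )

print(build_prompt(
    "63-year-old", "male", "travel limited to North America",
    ["mass in kidney", "pulmonary nodules", "nausea", "loss of appetite"],
))
```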

NEJM Case 17-2008 is an example of the input information (see Appendix 2). Both systems produced the correct diagnosis, “Renal cell carcinoma”, as the first differential presented. This procedure was followed in each of the 201 cases, recording the correct diagnosis, if produced, and its rank within the differential listing produced by each system. If the correct diagnosis failed to appear, the diagnosis was listed as N/A and ranked 50 for analytical purposes. Reference citations and DOIs were checked and marked as correct, incorrect, not found, or not produced.

Statistical methods

Table 3 tabulates the total number of correct diagnoses for each system, the number of cases in which one system’s correct diagnosis outranked the other’s, the number of cases in which the systems produced the same rank, and the number of cases in which neither system produced a correct diagnosis. The study compared the systems using the Mean Reciprocal Rank (MRR), defined as the average of the inverse of the correct diagnosis’s rank across all cases. Since there is only one correct answer for each case, the MRR reflects the average rank at which the correct answer appears over the set of requests evaluated. The study also calculated, for each system, the frequency with which the correct diagnosis appeared at a rank equal to or less than 1, 5, 10, 20, 30, and 40. This figure, Recall at Rank, was calculated by scoring 1 for each case whose correct diagnosis ranked at or better than the evaluated rank and averaging over all cases; it is the frequency with which a correct diagnosis appears at a ranking equal to or less than the evaluated rank.
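A minimal sketch of these two metrics, assuming each case contributes the rank of the correct diagnosis for one system and that missed diagnoses are recorded as rank 50 (the convention described under Input and output), might look like this; the ranks shown are hypothetical.

```python
# Sketch of Mean Reciprocal Rank and Recall at Rank as defined above.
def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the correct diagnosis over all cases."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_rank(ranks, k):
    """Fraction of cases whose correct diagnosis appears at rank <= k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

example_ranks = [1, 3, 50, 2, 7]  # hypothetical; 50 = correct diagnosis missed
print(round(mean_reciprocal_rank(example_ranks), 3))
for k in (1, 5, 10, 20, 30, 40):
    print(k, recall_at_rank(example_ranks, k))
```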

Upon completion of all 201 cases, a Wilcoxon Signed Rank Sum test, a nonparametric hypothesis test, assessed whether the two systems’ rankings of the correct diagnosis showed a statistically significant difference (see Table 4).
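For illustration, such a paired comparison can be run with scipy.stats.wilcoxon; the per-case ranks below are hypothetical, not the study data.

```python
# Paired, nonparametric comparison of per-case ranks (hypothetical values).
from scipy.stats import wilcoxon

isabel_ranks  = [1, 3, 50, 2, 7, 4, 1, 12, 6, 50]   # hypothetical
chatgpt_ranks = [1, 2, 50, 5, 3, 4, 2, 10, 8, 50]   # hypothetical

# Two-sided test of the null hypothesis of no difference between systems;
# pairs with a zero difference are dropped ("wilcox" zero handling).
stat, p_value = wilcoxon(isabel_ranks, chatgpt_ranks, zero_method="wilcox")
print(stat, p_value)
```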

Ethical aspects

The cases evaluated contain no personal health information. All NEJM cases are publicly available on the publication’s website, NEJM.org. The author used the cases from Dr. Charles Friedman with permission. The author reports no conflicts of interest, and the study required no outside funding sources.

Results

Qualitative results

Table 2 contains an alphabetical list of the diagnoses missed by each system. As might be expected, the missed diagnoses are often rare or very low-incidence conditions, or conditions whose symptoms develop over time.

Table 2:

Missed diagnoses.

Diagnoses missed by Isabel Pro
  1. Adenovirus infection, disseminated

  2. Anemia due to parvovirus B19 infection and autoimmune hemolytic anemia

  3. Aortic dissection

  4. Blastomycosis

  5. Burkitt-like lymphoma with 11q aberrations

  6. Cardiac amyloidosis

  7. Colonic mass due to infection with an intravascular nematode, most likely Angiostrongylus costaricensis

  9. Cryoglobulinemia

  9. Cryptococcal meningitis

  10. Gaucher’s disease

  11. Granulomatosis with polyangiitis

  12. Hairy cell leukemia

  13. Hypokalemic periodic paralysis

  14. Miliary (disseminated) TB

  15. Non-Hodgkin’s lymphoma

  16. Ocular syphilis

  17. Papulonecrotic tuberculid (cutaneous tuberculosis)

  18. Pheochromocytoma

  19. Polyarteritis nodosa

  20. Porphyria (cutanea tarda)

  21. Post-traumatic stress disorder

  22. Pulmonary artery aneurysm that was most likely due to the Hughes–Stovin syndrome

  23. Renal lymphangiectasis (hygroma renalis)

  24. Rickettsiosis of the spotted fever group

  25. Silicosis

  26. The drug reaction with eosinophilia and systemic symptoms (DRESS) syndrome, associated with human herpesvirus 6 infection or reactivation

Diagnoses missed by ChatGPT4

  1. Acrodermatitis enteropathica

  2. Acute cellulitis due to P. multocida, followed by glandular and systemic infection with F. tularensis (tularemia), both after a cat bite

  3. Adenocarcinoma of the lung (non-small-cell lung cancer)

  4. Anaplasma phagocytophilum infection (human granulocytotropic anaplasmosis)

  5. Autoimmune lymphoproliferative syndrome (ALPS) due to a germline FAS mutation

  6. Babesiosis

  7. Brown tumor associated with osteitis fibrosa cystica

  8. Catastrophic antiphospholipid syndrome

  9. Cushing’s syndrome due to ectopic secretion of adrenocorticotropic hormone from small-cell lung cancer

  10. Ductal adenocarcinoma of the prostate

  11. Gamma heavy-chain disease

  12. Gastric mucormycosis caused by Rhizopus species

  13. Gaucher’s disease

  14. Hairy cell leukemia

  15. Hemolytic uremic syndrome

  16. Human granulocytic anaplasmosis

  17. Hypersensitivity pneumonitis

  18. Infected intracardiac foreign body

  19. Inflammatory cerebral amyloid angiopathy

  20. Intestinal tuberculosis

  21. Malignant melanoma with features resembling a Spitz nevus

  22. Mucormycosis

  23. Mucous-membrane pemphigoid

  24. Multiple Tarlov (perineurial) cysts

  25. Osteomalacia

  26. Pernicious anemia

  27. Polyarteritis nodosa

  28. Primary lung adenocarcinoma with metastasis involving the mediastinal lymph nodes

  29. Pulmonary artery aneurysm that was most likely due to the Hughes–Stovin syndrome

  30. Renal lymphangiectasis (hygroma renalis)

  31. Silicosis

  32. Subarachnoid hemorrhage

  33. Susac’s syndrome

  34. Susac’s syndrome (retinocochleocerebral vasculopathy)

  35. Thrombotic thrombocytopenic purpura

  36. Urothelial-type adenocarcinoma of the prostatic urethra

Quantitative results

Overall, the two systems performed comparably on the 201 cases: Isabel Pro produced 175 correct diagnoses (87.1 %) and ChatGPT4 165 (82.1 %). ChatGPT4’s correct diagnosis outranked Isabel Pro’s in 83 cases, Isabel Pro’s outranked ChatGPT4’s in 75 cases, and the two systems produced the same rank in 37 cases. ChatGPT4 more often returned the correct diagnosis at rank 1 and within the top 5 and top 10, while Isabel Pro more often returned it within the top 20, top 30, and top 40 and ultimately produced more correct diagnoses (see Table 3). The MRR estimates the average rank at which a system returns the single correct diagnosis, making it a relevant yardstick for these two systems. ChatGPT4 produced an MRR of 0.431, corresponding to the correct diagnosis appearing at an average rank of 2.32, compared with Isabel Pro at 0.389, an average rank of 2.57. Both systems presented the correct diagnosis at an average rank of three or better (Figures 1 and 2).

Table 3:

Diagnostic accuracy – comparison of results by system.

Total cases 201 Total cases 201
Isabel Pro out-ranks ChatGPT4 75 ChatGPT4 out-ranks Isabel Pro 83
Both correct tied 37 Both correct tied 37
Neither right 6 Neither right 6
Isabel Pro correct diagnoses 175 ChatGPT4 correct diagnoses 164
Figure 1: Diagnostic retrieval accuracy comparison by recall at rank.

Figure 2: Case sequence recall at rank comparison: Isabel Pro vs. ChatGPT4.

System equivalence

The Wilcoxon Signed Rank Sum Test assesses the median difference between two paired, nonparametric samples. The comparison of Isabel Pro’s and ChatGPT4’s ranks is such a paired, nonparametric sample, with the null hypothesis postulating no difference between the two systems. Rejecting the null hypothesis would require a p-value equal to or less than 0.05. Table 4 shows the results of all three hypothesis tests, each exceeding a p-value of 0.05 by a wide margin. The Wilcoxon test therefore fails to reject the null hypothesis: the data provide no evidence that the two systems’ rank distributions differ beyond chance. If the sample estimate of the effect size (r=0.01) is accurate in the general population, the analysis would require more than 80,000 cases to produce a statistically significant p-value and conclude that the systems are not equivalent. Without sufficient cases, we cannot definitively reject or accept the null hypothesis.

Table 4:

Wilcoxon ranked sum hypothesis test.

Null hypothesis Alternative hypothesis W z p r
The variable ChatGPT4 rank has smaller or equal values than the variable Isabel Pro rank The variable ChatGPT4 rank has larger values than the variable Isabel Pro rank 6202.5 −0.14 0.446 0.01
The variable Isabel Pro rank has smaller or equal values than the variable ChatGPT4 rank The variable Isabel Pro rank has larger values than the variable ChatGPT4 rank 6202.5 −0.14 0.554 0.01
There is no difference between the variables Isabel Pro rank and ChatGPT4 rank There is a difference between the variables Isabel Pro rank and ChatGPT4 rank 6202.5 −0.14 0.892 0.01
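As a rough check on the sample-size figure quoted above, the common approximation r = z/√N for the Wilcoxon effect size, combined with an assumed two-sided alpha of 0.05 and 80 % power (assumptions not stated in the text), yields a requirement on the order of 80,000 cases.

```python
# Back-of-the-envelope sample size from an effect size of r = 0.01, assuming
# the approximation r = z / sqrt(N), two-sided alpha = 0.05, and 80 % power.
from scipy.stats import norm

r = 0.01                          # observed effect size (Table 4)
z_alpha = norm.ppf(1 - 0.05 / 2)  # ~1.96
z_beta = norm.ppf(0.80)           # ~0.84

n_required = ((z_alpha + z_beta) / r) ** 2
print(round(n_required))          # ~78,500 cases, on the order of 80,000
```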

Diagnosis ranking explanation

A particularly concerning issue is the question, “How was a diagnosis determined?” Isabel Pro uses the frequency with which presenting symptoms appear in the medical reference database to rank the differential listing. This study asked ChatGPT4 to list the diagnoses in order of likelihood, to cite the references consulted in producing each diagnosis, and to state the basis for the ranking of each diagnosis. The responses, however, were short and inadequate to explain the mechanism for producing and ranking a diagnosis. The responses to the request “provide the basis for the ranking of the diagnosis” were uniformly brief, such as “It is often the case that … ” followed by a statement that a symptom “often” indicates a particular disease condition. In almost every response, only one reference was listed, so no accurate conclusion could be drawn about the mechanism for ranking the differential.

Reference ‘hallucination’

A study by Walters and Wilder examined the propensity for ChatGPT to “hallucinate” or fabricate bibliographic references [17]. For this study, the author searched the literature to find and confirm the validity of both the reference and the DOI for each reference cited by ChatGPT4. While 145 references were confirmed correct (87.9 %), only 52 DOIs were correct (31.5 %).

Hallucination, or more correctly, fabrication, falls into two categories: first, the article is cited very smoothly, with authors, journal, volume, issue, and page numbers, but much of the information is incorrect; second, the article cited cannot be found at all despite a very diligent internet search. As mentioned, ChatGPT4 produced only 145 (87.9 %) correct references, and each could be confirmed only by a thorough internet search. ChatGPT4 gave no indication of which references were correct, so every reference had to be checked, significantly eroding confidence in the accuracy of a given diagnosis.
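One way such a DOI check could be automated is to query the public Crossref REST API; the sketch below illustrates that approach and is not the manual search procedure the author used.

```python
# Check whether a DOI is registered by querying the Crossref REST API.
import requests

def doi_is_registered(doi: str) -> bool:
    """Return True if Crossref resolves the DOI, False otherwise."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Example using a DOI from this article's reference list ([17]).
print(doi_is_registered("10.1038/s41598-023-41032-5"))
```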

Discussion

When asked to produce a list of diagnoses based on demographics and presenting symptoms, the diagnostic accuracy of Isabel Pro and ChatGPT4 was very close. While Isabel Pro’s reference dataset and ranking algorithm are proprietary, the reference dataset is derived from the most respected sources and has been updated monthly for more than 29 years. The ranking system is straightforward and clearly stated: the frequency with which the presenting symptoms appear in the literature for a given disease condition. In contrast, ChatGPT4’s ranking methodology is unknown, the system did not offer a satisfactory explanation for the basis of its ranking, and the references it cites to accompany its diagnoses are far too often fabricated.

Regarding response speed, the differential diagnosis listing produced by Isabel Pro appears within seconds, most often immediately upon submission, a significant feature in today’s hurried primary care practices. ChatGPT4, on the other hand, types out its differential diagnosis listing in batches of roughly 10 diagnoses and requires repeated requests for additions to the list, a time-consuming process that can take 10 minutes or longer. Voice commands modestly improve the speed of retrieval.

Rather than comparing ChatGPT4 to a relatively small group of clinicians or journal readers, this study compares it to a computerized diagnostic decision support system, Isabel Pro, shown in several studies to improve clinicians’ diagnostic accuracy by 7–9 % [10, 11]. The dataset of cases is drawn from highly regarded sources (the New England Journal of Medicine and the University of Michigan Medical School) and is three to four times larger than those of previous studies. Compared to previous studies, the dataset covers a broader range of medical specialties and patient sexes and ages, and each case carries a confirmed clinical consensus diagnosis for validation purposes. The study employs established statistical comparison methods – Mean Reciprocal Rank, Recall at Rank, and the Wilcoxon Signed Rank Sum Test – and extends the comparisons beyond the top 10 diagnoses presented. The study also highlights a significant weakness of ChatGPT4 in the fabrication of references, a failing that should preclude its use in clinical practice until corrected.

Estimates place the number of disease conditions as high as 10,000, with the number of presenting symptoms estimated at 200. Even a dataset of 201 cases is therefore unlikely to capture the full range of diagnostic challenges. For example, the NEJM cases are diagnostically challenging, but a new case is presented every two weeks with little or no duplication of disease conditions over several decades. This study also does not address the use of either system in routine clinical practice, nor does any previous study.

Previous studies have assessed the performance of ChatGPT4 compared to clinicians or journal readers but have not evaluated which computerized diagnostic decision support system might be the better choice for the diagnosing clinician; the implicit assumption in these studies is that ChatGPT4 is the option of choice for computerized assistance. Earlier studies have also often limited the differential listing to the top 10, an arbitrary limit that, in the author’s opinion, unnecessarily hampers the diagnostic process: it reduces the diagnostic accuracy of Isabel Pro and ChatGPT4 to 65 and 69 % in the top 10, compared with 78 and 76 % in the top 20 and 87 and 82 % in the top 40. A differential listing of 40 rather than 10 thus offers a 33 % improvement in diagnostic accuracy. While considering a list of differential diagnoses with more than 20 possibilities might discourage a novice, ruling in or out many alternatives is a task at which the experienced physician is quite adept [18].

The main takeaway from this study is concern about how ChatGPT4 derives its differential diagnosis listing. The process is a “black box”, unstated and unclear, and the system often fabricates references, suggesting great caution in using it for diagnosis. The fabrication of references is a relatively common complaint from those using and studying AI systems such as ChatGPT4 [17, 19], but in diagnosis it is a serious failing that cannot be tolerated. ChatGPT4 and similar AI systems are unlikely to be used by clinicians until and unless the system is specifically trained on the best medical reference sources and validated on enough cases to confirm its diagnostic accuracy. The individual clinician using Isabel Pro can click on a given diagnosis and be taken instantly to the Merck Manual Professional or whatever medical reference source is chosen; even a dozen or more diagnostic alternatives could be verified in just a few minutes. Not so with ChatGPT4. ChatGPT4 unequivocally fabricates references even when the accurate diagnosis is known and presented. One can only imagine the clinical difficulty of relying on such a system when diagnosing a yet unknown condition.

Both systems failed to diagnose several cases correctly. A future study might concentrate on those cases to determine the causes and propose improvements to address those failings. A future study might also concentrate on rare diseases to determine the capability of these systems to diagnose such conditions, a capability that might be most helpful to the primary care physician. Methods of integrating these systems into routine clinical practice were not a goal of this study, but smooth integration into the clinical workflow is a crucial system component yet to be solved. For ChatGPT4, usage in clinical diagnosis will be unlikely until the issue of fabricating references is resolved in favor of accurate and unequivocal citations.

Conclusions

Isabel Pro and ChatGPT4 displayed roughly comparable performance in producing differential diagnosis lists that contained the correct diagnosis. The diagnostic retrieval accuracy of both systems approached 90 %. Both Isabel Pro and ChatGPT4 have the elements needed for a CDDSS to warrant use by clinicians: retrieval of accurate diagnoses, speedy response to inquiries, and presentation of accurate and comprehensive differential diagnosis lists. Both systems suffer from the requirement for re-entry of patient demographics and presenting symptoms to produce the differential.

Isabel Pro is a finely crafted system that is easy to use, fast, and accurate and allows for quickly ruling diagnoses in or out by linking each presented diagnosis to the best medical reference for rapid consultation. Isabel Pro is commercially available, relatively inexpensive, and time-tested for over 25 years.

ChatGPT4 showed impressive performance in diagnosing the cases used in this study despite no explicit training on the best and most current medical reference sources. However, the “black box” nature of the LLM’s methods is unappealing to the clinician, and its frequent fabrication of references limits trust in the accuracy of the results. This study attempted to address this issue by requesting that the system cite the reference consulted in producing each diagnosis with a complete citation, DOI, and basis for the ranking. While helpful for many diagnoses, the system often fabricated citations and, even more often, fabricated DOIs. These failings cast doubt on a clinician’s willingness to use the results confidently in the clinical setting, where time is short and the diagnosis is yet to be confirmed. The time required to wait for the sequential typing of the differential listing and the repeated requests to complete the listing will further hinder the system’s use in clinical settings.

This study highlights and validates the promise of CDDSSs in reducing diagnostic inaccuracy in medicine. This study also highlights and validates the promising future of the Large Language Model form of artificial intelligence in diagnosis. However, until the advent of an LLM trained on and with continuous access to the best and most recent medical reference material and until the issue of reference citations resolves in favor of strict and unequivocal accuracy, clinicians will likely make cautious use of artificial intelligence systems in diagnosis, if at all.

CDDSSs are not perfect, but studies show they improve clinicians’ diagnostic accuracy regardless of experience level. The search for higher accuracy in a CDDSS is continuing and should continue since diagnostic accuracy among physicians in the US is estimated to be only 85–95 %, suggesting 12 million inaccurate diagnoses annually at the low end of the estimate. Anything that can significantly improve the accuracy of what is arguably the most challenging task undertaken by humans and one that, if incorrect, results in patient harm, even death, is well worth pursuing with as much effort as possible.


Corresponding author: Joe M. Bridges, DHI, Adjunct Assistant Professor, D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street UCT600, Houston, TX 77030, USA, E-mail:

Acknowledgments

The author wishes to acknowledge Dr. Xiaoqian Jiang, Professor, Dr. Han Chen, Associate Professor, and Dr. Kirk Roberts, Associate Professor, for their assistance in statistical comparison methods.

  1. Research ethics: This study required no funding source. The author notes no conflicts of interest. The study included no Personal Health Information in any of the cases. No ethical approvals were necessary.

  2. Informed consent: Not applicable.

  3. Author contributions: The author has accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Competing interests: The author states no conflict of interest.

  5. Research funding: None declared.

  6. Data availability: The raw data can be obtained on request from the corresponding author.

Appendix 1

Please find New England Journal of Medicine Case 2-2008.

ChatGPT

I found a reference to Case 2-2008 from the New England Journal of Medicine (NEJM) related to rheumatic mitral stenosis associated with partial anomalous pulmonary venous return (VIVO). However, for detailed information or to access the full article, you would need to visit the NEJM’s website or access their archives, as direct links or full article contents were not provided in the search results.

Was there a final diagnosis in the reference you found and, if so, what was it?

ChatGPT

Yes, the final diagnosis in Case 2-2008 from the New England Journal of Medicine was “rheumatic mitral stenosis associated with partial anomalous pulmonary venous return” (VIVO). This diagnosis points to a condition where the mitral valve is narrowed (mitral stenosis), typically due to rheumatic fever, and is accompanied by a congenital heart defect where some of the pulmonary veins (which carry oxygenated blood from the lungs back to the heart) are connected to the heart in an abnormal location (partial anomalous pulmonary venous return).

However, New England Journal of Medicine Case 2-2008 is actually:

Case 2-2008: a 38-year-old woman with postpartum visual loss, shortness of breath, and renal failure, with a final diagnosis of:

Postpartum renal failure due to thrombotic microangiopathy, associated with antiphospholipid antibodies.

Appendix 2

New England Journal of Medicine Case 17-2008 – A 63-year-old man with metastatic renal-cell carcinoma: (N Engl J Med 2008; 358:2389–96 DOI: 10.1056/NEJMcpc0802449):

  1. Isabel Pro entries

    1. Adult 50–64 years, male, travel limited to North America

    2. Mass in kidney, pulmonary nodules, nausea, loss of appetite, abdominal pain, constipation, fatigue, weight loss, scrotal varicocele, nonproductive cough

  2. ChatGPT4 entries

    1. Prepare a differential diagnosis listing for a 63-year-old male patient, travel limited to North America, presenting with mass in kidney, pulmonary nodules, nausea, loss of appetite, abdominal pain, constipation, fatigue, weight loss, scrotal varicocele, nonproductive cough. List all likely diagnoses, but at least 40, in order of likelihood. List all references used in preparing the differential and ensure that all references are the most current in your dataset.

References

1. Wachter, RM, Brynjolfsson, E. Will generative artificial intelligence deliver on its promise in health care? JAMA 2024;33:65–9. https://doi.org/10.1001/jama.2023.25054.

2. Liu, X, Song, Y, Lin, H, Xu, Y, Chen, C, Yan, C, et al. Evaluating ChatGPT as an adjunct for analyzing challenging case. Blood 2023;142:7273. https://doi.org/10.1182/blood-2023-181518.

3. Horiuchi, D, Tatekawa, H, Shimono, T, Walston, SL, Takita, H, Matsushita, S, et al. Accuracy of ChatGPT generated diagnosis from patient’s medical history and imaging findings in neuroradiology cases. Neuroradiology 2024;66:73–9. https://doi.org/10.1007/s00234-023-03252-4.

4. Hailu, R, Beam, A, Mehrotra, A. ChatGPT-assisted diagnosis: is the future suddenly here? STAT 2023.

5. Dave, T, Athaluri, SA, Singh, S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell 2023;6:1169595. https://doi.org/10.3389/frai.2023.1169595.

6. Hirosawa, T, Kawamura, R, Harada, Y, Mizuta, K, Tokumasu, K, Kaji, Y, et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med Inform 2023;11:e48808. https://doi.org/10.2196/48808.

7. Eriksen, AV, Möller, S, Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 2023;1. https://doi.org/10.1056/aip2300031.

8. Kanjee, Z, Crowe, B, Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 2023;330:78. https://doi.org/10.1001/jama.2023.8288.

9. Graber, ML, Mathew, A. Performance of a web-based clinical diagnosis support system for internists. J Gen Intern Med 2008;23:37–40. https://doi.org/10.1007/s11606-007-0271-8.

10. Bond, WF, Schwartz, LM, Weaver, KR, Levick, D, Giuliano, M, Graber, ML. Differential diagnosis generators: an evaluation of currently available computer programs. J Gen Intern Med 2012;27:213–9. https://doi.org/10.1007/s11606-011-1804-8.

11. Graber, ML. Reaching 95 %: decision support tools are the surest way to improve diagnosis now. BMJ Qual Saf 2022;31:415–8. https://doi.org/10.1136/bmjqs-2021-014033.

12. Riches, N, Panagioti, M, Alam, R, Cheraghi-Sohi, S, Campbell, S, Esmail, A, et al. The effectiveness of electronic differential diagnoses (DDX) generators: a systematic review and meta-analysis. PLoS One 2016;11:e0148991. https://doi.org/10.1371/journal.pone.0148991.

13. Friedman, CP, Elstein, AS, Wolf, FM, Murphy, GC, Franz, TM, Heckerling, PS, et al. Enhancement of clinicians’ diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA 1999;282:1851. https://doi.org/10.1001/jama.282.19.1851.

14. Sibbald, M, Monteiro, S, Sherbino, J, LoGiudice, A, Friedman, C, Norman, G. Should electronic differential diagnosis support be used early or late in the diagnostic process? A multicentre experimental study of Isabel. BMJ Qual Saf 2021. https://doi.org/10.1136/bmjqs-2021-013493.

15. Bridges, JM. Evaluation, validation, and implementation of a computerized diagnostic decision support system in primary care. Houston, Texas: University of Texas Health Science Center at Houston, D. Bradley McWilliams School of Biomedical Informatics; 2022.

16. Fritz, P, Kleinhans, A, Raoufi, R, Sediqi, A, Schmid, N, Schricker, S, et al. Evaluation of medical decision support systems (DDX generators) using real medical cases of varying complexity and origin. BMC Med Inf Decis Mak 2022;22:254. https://doi.org/10.1186/s12911-022-01988-2.

17. Walters, WH, Wilder, EI. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 2023;13:14045. https://doi.org/10.1038/s41598-023-41032-5.

18. Graber, ML. Reaching 95 %: decision support tools are the surest way to improve diagnosis now. BMJ Qual Saf 2022;31:415–8. https://doi.org/10.1136/bmjqs-2021-014033.

19. Liu, J, Wang, C, Liu, S. Utility of ChatGPT in clinical practice. J Med Internet Res 2023;25:e48568. https://doi.org/10.2196/48568.

Received: 2024-02-15
Accepted: 2024-04-22
Published Online: 2024-05-07

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
