Abstract
Objectives
To assess the diagnostic accuracy of the artificial intelligence Large Language Model ChatGPT4 by comparing the differential diagnosis lists it produces with those of Isabel Pro.
Methods
This study compared ChatGPT4 with Isabel Pro on 201 cases, with identical inputs submitted to both systems. Mean Reciprocal Rank (MRR) compares the rank of the correct diagnosis between systems. Isabel Pro ranks diagnoses by the frequency with which the presenting symptoms appear in its reference dataset; the mechanism ChatGPT4 uses to rank its diagnoses is unknown. A Wilcoxon Signed Rank Sum test assessed whether the two systems' rankings differed significantly.
Results
Both systems produced comprehensive differential diagnosis lists. Isabel Pro’s list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced 175 correct diagnoses (87.1 %) and ChatGPT4 165 (82.1 %). The MRR was 0.428 for ChatGPT4 (average rank 2.31) and 0.389 for Isabel Pro (average rank 2.57), so both systems placed the correct diagnosis, on average, within the top three. ChatGPT4 outperformed on Recall at Rank 1, 5, and 10, while Isabel Pro outperformed at 20, 30, and 40. The Wilcoxon Signed Rank Sum Test failed to reject the null hypothesis of no difference, and the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9 %) but only 52 correct DOIs (31.5 %).
Conclusions
This study validates the promise of computerized diagnostic decision support systems, including the Large Language Model form of artificial intelligence (AI). Until the hallucination of references, and perhaps of diagnoses, is resolved in favor of absolute accuracy, clinicians will make cautious use of Large Language Model systems in diagnosis, if at all.
Introduction
Generative Pre-Trained Large Language Models (LLMs), such as ChatGPT4, became available in the spring of 2023. These systems take artificial intelligence from natural language processing and pattern recognition to a new level of human-like performance, and their use has grown remarkably [1]. Enthusiasm for these models has spread to many facets of medicine, including diagnosis [2], [3], [4], [5]. However, validation studies of these systems have been limited to a few articles with small datasets comparing their performance to clinicians or medical journal readers [2], [3], [4]. Reference [2] analyzed two cases with ChatGPT responses evaluated by senior clinicians. Reference [3] evaluated 100 cases from the Journal of Neuroradiology, each with a clinical consensus diagnosis. Reference [4] used clinical vignettes designed to test medical students and residents. Each article suggested the need for additional validation before clinical use. Before the arrival of ChatGPT4, several studies encouraged the use of computerized diagnostic decision support systems (CDDSSs) to improve diagnostic accuracy [5], [6], [7], [8]. The use of CDDSSs, particularly Isabel Pro (itself a system employing artificial intelligence in the form of natural language processing), has been shown to improve clinicians’ diagnostic accuracy in several studies [9], [10], [11], [15]. This study aims to validate the LLM ChatGPT4 as a CDDSS by comparing its performance to Isabel Pro’s on a set of 201 cases from the New England Journal of Medicine (NEJM) and the library of Dr. Charles Friedman, each case with a confirmed clinical consensus diagnosis. This dataset is roughly three to four times larger than that of any previous study and covers numerous medical specialties and patient ages. The study also analyzes a differential diagnosis listing longer than in previous studies, many of which were limited to the top 10 presentations.
Research question
Given that studies have shown a statistically significant improvement in clinicians’ diagnostic accuracy when using Isabel Pro [14], [15], does the LLM system ChatGPT4 produce a greater number of accurate diagnoses, ranked higher in presentation, than Isabel Pro?
Methods
Software systems
The study employs a commercially available diagnostic decision support system, Isabel Pro, and an LLM artificial intelligence system, ChatGPT4, to produce a differential diagnosis listing.
Isabel Pro
Founded in 1999 by Jason Maude, Isabel Healthcare, Ltd. produces medical diagnosis decision support systems for physicians and patients to improve diagnostic accuracy. One product is Isabel Pro, a CDDSS that takes minimal input: the patient’s age, sex at birth, pregnancy status if female, travel history if outside North America, and presenting symptoms selected from a drop-down list or entered manually. A proprietary search algorithm matches the symptoms against disease conditions in a proprietary, high-quality database of medical references, including the Merck Manual Professional, Cochrane Reports, and similar reference texts. The program lists the diagnoses in order of the frequency with which the symptoms appear in the reference dataset for a particular disease condition. Isabel Pro’s reference dataset is updated monthly. The study by Riches et al. [12] documented the performance of Isabel Pro against several other CDDSSs, and the study by Sibbald et al. [14] documented the improvement (7–8 %) in clinicians’ diagnostic accuracy when using Isabel Pro.
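Isabel Pro’s search and ranking algorithms are proprietary and not publicly documented. Purely as a toy illustration of the general frequency-based idea described above (using a hypothetical two-condition reference corpus, not Isabel Pro’s actual data or algorithm), a sketch might look like the following.

```python
# Toy illustration only -- Isabel Pro's search and ranking algorithms are
# proprietary. This sketch shows the general idea described in the text:
# ranking candidate conditions by how often the presenting symptoms appear
# in reference material for each condition.
from collections import Counter

# Hypothetical mini reference "dataset": condition -> symptom mentions.
reference_corpus = {
    "Renal cell carcinoma": ["mass in kidney", "weight loss", "fatigue",
                             "scrotal varicocele", "mass in kidney"],
    "Pneumonia": ["nonproductive cough", "fatigue", "fever"],
}

def rank_conditions(presenting_symptoms):
    # Score each condition by how many times the presenting symptoms
    # appear in its (toy) reference material, then sort descending.
    scores = {}
    for condition, mentions in reference_corpus.items():
        counts = Counter(mentions)
        scores[condition] = sum(counts[s] for s in presenting_symptoms)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_conditions(["mass in kidney", "fatigue", "nonproductive cough"]))
```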
ChatGPT4
ChatGPT4 is the most recent version of a Generative Pre-trained Transformer (GPT) developed by OpenAI and released on March 14, 2023. The company reports that the system can read, analyze, and generate up to 25,000 words and write code in all major programming languages. An LLM artificial intelligence system relies on a training set from which the system extracts information to perform various tasks. For this study, the task was producing a differential diagnosis. ChatGPT4 is trained on Common Crawl, a publicly available dataset and one of the most extensive text datasets. While not trained explicitly for medical diagnosis, ChatGPT4 accesses medical textbooks, medical websites, medical papers, and other reference material up to its most recent training date, January 2022.
Datasets
This study used 201 cases, each with a confirmed clinical consensus diagnosis. Thirty-six cases came from Dr. Charles P. Friedman [13], used in the Sibbald et al. study [14] and, with Dr. Friedman’s consent, in the author’s Translational Project paper [15]. The remaining 165 cases came from the NEJM: 50 cases used in a recent paper by Fritz et al. [16], 10 cases used in a paper co-authored by Dr. Mark Graber [10], and 105 cases previously published on the Isabel Healthcare website. This dataset is three to four times larger than that of any previous study and covers a wide range of patients by sex and age and a wide variety of disease conditions. Table 1 shows the study dataset demographics: a male/female ratio of 1.25, with 26 % of patients aged 50–64, 20 % aged 65 and over, 18 % aged 30–39, 13 % aged 40–49, and 12 % aged 17–29. Eight medical specialties account for 65 % of the cases: infectious diseases, neoplasms, rheumatology, respiratory, cardiovascular, hematology, endocrinology, and gastrointestinal.
Table 1: Study dataset demographics. Sex at birth and medical specialty by age range of patients (years).
 | Total | 29 days–1 year | 1–5 | 6–12 | 13–16 | 17–29 | 30–39 | 40–49 | 50–64 | 65-Over |
---|---|---|---|---|---|---|---|---|---|---|
Sex | | | | | | | | | | |
Female | 89 | 0 | 0 | 4 | 3 | 13 | 15 | 15 | 18 | 21 |
Male | 112 | 5 | 3 | 2 | 1 | 12 | 22 | 12 | 35 | 20 |
Total | 201 | 5 | 3 | 6 | 4 | 25 | 37 | 27 | 53 | 41 |
Medical specialty | | | | | | | | | | |
Infectious diseases | 41 | 1 | 1 | 0 | 0 | 7 | 9 | 2 | 10 | 11 |
Neoplasms | 26 | 0 | 2 | 1 | 0 | 0 | 6 | 3 | 8 | 6 |
Rheumatology | 13 | 0 | 0 | 0 | 1 | 1 | 2 | 4 | 3 | 2 |
Respiratory | 11 | 1 | 0 | 1 | 0 | 1 | 2 | 2 | 2 | 2 |
Cardiovascular | 11 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8 | 2 |
Hematology | 10 | 0 | 0 | 0 | 1 | 2 | 1 | 2 | 3 | 1 |
Endocrinology | 9 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 5 | 0 |
Gastrointestinal | 9 | 0 | 0 | 2 | 0 | 2 | 2 | 1 | 1 | 1 |
Neurology | 8 | 0 | 0 | 0 | 0 | 3 | 3 | 1 | 0 | 1 |
Neuromuscular | 6 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 2 | 1 |
Orthopedics | 6 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 0 |
Immunology | 6 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 3 |
Metabolic diseases | 6 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 2 |
Vascular diseases | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3 |
Gynecology | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
Psychology | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
Toxicology | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
Ear, nose and throat | 3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
Environmental | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
Obstetrics | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
Ophthalmology | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
Urology | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
Allergies | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Neoplasm/rheumatology | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Neurology/respiratory | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Totals | 201 | 5 | 3 | 6 | 4 | 25 | 37 | 27 | 53 | 41 |
Since the NEJM cases are in the public domain, an unfair advantage might accrue to ChatGPT4. However, in none of the 201 cases did ChatGPT4 list the NEJM as a source for the diagnosis presented, and when asked to find a specific NEJM case, its response was completely inaccurate as to both case information and diagnosis (see Appendix 1).
Input and output
Each system was given identical information for each case, and each was asked to produce a differential diagnosis listing in order of likelihood or match of symptoms. Isabel Pro’s input is structured, and its output is a similarly structured presentation, with the diagnoses listed in order of the frequency with which the presenting symptoms appear in the reference dataset. The input information was the patient’s age or age range, sex assigned at birth, pregnancy status (if female), travel history if outside North America, and presenting symptoms.
ChatGPT4 employs an open inquiry format that accepts unstructured requests and allows recurring exchanges on the same query topic. This study used a structured request format for ChatGPT4, mirroring the Isabel Pro input format. Each request ended with a requirement to list the diagnoses in order of likelihood, to produce at least 40 diagnoses, and to cite all reference material consulted with a complete citation, a DOI, and the basis for the ranking.
NEJM Case 17-2008 serves as an example of the input information. Both systems produced the correct diagnosis, “Renal cell carcinoma”, as the first differential presented. This procedure was followed in each of the 201 cases, recording the correct diagnosis, if produced, and its rank within the differential listing produced by each system. If the correct diagnosis failed to appear, the diagnosis was listed as N/A and ranked 50 for analytical purposes. Reference citations and DOIs were checked and marked as correct, incorrect, not found, or not produced.
Statistical methods
Table 3 tabulates the total number of correct diagnoses for each system, the number of cases in which one system’s correct diagnosis outranked the other’s, the number of cases in which the systems produced the same rank, and the number of cases in which neither system produced a correct diagnosis. The study compared the systems using the Mean Reciprocal Rank (MRR), defined as the average of the inverse of the rank of the correct diagnosis over all cases. Since there is only one correct answer for each case, the MRR reflects the average rank at which the correct answer appears for the set of cases evaluated. The study also calculated, for each system, the frequency with which the rank of the correct diagnosis was equal to or less than 1, 5, 10, 20, 30, and 40. The study refers to this figure as Recall at Rank, calculated by scoring 1 for each case whose rank is equal to or less than the evaluated rank and averaging over all cases; it is the frequency with which a correct diagnosis appears at a ranking equal to or better than the evaluated rank.
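As a minimal sketch of these two measures (not the study’s actual analysis code), the following assumes the per-case ranks of the correct diagnosis are stored as integers, with misses coded as rank 50 as described above; the example rank list is hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all cases (one correct diagnosis per case)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_rank(ranks, k):
    """Fraction of cases whose correct diagnosis appears at rank <= k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical example: ranks for five cases, one of them missed (coded as 50).
example_ranks = [1, 3, 2, 50, 10]
print(round(mean_reciprocal_rank(example_ranks), 3))   # 0.391 for this toy list
for k in (1, 5, 10, 20, 30, 40):
    print(k, recall_at_rank(example_ranks, k))
```

Note that an MRR of 0.431 corresponds to an average rank of roughly 1/0.431 ≈ 2.3, the figure reported in the Results.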
Upon completion of all 201 cases, a Wilcoxon Signed Rank Sum calculation, a nonparametric hypothesis test, assessed whether the ranks assigned to the correct diagnosis by the two systems differed significantly (see Table 4).
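A hedged sketch of such a paired test is shown below, assuming SciPy’s implementation of the Wilcoxon signed-rank test and hypothetical paired rank lists; the study’s actual computation may differ.

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-case ranks (misses coded as 50), one entry per case.
isabel_ranks = [1, 2, 5, 50, 3]
chatgpt_ranks = [2, 4, 1, 45, 10]

# Two-sided signed-rank test on the paired differences, as reported in Table 4.
stat, p_value = wilcoxon(isabel_ranks, chatgpt_ranks, alternative="two-sided")
print(stat, p_value)   # p > 0.05 would fail to reject the null hypothesis
```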
Ethical aspects
The cases evaluated contain no personal health information. All NEJM cases are publicly available on the publication’s website, NEJM.org. The author used the cases from Dr. Charles Friedman with permission. The author reports no conflicts of interest, and the study required no outside funding sources.
Results
Qualitative results
Table 2 contains an alphabetical list of the diagnoses missed by each system. As might be expected, the missed diagnoses are often conditions with very low, if not rare, incidence or where symptoms develop over time.
Table 2: Missed diagnoses.

Diagnoses missed by Isabel Pro | Diagnoses missed by ChatGPT4 |
---|---|
Quantitative results
Overall, the two systems performed comparably on the 201 cases, with Isabel Pro producing 175 correct diagnoses (87.1 %) and ChatGPT4 165 (82.1 %). ChatGPT4’s correct diagnosis outranked Isabel Pro’s in 83 cases, Isabel Pro’s outranked ChatGPT4’s in 75 cases, and the two systems produced the same rank in 37 cases. ChatGPT4 more often returned the correct diagnosis at rank 1 and within the top 5 and top 10, while Isabel Pro more often returned it within the top 20, top 30, and top 40 and ultimately returned more correct diagnoses (see Table 3). The MRR measures the average rank at which a system returns the single correct diagnosis, making it a relevant yardstick for these two systems. ChatGPT4 produced an MRR of 0.431, the correct diagnosis appearing at an average ranking of 2.32, compared with Isabel Pro at 0.389, the correct diagnosis appearing at an average ranking of 2.57. Both systems presented the correct diagnosis at an average ranking of three or better (Figures 1 and 2).
Table 3: Diagnostic accuracy – comparison of results by system.
Total cases | 201 | Total cases | 201 |
---|---|---|---|
Isabel Pro out-ranks ChatGPT4 | 75 | ChatGPT4 out-ranks Isabel Pro | 83 |
Both correct tied | 37 | Both correct tied | 37 |
Neither right | 6 | Neither right | 6 |
Isabel Pro correct diagnoses | 175 | ChatGPT4 correct diagnoses | 164 |

Figure 1: Diagnostic retrieval accuracy comparison by recall at rank.

Figure 2: Case sequence recall at rank comparison: Isabel Pro vs. ChatGPT4.
System equivalence
The Wilcoxon Signed Rank Sum Test assesses the median difference between two paired, nonparametric samples. Here the paired samples are the ranks assigned to the correct diagnosis by Isabel Pro and by ChatGPT4, and the null hypothesis postulates no difference between the two. Rejecting the null hypothesis would require a p-value equal to or less than 0.05. Table 4 shows the results of all three hypothesis tests, each exceeding a p-value of 0.05 by a wide margin. The Wilcoxon test therefore fails to reject the null hypothesis: the data provide no evidence that the two systems’ rank distributions differ beyond chance. If the sample estimate of the effect size (r=0.01) is accurate in the general population, the analysis would require more than 80,000 cases to reach statistical significance and conclude that the systems are not equivalent. Without sufficient cases, the study can neither definitively reject nor accept the null hypothesis.
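The “more than 80,000 cases” figure is consistent with a standard back-of-envelope power calculation based on the normal approximation z ≈ r√N. The following sketch reproduces that arithmetic under assumed conventional values (two-sided α=0.05, power 0.80); these values are assumptions for illustration, not figures stated in the text.

```python
# Hypothetical back-of-envelope check of the sample-size claim, assuming the
# normal-approximation relation z = r * sqrt(N) for the signed-rank statistic.
from scipy.stats import norm

r = 0.01                           # effect size reported in Table 4
alpha, power = 0.05, 0.80          # assumed conventional two-sided alpha and power
z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96
z_beta = norm.ppf(power)           # ~0.84
n_required = ((z_alpha + z_beta) / r) ** 2
print(round(n_required))           # ~78,500 cases, on the order of the >80,000 quoted
```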
Table 4: Wilcoxon signed rank sum hypothesis test.
Null hypothesis | Alternative hypothesis | W | z | p | r |
---|---|---|---|---|---|
The variable ChatGPT4 rank has smaller or equal values than the variable Isabel Pro rank | The variable ChatGPT4 rank has larger values than the variable Isabel Pro rank | 6202.5 | −0.14 | 0.446 | 0.01 |
The variable Isabel Pro rank has smaller or equal values than the variable ChatGPT4 rank | The variable Isabel Pro rank has larger values than the variable ChatGPT4 rank | 6202.5 | −0.14 | 0.554 | 0.01 |
There is no difference between the variables Isabel Pro rank and ChatGPT4 rank | There is a difference between the variables Isabel Pro rank and ChatGPT4 rank | 6202.5 | −0.14 | 0.892 | 0.01 |
Diagnosis ranking explanation
A particularly concerning issue is the question, “How was a diagnosis determined?” Isabel Pro uses the frequency with which presenting symptoms appear in the medical reference database to rank the differential listing. This study asked ChatGPT4 to list the diagnoses in order of likelihood, to cite the references consulted in producing each diagnosis, and to state the basis for the ranking of each diagnosis. The responses, however, were short and inadequate to explain the mechanism for producing and ranking a diagnosis. The responses to the request “provide the basis for the ranking of the diagnosis” were uniformly brief, such as “It is often the case that … ” followed by a statement that a symptom “often” indicates a particular disease condition. In almost every response, only one reference was listed, so no accurate conclusion could be drawn about the mechanism for ranking the differential.
Reference ‘hallucination’
A study by Walters and Wilder examined the propensity for ChatGPT to “hallucinate” or fabricate bibliographic references [17]. For this study, the author searched the literature to find and confirm the validity of both the reference and the DOI for each reference cited by ChatGPT4. While 145 references were confirmed correct (87.9 %), only 52 DOIs were correct (31.5 %).
Hallucination, or more correctly, fabrication, fell into two categories: first, the article was cited smoothly, with authors, journal, volume, issue, and page numbers, but much of the information was incorrect; second, the cited article could not be found at all despite a diligent internet search. As noted, ChatGPT4 produced only 145 (87.9 %) correct references, and these could be confirmed only by a thorough internet search. ChatGPT4 gave no indication of which references were genuine, so every reference had to be checked. Confidence in the accuracy of a given diagnosis is significantly eroded as a result.
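The verification in this study was performed manually. Purely as an illustration of how such DOI checking could be automated (a hypothetical helper using the public Crossref REST API, not part of the study’s methods), a sketch follows.

```python
# Illustrative only: the study verified references by manual literature search.
# This hypothetical helper spot-checks whether a DOI resolves to a metadata
# record via the public Crossref REST API (https://api.crossref.org/works/<doi>).
import requests

def doi_exists(doi: str) -> bool:
    """Return True if Crossref has a metadata record for the given DOI."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# The genuine DOI for NEJM Case 17-2008 quoted later in this article:
print(doi_exists("10.1056/NEJMcpc0802449"))
```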
Discussion
When asked to produce a list of diagnoses based on demographics and presenting symptoms, Isabel Pro and ChatGPT4 achieved very similar diagnostic accuracy. While Isabel Pro’s reference dataset and ranking algorithm are proprietary, its reference dataset is derived from highly respected sources and has been updated monthly for some 25 years. Its ranking system is straightforward and clearly stated: the frequency with which the presenting symptoms appear in the literature for a given disease condition. In contrast, ChatGPT4’s ranking methodology is unknown, the system offered no satisfactory explanation for the basis of its rankings, and the references it cites to accompany its diagnoses are far too often fabricated.
Regarding response speed, Isabel Pro’s differential diagnosis listing appears within seconds, most often immediately upon submission, a significant feature in today’s hurried primary care practices. ChatGPT4, on the other hand, types out its differential in batches of roughly 10 diagnoses and requires repeated requests to extend the list, a time-consuming process that can take 10 minutes or longer. Voice commands help only modestly with the speed of retrieval.
Rather than comparing against a relatively small group of clinicians or journal readers, this study compares ChatGPT4 to a computerized diagnostic decision support system, Isabel Pro, shown in several studies to improve clinicians’ diagnostic accuracy by 7–9 % [10], [11]. The dataset of cases is drawn from highly regarded sources (the New England Journal of Medicine and the University of Michigan Medical School) and is three to four times larger than those of previous studies. It also covers a broader range of medical specialties and of patient sex and age, and each case carries a confirmed clinical consensus diagnosis for validation purposes. The study evaluates the two systems with established statistical comparison methods – Mean Reciprocal Rank, Recall at Rank, and the Wilcoxon Signed Rank Sum Test – and extends the comparisons beyond the top 10 diagnoses presented. Finally, the study highlights a significant weakness of ChatGPT4 in the fabrication of references, a failing that should preclude its use in clinical practice until corrected.
Estimates of the number of disease conditions run as high as 10,000, with symptoms estimated at 200. Even a dataset of 201 cases cannot capture the full range of diagnostic challenges. For example, the NEJM cases are diagnostically challenging, but a new case is presented every two weeks with little or no duplication of disease conditions over several decades. This study does not address the usage of either of these systems in routine clinical practice, nor does any previous study.
Previous studies have assessed the performance of ChatGPT4 compared to clinicians or journal readers but have not evaluated which computerized diagnostic decision support system might be the better choice for the diagnosing clinician; the implicit assumption is that ChatGPT4 is the option of choice for computerized assistance. Earlier studies have also often limited the differential listing to the top 10, an arbitrary limit that, in the author’s opinion, unnecessarily hampers the diagnostic process: restricting the listing to the top 10 reduces the diagnostic accuracy of Isabel Pro and ChatGPT4 to 65 % and 69 %, compared with 78 % and 76 % in the top 20 and 87 % and 82 % in the top 40. A differential listing of 40 rather than 10 thus offers roughly a 33 % improvement in diagnostic accuracy. While a differential list of more than 20 possibilities might discourage a novice, ruling the many alternatives in or out is a task at which the experienced physician is quite adept [18].
The main takeaway from this study is concern about how ChatGPT4 derives its differential diagnostic listing. The process is a “black box”, unstated and unclear, and the system often fabricates references, which suggests great caution in using it for diagnosis. Fabricated references are a relatively common complaint among those using and studying AI systems such as ChatGPT4 [17], [19], but in diagnosis such fabrication is a serious failing that cannot be tolerated. ChatGPT4 or similar AI systems are unlikely to be used by clinicians until the system is specifically trained on the best medical reference sources and validated on enough cases to confirm its diagnostic accuracy. A clinician using Isabel Pro can click on a given diagnosis and be taken instantly to the Merck Manual Professional or to another chosen medical reference source; even a dozen or more diagnostic alternatives can be verified in just a few minutes. Not so with ChatGPT4, which unequivocally fabricates references even when the accurate diagnosis is known and presented. One can only imagine the clinical difficulty of relying on such a system when diagnosing a yet unknown condition.
Both systems failed to diagnose several cases correctly. A future study might concentrate on those cases to determine the causes and propose improvements to address those failings. A future study might also concentrate on rare diseases to determine the capability of these systems to diagnose such conditions, a capability that might be most helpful to the primary care physician. Methods of integrating these systems into routine clinical practice were not a goal of this study, but smooth integration into the clinical workflow is a crucial system component yet to be solved. For ChatGPT4, usage in clinical diagnosis will be unlikely until the issue of fabricating references is resolved in favor of accurate and unequivocal citations.
Conclusions
Isabel Pro and ChatGPT4 displayed roughly comparable performance in producing differential diagnosis lists that contained the correct diagnosis. The diagnostic retrieval accuracy of both systems approached 90 %. Both Isabel Pro and ChatGPT4 have the elements needed for a CDDSS to warrant use by clinicians: retrieval of accurate diagnoses, speedy response to inquiries, and presentation of accurate and comprehensive differential diagnosis lists. Both systems suffer from the requirement for re-entry of patient demographics and presenting symptoms to produce the differential.
Isabel Pro is a finely crafted system that is easy to use, fast, and accurate and allows for quickly ruling diagnoses in or out by linking each presented diagnosis to the best medical reference for rapid consultation. Isabel Pro is commercially available, relatively inexpensive, and time-tested for over 25 years.
ChatGPT4 showed impressive performance in diagnosing the cases used in this study despite no explicit training on the best and most current medical reference sources. However, the “black box” nature of the LLM’s methods is unappealing to the clinician, and its frequent fabrication of references limits trust in the accuracy of the results. This study attempted to address the issue by requesting that the system cite the reference consulted in producing each diagnosis, with a complete citation, DOI, and basis for the ranking. While helpful for many diagnoses, the system often fabricated citations and, even more often, fabricated DOIs. These failings cast doubt on a clinician’s willingness to rely on the results in a clinical setting where time is short and the diagnosis is yet to be confirmed. The time required to wait for the sequential typing of the differential listing, and the repeated requests needed to complete it, will further hinder the system’s use in clinical settings.
This study highlights and validates the promise of CDDSSs in reducing diagnostic inaccuracy in medicine. This study also highlights and validates the promising future of the Large Language Model form of artificial intelligence in diagnosis. However, until the advent of an LLM trained on and with continuous access to the best and most recent medical reference material and until the issue of reference citations resolves in favor of strict and unequivocal accuracy, clinicians will likely make cautious use of artificial intelligence systems in diagnosis, if at all.
CDDSSs are not perfect, but studies show they improve clinicians’ diagnostic accuracy regardless of experience level. The search for higher accuracy in a CDDSS is continuing and should continue since diagnostic accuracy among physicians in the US is estimated to be only 85–95 %, suggesting 12 million inaccurate diagnoses annually at the low end of the estimate. Anything that can significantly improve the accuracy of what is arguably the most challenging task undertaken by humans and one that, if incorrect, results in patient harm, even death, is well worth pursuing with as much effort as possible.
Acknowledgments
The author wishes to acknowledge Dr. Xiaoqian Jiang, Professor, Dr. Han Chen, Associate Professor, and Dr. Kirk Roberts, Associate Professor, for their assistance in statistical comparison methods.
- Research ethics: This study required no funding source. The author notes no conflicts of interest. The study included no Personal Health Information in any of the cases. No ethical approvals were necessary.
- Informed consent: Not applicable.
- Author contributions: The author has accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: The author states no conflict of interest.
- Research funding: None declared.
- Data availability: The raw data can be obtained on request from the corresponding author.
Appendix 1

Please find New England Journal of Medicine Case 2-2008.
ChatGPT
I found a reference to Case 2-2008 from the New England Journal of Medicine (NEJM) related to rheumatic mitral stenosis associated with partial anomalous pulmonary venous return (VIVO). However, for detailed information or to access the full article, you would need to visit the NEJM’s website or access their archives, as direct links or full article contents were not provided in the search results.
Was there a final diagnosis in the reference you found and, if so, what was it?
ChatGPT
Yes, the final diagnosis in Case 2-2008 from the New England Journal of Medicine was “rheumatic mitral stenosis associated with partial anomalous pulmonary venous return” (VIVO). This diagnosis points to a condition where the mitral valve is narrowed (mitral stenosis), typically due to rheumatic fever, and is accompanied by a congenital heart defect where some of the pulmonary veins (which carry oxygenated blood from the lungs back to the heart) are connected to the heart in an abnormal location (partial anomalous pulmonary venous return).
However, New England Journal of Medicine Case 2-2008 is actually:
Case 2-2008: a 38-year-old woman with postpartum visual loss, shortness of breath, and renal failure, with a final diagnosis of:
Postpartum renal failure due to thrombotic microangiopathy, associated with antiphospholipid antibodies.
New England Journal of Medicine Case 17-2008 – A 63-year-old man with metastatic renal-cell carcinoma (N Engl J Med 2008;358:2389–96. DOI: 10.1056/NEJMcpc0802449):
Isabel Pro entries
Adult 50–64 years, male, travel limited to North America
Mass in kidney, pulmonary nodules, nausea, loss of appetite, abdominal pain, constipation, fatigue, weight loss, scrotal varicocele, nonproductive cough
ChatGPT4 entries
Prepare a differential diagnosis listing for a 63-year-old male patient, travel limited to North America, presenting with mass in kidney, pulmonary nodules, nausea, loss of appetite, abdominal pain, constipation, fatigue, weight loss, scrotal varicocele, nonproductive cough. List all likely diagnoses, but at least 40, in order of likelihood. List all references used in preparing the differential and ensure that all references are the most current in your dataset.
References
1. Wachter, RM, Brynjolfsson, E. Will generative artificial intelligence deliver on its promise in health care? JAMA 2024;331:65–9. https://doi.org/10.1001/jama.2023.25054.
2. Liu, X, Song, Y, Lin, H, Xu, Y, Chen, C, Yan, C, et al. Evaluating ChatGPT as an adjunct for analyzing challenging case. Blood 2023;142:7273. https://doi.org/10.1182/blood-2023-181518.
3. Horiuchi, D, Tatekawa, H, Shimono, T, Walston, SL, Takita, H, Matsushita, S, et al. Accuracy of ChatGPT generated diagnosis from patient’s medical history and imaging findings in neuroradiology cases. Neuroradiology 2024;66:73–9. https://doi.org/10.1007/s00234-023-03252-4.
4. Hailu, R, Beam, A, Mehrotra, A. ChatGPT-assisted diagnosis: is the future suddenly here? STAT 2023.
5. Dave, T, Athaluri, SA, Singh, S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell 2023;6:1169595. https://doi.org/10.3389/frai.2023.1169595.
6. Hirosawa, T, Kawamura, R, Harada, Y, Mizuta, K, Tokumasu, K, Kaji, Y, et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med Inform 2023;11:e48808. https://doi.org/10.2196/48808.
7. Eriksen, AV, Möller, S, Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 2023;1. https://doi.org/10.1056/aip2300031.
8. Kanjee, Z, Crowe, B, Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 2023;330:78. https://doi.org/10.1001/jama.2023.8288.
9. Graber, ML, Mathew, A. Performance of a web-based clinical diagnosis support system for internists. J Gen Intern Med 2008;23:37–40. https://doi.org/10.1007/s11606-007-0271-8.
10. Bond, WF, Schwartz, LM, Weaver, KR, Levick, D, Giuliano, M, Graber, ML. Differential diagnosis generators: an evaluation of currently available computer programs. J Gen Intern Med 2012;27:213–9. https://doi.org/10.1007/s11606-011-1804-8.
11. Graber, ML. Reaching 95 %: decision support tools are the surest way to improve diagnosis now. BMJ Qual Saf 2022;31:415–8. https://doi.org/10.1136/bmjqs-2021-014033.
12. Riches, N, Panagioti, M, Alam, R, Cheraghi-Sohi, S, Campbell, S, Esmail, A, et al. The effectiveness of electronic differential diagnoses (DDX) generators: a systematic review and meta-analysis. PLoS One 2016;11:e0148991. https://doi.org/10.1371/journal.pone.0148991.
13. Friedman, CP, Elstein, AS, Wolf, FM, Murphy, GC, Franz, TM, Heckerling, PS, et al. Enhancement of clinicians’ diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA 1999;282:1851. https://doi.org/10.1001/jama.282.19.1851.
14. Sibbald, M, Monteiro, S, Sherbino, J, LoGiudice, A, Friedman, C, Norman, G. Should electronic differential diagnosis support be used early or late in the diagnostic process? A multicentre experimental study of Isabel. BMJ Qual Saf 2021. https://doi.org/10.1136/bmjqs-2021-013493.
15. Bridges, JM. Evaluation, validation, and implementation of a computerized diagnostic decision support system in primary care. Houston, Texas: University of Texas Health Science Center at Houston D. Bradley McWilliams School of Biomedical Informatics; 2022.
16. Fritz, P, Kleinhans, A, Raoufi, R, Sediqi, A, Schmid, N, Schricker, S, et al. Evaluation of medical decision support systems (DDX generators) using real medical cases of varying complexity and origin. BMC Med Inf Decis Mak 2022;22:254. https://doi.org/10.1186/s12911-022-01988-2.
17. Walters, WH, Wilder, EI. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 2023;13:14045. https://doi.org/10.1038/s41598-023-41032-5.
18. Graber, ML. Reaching 95 %: decision support tools are the surest way to improve diagnosis now. BMJ Qual Saf 2022;31:415–8. https://doi.org/10.1136/bmjqs-2021-014033.
19. Liu, J, Wang, C, Liu, S. Utility of ChatGPT in clinical practice. J Med Internet Res 2023;25:e48568. https://doi.org/10.2196/48568.
© 2024 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.