Abstract
Objectives
Evaluate the impact of problem representation (PR) characteristics on Generative Artificial Intelligence (GAI) diagnostic accuracy.
Methods
Internal medicine attendings and residents from two academic medical centers were given a clinical vignette and instructed to write a PR. Deductive content analysis was used to describe the characteristics of each PR. Each PR was entered into ChatGPT-4 (OpenAI, September 2023), which was prompted to generate a ranked three-item differential diagnosis. The ranked differential and the top-ranked diagnosis were each scored on a three-point scale: incorrect, partially correct, or correct. Logistic regression evaluated the impact of each PR characteristic on ChatGPT's accuracy.
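The abstract does not state how the three-point score was entered into the logistic regression or which software was used. The following is a minimal sketch that assumes a binary outcome (for example, top-ranked diagnosis correct vs. not) and a single binary PR characteristic. It fits a univariable logistic regression by Newton-Raphson and reports the odds ratio as exp(b1). The function names `fit_logistic` and `odds_ratio` and the 2×2 counts are hypothetical illustrations, not study data or the authors' method.

```python
import math


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Score vector (gradient of the log-likelihood) and information matrix
        g0 = g1 = 0.0
        h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system (information) * delta = score
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


def odds_ratio(x, y):
    """Odds ratio for a one-unit increase in x, i.e. exp(b1)."""
    _, b1 = fit_logistic(x, y)
    return math.exp(b1)


# Hypothetical 2x2 counts (NOT study data):
# x = 1 if the PR used a three-part format, y = 1 if the top diagnosis was correct.
a, b = 10, 5   # three-part format: correct, incorrect
c, d = 8, 12   # other format: correct, incorrect
x = [1] * (a + b) + [0] * (c + d)
y = [1] * a + [0] * b + [1] * c + [0] * d

or_fit = odds_ratio(x, y)
or_table = (a * d) / (b * c)   # cross-product ratio of the 2x2 table
print(f"OR from logistic regression: {or_fit:.3f}")
print(f"OR from 2x2 table:          {or_table:.3f}")
```

For a single binary predictor, the maximum-likelihood odds ratio equals the 2×2 cross-product ratio (3.000 for these toy counts), which gives a quick sanity check. A count predictor, such as the number of semantic qualifiers, can be passed as `x` unchanged, and its odds ratio then applies per additional item.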
Results
For the three-item differential, accuracy was associated with including fewer comorbidities (OR 0.57, p=0.010), fewer past historical items (OR 0.60, p=0.019), and more physical examination items (OR 1.66, p=0.015). For ChatGPT's ability to rank the true diagnosis as the single best diagnosis, accuracy was associated with using temporal semantic qualifiers (OR 3.447, p=0.046), using more semantic qualifiers overall (OR 1.300, p=0.005), and adhering to a typical three-part PR format (OR 3.577, p=0.020).
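In a logistic model, an odds ratio for a count predictor compounds multiplicatively. The phrase "more semantic qualifiers overall" suggests, but the abstract does not state, that the qualifier count entered the model as a linear term. Assuming so, the reported OR of 1.300 per qualifier would scale as follows for k additional qualifiers, with β denoting the fitted log-odds coefficient:

```latex
\mathrm{OR}_{k} = e^{k\beta} = \left(e^{\beta}\right)^{k} = 1.300^{k},
\qquad \mathrm{OR}_{3} = 1.300^{3} \approx 2.20
```

This illustrates how the coefficient scales. It multiplies the odds, not the probability, of a correct top-ranked diagnosis, and it is not an additional study result.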
Conclusions
Several distinct PR characteristics were associated with greater ChatGPT diagnostic accuracy. These characteristics have previously been associated with expertise in PR synthesis. Future studies should prospectively explore how the qualities of clinical input affect GAI diagnostic accuracy.
- Research ethics: Not applicable.
- Informed consent: Not applicable.
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Use of Large Language Models, AI and Machine Learning Tools: None declared.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: None declared.
- Data availability: Not applicable.
Vignette
Chief Complaint: “I’m having trouble breathing.”
History of Present Illness: A 47-year-old Caucasian woman is admitted with new-onset shortness of breath.
The patient states that 3 months ago, she started noticing that she was getting winded going up the stairs in her house. Since then, her shortness of breath has gotten worse and she now gets short of breath with just walking from her bedroom to the bathroom. She has noticed that her pants feel tighter than usual and that her stomach is bloated. Her ankles also feel swollen. Over the last 3 days, she started noticing some sharp, nonradiating chest pains under her breastbone with more strenuous activities. These latest symptoms prompted her presentation to the emergency department.
She has been trying to lose weight and does not think she is pregnant. She has occasionally felt lightheaded during exertion but denies shortness of breath when lying down or waking from sleep feeling dyspneic. She denies fevers, chills, nausea, vomiting, diarrhea, abdominal pain, or blood in her stool. She has a chronic cough that is rarely productive of scant sputum (what she calls her “smoker’s cough”) and is unchanged from baseline.
Upon presentation to the emergency department, she reports feeling some mild dyspnea at rest but no chest pain.
Past Medical History: [Items relevant to a pulmonary hypertension diagnosis: hypertension, tobacco use]
Anxiety; Hypertension; Opioid use disorder with injection heroin use, in remission; Polycystic ovarian syndrome; Vasovagal syncope.
Past Surgical History: Cholecystectomy 5 years ago.
Family History: Mother: hypertension; Father: prostate cancer, coronary artery disease (no myocardial infarction history); No other history of malignancy.
Social History: Lives in rural Pennsylvania with her mother. Works as a waitress. Currently smokes tobacco cigarettes, 2 packs per day for the last 30 years. She denies vaping and alcohol use. She reports abstinence from recreational drugs for 15 years.
Review of Systems: As mentioned per HPI.
Medications: Losartan; Suboxone
Physical Examination:
Temp: 36.6 °C, BP: 145/86, Pulse: 102, RR: 18, SpO2: 92% on 4 L/min, BMI: 31.6
General: Appears stated age, no acute distress
HEENT: Sclera are anicteric, moist mucosae, oropharynx clear. Facial plethora.
Lymph: No palpable lymphadenopathy.
Pulmonary: Lungs are clear to auscultation, with symmetric chest expansion. Use of accessory muscles of respiration noted.
Cardiac: Regular rate and rhythm. No murmurs, rubs, or gallops. JVP is 13 cm at 30°. 2+ pitting edema bilaterally to knees.
Gastrointestinal: Normoactive bowel sounds. Nontender. Distention present with a fluid wave. No palpable organomegaly.
Extremities: Bilateral digital clubbing of the hands.
Neurologic: Alert and Oriented ×3. No asterixis. CN 2–12 are intact. 5/5 strength throughout. Gait normal but evaluation is limited as patient becomes visibly dyspneic.
Other Studies:
Na: 133; Cl: 82; K: 4.2; CO2: 42; BUN: 9; Cr: 0.6; Glu: 126
WBC: 9.2; Hgb: 13.8; Plt: 205; INR: 1.3
ALT: 8; AST: 13; Alk Phos: 135; Alb: 3.4; TP: 6.6; Total Bilirubin: 1.9
Urine pregnancy testing: negative; Troponin: negative
EKG: Sinus tachycardia with poor R-wave progression and inferior T-wave inversions that are new since last EKG.
Chest X-ray, PA and Lateral views: No acute pulmonary disease. Heart is of normal size.
© 2024 Walter de Gruyter GmbH, Berlin/Boston