Medical language matters: impact of clinical summary composition on a generative artificial intelligence’s diagnostic accuracy

Cassandra Skittle; Eliana Bonifacino; Casey N. McQuade

doi:10.1515/dx-2024-0167

Published/Copyright: December 12, 2024

From the journal Diagnosis Volume 12 Issue 2

Abstract

Objectives

Evaluate the impact of problem representation (PR) characteristics on Generative Artificial Intelligence (GAI) diagnostic accuracy.

Methods

Internal medicine attendings and residents from two academic medical centers were given a clinical vignette and instructed to write a PR. Deductive content analysis described the characteristics comprising each PR. Each PR was input individually into ChatGPT-4 (OpenAI, September 2023), which was prompted to generate a ranked three-item differential. The ranked differential and the top-ranked diagnosis were each scored on a three-point scale: incorrect, partially correct, or correct. Logistic regression evaluated the impact of individual PR characteristics on ChatGPT accuracy.

Results

For a three-item differential, accuracy was associated with including fewer comorbidities (OR 0.57, p=0.010), fewer past historical items (OR 0.60, p=0.019), and more physical examination items (OR 1.66, p=0.015). For ChatGPT’s ability to rank the true diagnosis as the single-best diagnosis, accuracy correlated with utilizing temporal semantic qualifiers (OR 3.447, p=0.046), using more semantic qualifiers overall (OR 1.300, p=0.005), and adhering to a typical 3-part PR format (OR 3.577, p=0.020).
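
As a reading aid for the odds ratios above, the following is a minimal Python sketch of how an odds ratio relates to a logistic regression coefficient and to a change in probability. It assumes each odds ratio applies per one additional item in the PR (for example, per additional comorbidity listed), which the abstract implies but does not state. The function names and the 50 % baseline probability are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math


def log_odds_coefficient(odds_ratio: float) -> float:
    """Return the logistic regression coefficient (log-odds) for an odds ratio."""
    return math.log(odds_ratio)


def compounded_odds_ratio(odds_ratio: float, units: int) -> float:
    """Odds ratio for a change of `units` in the predictor.

    Assumes the reported odds ratio is per one-unit change, so effects
    multiply: OR for k units = OR ** k.
    """
    return odds_ratio ** units


def apply_odds_ratio(baseline_probability: float, odds_ratio: float) -> float:
    """Convert a baseline probability to the probability implied by an odds ratio."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


if __name__ == "__main__":
    # Odds ratios as reported in the Results for the three-item differential.
    reported = {
        "comorbidities": 0.57,
        "past historical items": 0.60,
        "physical examination items": 1.66,
    }
    baseline = 0.5  # hypothetical baseline accuracy, not a study result
    for name, odds_ratio in reported.items():
        coefficient = log_odds_coefficient(odds_ratio)
        probability = apply_odds_ratio(baseline, odds_ratio)
        print(f"{name}: coefficient={coefficient:.3f}, "
              f"probability after one more item={probability:.3f}")
```

Under this reading, an odds ratio below 1 lowers the odds of a correct differential with each added item, and an odds ratio above 1 raises them. The size of the probability change depends on the baseline, which the abstract does not report.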

Conclusions

Several distinct PR factors improved ChatGPT diagnostic accuracy. These factors have previously been associated with expertise in creating PRs. Future studies should prospectively explore how the qualities of clinical input affect GAI diagnostic accuracy.

Keywords: problem representation; generative artificial intelligence; chatGPT; prompt engineering; clinical reasoning

Corresponding author: Cassandra Skittle, MD, MBA, University of Colorado Anschutz Medical Campus, 12401 East 17th Avenue, 4th Floor 80045, Aurora, CO, USA, E-mail: Cassandra.skittle@cuanschutz.edu

Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Use of Large Language Models, AI and Machine Learning Tools: None declared.
Conflict of interest: The authors state no conflict of interest.
Research funding: None declared.
Data availability: Not applicable.

Appendix

Vignette

Chief Complaint: “I’m having trouble breathing.”

History of Present Illness: A 47-year-old Caucasian woman is admitted with new-onset shortness of breath.

The patient states that 3 months ago, she started noticing that she was getting winded going up the stairs in her house. Since then, her shortness of breath has gotten worse and she now gets short of breath with just walking from her bedroom to the bathroom. She has noticed that her pants feel tighter than usual and that her stomach is bloated. Her ankles also feel swollen. Over the last 3 days, she started noticing some sharp, nonradiating chest pains under her breastbone with more strenuous activities. These latest symptoms prompted her presentation to the emergency department.

Overall, she has been trying to lose weight and does not think she is pregnant. She has occasionally felt lightheaded during exertion but denies feeling short of breath with lying down or awaking from sleep feeling dyspneic. She denies fevers, chills, nausea, vomiting, diarrhea, abdominal pain, or blood in her stool. She has a chronic cough rarely productive of scant sputum (what she calls her “smoker’s cough”) which is unchanged from baseline.

Upon presentation to the emergency department, she reports feeling some mild dyspnea at rest but no chest pain.

Past Medical History: [Items relevant to a pulmonary hypertension diagnosis: hypertension, tobacco use]

Anxiety; Hypertension; Opioid use disorder with injection heroin use, in remission; Polycystic ovarian syndrome; Vasovagal syncope.

Past Surgical History: Cholecystectomy 5 years ago.

Family History: Mother: hypertension; Father: prostate cancer, coronary artery disease (no myocardial infarction history); No other history of malignancy.

Social History: Lives in rural Pennsylvania with her mother. Works as a waitress. Currently smokes tobacco cigarettes, 2 packs per day for the last 30 years. She denies vaping and alcohol use. She reports abstinence from recreational drugs for 15 years.

Review of Systems: As mentioned per HPI.

Medications: Losartan; Suboxone

Physical Examination:

Temp: 36.6 °C, BP: 145/86, Pulse: 102, RR: 18, SpO2: 92% on 4 L/min, BMI: 31.6

General: Appears stated age, no acute distress

HEENT: Sclerae are anicteric, moist mucosae, oropharynx clear. Facial plethora.

Lymph: No palpable lymphadenopathy.

Pulmonary: Lungs are clear to auscultation, symmetric chest expansion. Use of accessory muscles of respiration noted.

Cardiac: Regular rate and rhythm. No murmurs, rubs, or gallops. JVP is 13 cm at 30°. 2+ pitting edema bilaterally to knees.

Gastrointestinal: Normoactive bowel sounds. Nontender. Distention present with a fluid wave. No palpable organomegaly.

Extremities: Bilateral digital clubbing of the hands.

Neurologic: Alert and Oriented ×3. No asterixis. CN 2–12 are intact. 5/5 strength throughout. Gait normal but evaluation is limited as patient becomes visibly dyspneic.

Other Studies:

Na: 133; Cl: 82; K: 4.2; CO2: 42; BUN: 9; Cr: 0.6; Glu: 126

WBC: 9.2; Hgb: 13.8; Plt: 205; INR: 1.3

ALT: 8; AST: 13; Alk Phos: 135; Alb: 3.4; TP: 6.6; Total Bilirubin: 1.9

Urine pregnancy testing: negative; Troponin: negative

EKG: Sinus tachycardia with poor R-wave progression and inferior T-wave inversions that are new since last EKG.

Chest X-ray, PA and Lateral views: No acute pulmonary disease. Heart is of normal size.

Received: 2024-10-28

Accepted: 2024-11-05

Published Online: 2024-12-12
