
Validity and reliability of Brier scoring for assessment of probabilistic diagnostic reasoning

Nathan Stehouwer, Anastasia Rowland-Seymour, Larry Gruppen, Jeffrey M. Albert and Kelli Qua
Published/Copyright: October 16, 2024

Abstract

Objectives

Educators need tools for the assessment of clinical reasoning that reflect the ambiguity of real-world practice and measure learners’ ability to determine diagnostic likelihood. In this study, the authors describe the use of the Brier score to assess and provide feedback on the quality of probabilistic diagnostic reasoning.

Methods

The authors describe a novel format called Diagnostic Forecasting (DxF), in which participants read a brief clinical case and assign a probability to each item on a differential diagnosis, order tests and select a final diagnosis. DxF was piloted in a cohort of senior medical students. DxF evaluated students’ answers with Brier scores, which compare probabilistic forecasts with case outcomes. The validity of Brier scores in DxF was assessed by comparison to subsequent decision-making in the game environment of DxF, as well as external criteria including medical knowledge tests and performance on clinical rotations.
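For reference, the standard Brier score compares assigned probabilities with the observed case outcome: lower scores indicate better-calibrated forecasts. The minimal Python sketch below illustrates the multi-category form for a single case; the exact scoring implementation used in DxF is not described in this abstract, and the function name, diagnosis labels, and probabilities are illustrative assumptions.

```python
# Minimal sketch of a multi-category Brier score for one case, following the
# standard Brier (1950) formulation. The scoring rule actually implemented in
# DxF is not specified here; names and values are illustrative assumptions.

def brier_score(forecast: dict[str, float], true_diagnosis: str) -> float:
    """Sum of squared differences between forecast probabilities and the
    observed outcome (1 for the confirmed diagnosis, 0 for all others)."""
    return sum(
        (p - (1.0 if dx == true_diagnosis else 0.0)) ** 2
        for dx, p in forecast.items()
    )

# Example: a student distributes probability across a differential diagnosis.
differential = {"pulmonary embolism": 0.6, "pneumonia": 0.3, "heart failure": 0.1}
print(brier_score(differential, "pneumonia"))  # 0.36 + 0.49 + 0.01 = 0.86
```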

Results

Brier scores showed statistically significant correlations with diagnostic accuracy (95 % CI −4.4 to −0.44) and with mean scores on National Board of Medical Examiners (NBME) shelf exams (95 % CI −474.6 to −225.1). Brier scores did not correlate with clerkship grades or with performance on a structured clinical skills exam. Reliability, as measured by within-student correlation, was low.

Conclusions

Brier scoring showed evidence for validity as a measure of medical knowledge and a predictor of clinical decision-making. Further work should evaluate the ability of Brier scores to predict clinical and workplace-based outcomes and develop reliable approaches to measuring probabilistic reasoning.


Corresponding author: Nathan Stehouwer, MD, University Hospitals Cleveland Medical Center and Rainbow Babies & Children’s Hospital, Cleveland, OH, USA; and Case Western Reserve University School of Medicine, Cleveland, OH, USA

Funding source: University Hospitals Graduate Medical Education

Award Identifier / Grant number: Innovation Award # P0478

Funding source: Zucker Neurology Fund

Acknowledgments

The authors thank Lauren Shurtleff, Paul Shaniuk, and Arsalan Derakhshan for authoring cases utilized in this study. They have been informed that they are being acknowledged for their contributions.

  1. Research ethics: This study was reviewed and approved by the Case Western Reserve University School of Medicine Institutional Review Board (IRB#20210682).

  2. Informed consent: Informed consent was obtained from all individuals included in this study.

  3. Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission. NS: Design, implementation, analysis, writing. ARS: Design, implementation. LG: Design, analysis, editing. JA: Statistical analysis, writing, editing. KQ: Design, implementation, analysis, editing.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: This work was supported in part by University Hospitals Graduate Medical Education Innovation Award # P0478, as well as by funding from the private Zucker Neurology Fund, which supported application development.

  7. Data availability: The raw data can be obtained on request from the corresponding author.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/dx-2023-0109).


Received: 2023-08-18
Accepted: 2024-09-15
Published Online: 2024-10-16

© 2024 Walter de Gruyter GmbH, Berlin/Boston
