
Validity and reliability of Brier scoring for assessment of probabilistic diagnostic reasoning

Nathan Stehouwer, Anastasia Rowland-Seymour, Larry Gruppen, Jeffrey M. Albert and Kelli Qua
Published/Copyright: October 16, 2024

Abstract

Objectives

Educators need tools for the assessment of clinical reasoning that reflect the ambiguity of real-world practice and measure learners’ ability to determine diagnostic likelihood. In this study, the authors describe the use of the Brier score to assess and provide feedback on the quality of probabilistic diagnostic reasoning.

Methods

The authors describe a novel format called Diagnostic Forecasting (DxF), in which participants read a brief clinical case and assign a probability to each item on a differential diagnosis, order tests and select a final diagnosis. DxF was piloted in a cohort of senior medical students. DxF evaluated students’ answers with Brier scores, which compare probabilistic forecasts with case outcomes. The validity of Brier scores in DxF was assessed by comparison to subsequent decision-making in the game environment of DxF, as well as external criteria including medical knowledge tests and performance on clinical rotations.
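For reference, the standard Brier score compares assigned probabilities with the observed case outcome: lower scores indicate better-calibrated forecasts. The minimal Python sketch below illustrates the multi-category form for a single case; the exact scoring implementation used in DxF is not described in this abstract, and the function name, diagnosis labels, and probabilities are illustrative assumptions.

```python
# Minimal sketch of a multi-category Brier score for one case, following the
# standard Brier (1950) formulation. The scoring rule actually implemented in
# DxF is not specified here; names and values are illustrative assumptions.

def brier_score(forecast: dict[str, float], true_diagnosis: str) -> float:
    """Sum of squared differences between forecast probabilities and the
    observed outcome (1 for the confirmed diagnosis, 0 for all others)."""
    return sum(
        (p - (1.0 if dx == true_diagnosis else 0.0)) ** 2
        for dx, p in forecast.items()
    )

# Example: a student distributes probability across a differential diagnosis.
differential = {"pulmonary embolism": 0.6, "pneumonia": 0.3, "heart failure": 0.1}
print(brier_score(differential, "pneumonia"))  # 0.36 + 0.49 + 0.01 = 0.86
```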

Results

Brier scores showed statistically significant correlations with diagnostic accuracy (95 % CI −4.4 to −0.44) and with mean scores on National Board of Medical Examiners (NBME) shelf exams (95 % CI −474.6 to −225.1). Brier scores did not correlate with clerkship grades or with performance on a structured clinical skills exam. Reliability, as measured by within-student correlation, was low.

Conclusions

Brier scoring showed evidence for validity as a measure of medical knowledge and a predictor of clinical decision-making. Further work should evaluate the ability of Brier scores to predict clinical and workplace-based outcomes and develop reliable approaches to measuring probabilistic reasoning.


Corresponding author: Nathan Stehouwer, MD, University Hospitals Cleveland Medical Center and Rainbow Babies & Children’s Hospital, Cleveland, OH, USA; and Case Western Reserve University School of Medicine, Cleveland, OH, USA

Funding source: University Hospitals Graduate Medical Education

Award Identifier / Grant number: Innovation Award # P0478

Funding source: Zucker Neurology Fund

Acknowledgments

The authors thank Lauren Shurtleff, Paul Shaniuk, and Arsalan Derakhshan for authoring cases utilized in this study. They have been informed that they are being acknowledged for their contributions.

  1. Research ethics: This study was reviewed and approved by the Case Western Reserve University School of Medicine Institutional Review Board (IRB#20210682).

  2. Informed consent: Informed consent was obtained from all individuals included in this study.

  3. Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission. NS: Design, implementation, analysis, writing. ARS: Design, implementation. LG: Design, analysis, editing. JA: Statistical analysis, writing, editing. KQ: Design, implementation, analysis, editing.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: This work was supported in part by University Hospitals Graduate Medical Education Innovation Award # P0478, as well as by funding from the private Zucker Neurology Fund, which supported application development.

  7. Data availability: The raw data can be obtained on request from the corresponding author.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/dx-2023-0109).


Received: 2023-08-18
Accepted: 2024-09-15
Published Online: 2024-10-16

© 2024 Walter de Gruyter GmbH, Berlin/Boston
