
“Cephalgia” or “migraine”? Solving the headache of assessing clinical reasoning using natural language processing

  • Christopher R. Runyon, Polina Harik and Michael A. Barone
Published/Copyright: November 21, 2022

Abstract

In this op-ed, we discuss the advantages of leveraging natural language processing (NLP) in the assessment of clinical reasoning. Clinical reasoning is a complex competency that cannot be easily assessed using multiple-choice questions. Constructed-response assessments can more directly measure important aspects of a learner’s clinical reasoning ability, but substantial resources are necessary for their use. We provide an overview of INCITE, the Intelligent Clinical Text Evaluator, a scalable NLP-based computer-assisted scoring system that was developed to measure clinical reasoning ability as assessed in the written documentation portion of the now-discontinued USMLE Step 2 Clinical Skills examination. We provide the rationale for building a computer-assisted scoring system that is aligned with the intended use of an assessment. We show how INCITE’s NLP pipeline was designed with transparency and interpretability in mind, so that every score produced by the computer-assisted system could be traced back to the text segment it evaluated. We next suggest that, as a consequence of INCITE’s transparency and interpretability features, the system may easily be repurposed for formative assessment of clinical reasoning. Finally, we provide the reader with the resources to consider in building their own NLP-based assessment tools.

Background

Clinical reasoning is one of the most important foundational skills for effective clinical practice. While often referred to as a singular skill or ability, modern definitions of clinical reasoning highlight that it is a multi-faceted process composed of component processes and behaviors [1, 2] such as diagnostic reasoning, management reasoning, and/or therapeutic reasoning [3, 4]. The measurement of such a complex phenomenon is therefore no easy task. Multiple-choice questions (MCQs) are often used in medical education, but, as Daniel and coauthors [5] detail, this assessment format is only effective in measuring some aspects of clinical reasoning. Constructed-response assessments, such as short-answer questions, essay questions, modified essay questions, and post-encounter patient notes can more directly capture important aspects of a learner’s clinical reasoning ability.

While the benefits of constructed responses make them attractive for assessment, one barrier to their use at scale is the resources required to score them. In their simplest and briefest form, short-answer questions that follow a clinical vignette or patient chart may ask a learner to identify a single diagnosis, test, or procedure, which can be quickly scored by an instructor or trained assistant. When short-answer questions (or essays) ask the learner to provide a response that is more sophisticated than a single phrase (such as the relationships between several concepts), or when the expected response can be expressed in many equivalent ways (many lexical variants exist), substantially more resources are required for scoring.

Natural language processing (NLP) can be leveraged to reduce the time necessary to score constructed responses while capturing evidence of clinical reasoning that cannot be assessed through MCQs. Here we define NLP as the ability to utilize computers to process text. This broad definition encompasses such tasks as parsing sentences into parts of speech, identifying lexical variants of the same concept, automatic speech recognition (as used by many smartphones and household devices), deriving meaning from text, and question answering (such as online customer service bots). The field of NLP is at the intersection of linguistics and computer science, and often employs the use of statistical modeling.

Advances in computational capacity, computational linguistics, and machine learning have made the use of computer-assisted scoring of constructed responses a viable component of any assessment program. Implementing an NLP-based computer-assisted scoring system is not trivial, but interest is growing due to the broad applicability of NLP and the availability of new technology and capabilities. This interest is demonstrated by the sharp increase in submissions to the most popular NLP conferences and journals, with many receiving 200–300% of the submissions that were received just a few years ago [6].

The benefits of investing in an NLP scoring system are at least two-fold. First, NLP systems can reduce the time and resources necessary to operationalize constructed-response assessment formats. A second, and potentially even more compelling, reason is the consistency in scoring afforded by the NLP system. Consistency in scoring not only promotes fairness in assessment but can also provide more objective indicators of student growth and achievement than those based on individual judgment. Human raters are subject to natural within-person scoring variability over time; when multiple people score the same assessments, variation in the application of the scoring rubrics can occur even with rigorous oversight of scoring procedures [7]. These are but a few of the known issues inherent in using human raters for scoring [8, 9].

Many thorough descriptions of computer-assisted scoring systems exist [10, 11], including important evaluation criteria to ensure that a system is producing appropriate scores [12]. Here we do not attempt to replicate these excellent resources for learning about computer-assisted scoring, but instead focus on the aspects of our system that highlight the utility of its NLP component for the assessment of clinical reasoning. In any case, the goal of any computer-assisted scoring system is to consistently produce scores comparable to those produced by human scoring [10].

The remainder of this article describes a computer-assisted scoring system that was developed to score constructed responses on a high-stakes medical licensure exam: the post-encounter patient note (PN) written in the now-discontinued USMLE Step 2 Clinical Skills examination [13]. The purpose of this scoring system was to identify high-proficiency examinees who scored substantially above the minimum passing point, thereby eliminating the need for human scoring of those examinees' notes. The system was designed to reduce the number of PNs that needed to be reviewed by human raters by about half: a reduction of more than 150,000 PNs from the approximately 330,000 PNs graded annually. Each computer-assisted scoring system necessarily has unique components or design considerations that optimize it for its specific purpose. Here we provide the rationale for the design decisions we made in developing our system so that readers can better understand how the workflow of one particular system may be altered to build other NLP-based computer-assisted scoring systems.

An NLP-based computer-assisted system to score patient notes

The basis of any computer-assisted scoring system is a set of previously scored assessments. These scored assessments are referred to as labeled data: the items (text, in this instance) from the assessment are the data, and the accompanying expert score (or classification, e.g., pass/fail) is the label for that data. With enough labeled data, a computer-assisted scoring system can mathematically derive the relationship between the data and the label and then apply this algorithm to label new, unscored data. In the present case, our labeled data consisted of previously scored PNs from Step 2 CS examinations, as the PN provided important information about the examinee's clinical reasoning ability.

NLP systems transform the raw text input into a set of explanatory variables that provide information relevant to the outcome score or classification [14]. In the machine learning literature these variables are called features; we use this terminology hereafter as well. These features may be general characteristics of the data, or they may be specifically engineered to represent some aspect of text that is related to the score or classification (such as grammar, style, and organization in the assessment of essay writing [15]). Feature engineering is one of the most important and most time-consuming aspects of building an interpretable NLP-based computer-assisted scoring system.

It is important to note that labeled data and features should be reviewed to ensure that no bias exists in either the score-modeling algorithm or in any machine-derived features. Any bias that exists in labeled data will be perpetuated (or potentially exacerbated) by automated scoring algorithms [16]. A responsible computer-assisted scoring system follows all of the best practices for fairness in assessment [17], including examining the scores coming from the system to ensure they do not disadvantage any groups of learners [18]. Given that our PNs did not contain any identifiable student data when sent to physician-raters for scoring, and that we were not utilizing any machine-derived features, we felt comfortable that our computer-assisted scoring algorithm was not perpetuating any bias. We also empirically confirmed, after the computer-assisted scoring algorithm had been developed, that it was not biasing examinee scores based on any demographic characteristics.

The object of assessment: post-encounter patient note

The USMLE Step 2 Clinical Skills examination consisted of 12 unique structured clinical encounters between an examinee and a standardized patient. The purpose of each encounter was to broadly gather information on an examinee’s clinical skills, including several aspects related to the measurement of clinical reasoning: information gathering, developing a differential diagnosis, selecting a leading diagnosis, providing justification for the diagnosis provided, and selecting appropriate diagnostic studies (when applicable) [5]. Based on an examination blueprint, individual standardized patients were trained to portray a clinical condition defined by a set of key essentials, which are characteristics that are consistent with the clinical condition and are crucial to the development of the case-specific differential diagnosis. After a 15-minute encounter with the standardized patient, the examinee was given 10 minutes to document the encounter in a structured document – the PN – intended as written communication, such that another physician would be able to read the PN and be adequately informed of the standardized patient’s clinical condition. The PN was one of the main objects of assessment in the USMLE Step 2 CS examination.

The PN consisted of four sections: (1) history findings; (2) physical examination; (3) diagnoses (and supporting details); and (4) recommended diagnostic tests. The history section was where the examinee documented the presenting symptoms and relevant patient history elicited during the encounter, providing evidence of the examinee's information-gathering skills. In the physical examination section, the examinee documented which physical exam elements were performed, along with the findings, assessing hypothesis-driven inquiry and information gathering. The diagnoses section allowed examinees to enter up to three potential diagnoses in decreasing order of plausibility; the cases were designed such that anywhere from one to four diagnoses could be generated, allowing an examinee's differential diagnosis to be assessed. The examinee also provided supporting information for each listed diagnosis, which provided evidence of their ability to justify their clinical decisions. The final section, diagnostic tests, was where the examinee listed the tests needed to confirm or further investigate the clinical condition(s) portrayed by the standardized patient.

An important aspect of the PN was its structure. Specific areas of the note were associated with specific portions of the patient encounter, which were in turn associated with specific aspects of clinical reasoning. Although the patient history and physical exam sections themselves were unstructured – examinees were free to enter the information obtained during the encounter however they felt appropriate, such as in checklists or short notes – the information in these sections had some expected content. This significantly reduces the complexity of the NLP system, as the system does not need to disentangle free text into the corresponding sections. Having the PN structured in this manner allowed us to capture different aspects of clinical reasoning. For example, the system ‘knows’ what information was provided as justification for a diagnosis because of where the text was entered in the note. This contrasts with NLP systems that identify reasoning via analysis of the predicate-argument structure of the writing [19, 20], a task that increases in difficulty with the complexity of the writing.
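
To make this section-awareness concrete, the sketch below shows one minimal way a structured PN could be represented so that concept searches are scoped to the section where a concept is expected. The field names and example content are illustrative assumptions, not the operational data model.

```python
from dataclasses import dataclass, field

@dataclass
class PatientNote:
    """Illustrative container mirroring the four PN sections (hypothetical field names)."""
    history: str = ""                                         # free-text history findings
    physical_exam: str = ""                                    # free-text physical examination findings
    diagnoses: list[str] = field(default_factory=list)         # up to three, in decreasing order of plausibility
    justifications: list[str] = field(default_factory=list)    # supporting evidence, one entry per diagnosis
    diagnostic_tests: str = ""                                  # recommended diagnostic studies

# Because the note is structured, a concept search can be restricted to the section
# where the concept is expected (e.g., "nocturnal cough" in history), and text entered
# as justification is already linked to a specific diagnosis by its position.
note = PatientNote(
    history="27 yo F with nocturnal cough x 2 weeks, no fever",
    diagnoses=["Asthma", "GERD"],
    justifications=["nocturnal cough, wheezing", "symptoms worse after meals"],
)
print(note.diagnoses[0], "->", note.justifications[0])
```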

PNs were assessed by physician-raters trained on the key essentials of the case. The physician-raters provided two scores, one for data gathering (DG) and one for data interpretation (DI) [21]. The DG score was guided by how well the examinee documented the pertinent details about the standardized patient's clinical condition in the patient history and physical examination sections. Higher DG scores reflected PNs that contained higher-quality information: a majority (if not all) of the pertinent key features of the patient presentation. Higher DI scores were associated with providing the expected diagnoses, the correct ordering of the diagnoses (when applicable), and the supporting evidence for each diagnosis (gathered from the history and clinical presentation). More detail on the scoring of each section is provided below.

INCITE: the natural language processing component

The Intelligent Clinical Text Evaluator (INCITE) NLP driver was developed for computer-assisted scoring of patient notes [22]. At a very high level, INCITE scans the text entered in the patient note to determine whether specific pre-defined concepts are present. The key essentials are the information most relevant to assessing a learner's clinical reasoning ability through the PN and, as such, became the features relevant for scoring that case. A brief description of the INCITE engine follows; more granular detail can be found elsewhere [22].

At the risk of oversimplifying, INCITE is a sophisticated text-matching system that examines the PN to identify which key essentials and diagnoses are present in the note. The system progresses sequentially from exact matches of the key essential text to increasingly “fuzzy” or loose matches of the given key essential; see Figure 1 for a visual representation of the computer-assisted scoring pipeline. Once a term has been detected, the search for that term stops. For example, suppose that “nocturnal cough” is an important key feature of a patient’s presentation and is expected to be reported in the patient history portion of the PN. INCITE first scans the history section to see if the exact phrase “nocturnal cough” is present. If it is undetected, INCITE then checks whether close string matches for “nocturnal cough” are present; this accounts for possible minor misspellings or typos of the phrase (spelling is not assessed in the PN), such as “nocturnal cougf.”
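
A minimal sketch of this exact-then-fuzzy cascade is shown below, using the standard library's difflib for the fuzzy step. The 0.85 similarity threshold and the word-window strategy are illustrative assumptions, not INCITE's actual matching parameters.

```python
import difflib

def find_concept(section_text: str, phrase: str, threshold: float = 0.85) -> bool:
    """Return True if `phrase` (or a close misspelling of it) appears in `section_text`."""
    text = section_text.lower()
    phrase = phrase.lower()

    # Step 1: exact substring match.
    if phrase in text:
        return True

    # Step 2: fuzzy match against word windows the same length as the phrase,
    # which tolerates minor misspellings such as "nocturnal cougf".
    words = text.split()
    n = len(phrase.split())
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        if difflib.SequenceMatcher(None, window, phrase).ratio() >= threshold:
            return True
    return False

print(find_concept("Patient reports a nocturnal cougf and mild fatigue", "nocturnal cough"))  # True (fuzzy match)
```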

Figure 1: Overview of the computer-assisted scoring pipeline. After being prepared for the system, patient notes are first scanned to see if any exact or fuzzy matches of key essential concepts are possible (1). Those concepts that are not detected are then searched for in the annotations and lexical variant library (2), first through exact matching and then fuzzy matching. A pre-calibrated scoring algorithm is then applied to the set of concepts detected in the patient note to arrive at a note score. The note scores are then combined into an overall patient note score. If the overall patient note score is high, the examinee is considered a “Pass.” If the overall patient note score is low or if the system has any difficulty processing the note, these patient notes are then routed to physician-raters for review.

If no exact or fuzzy string matches are made to the original key essential phrase, INCITE proceeds by trying to match lexical variants of the key essential phrase. The lexical variant dictionary was built by training physicians to annotate a sample of previously scored PNs. During this process, annotators identified phrases that convey the same intent as the key essential phrase but through different word choices; “night cough,” “evening cough,” and “wakes up at night coughing” may all be identified as suitable lexical variants indicating the reporting of a nocturnal cough. INCITE then proceeds as above with the lexical variants: first by attempting to identify exact matches, and then by looking for fuzzy matches to the lexical variants. See Figure 2 for an example of a PN with lexical variants mapped to the key essential phrases.
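
Building on the same idea, the sketch below adds a small lexical-variant dictionary so that the cascade falls back to variants when the original key essential phrase is not found. The dictionary contents and the matching threshold are again illustrative assumptions.

```python
import difflib

# Illustrative lexical-variant dictionary for one key essential phrase.
VARIANTS = {
    "nocturnal cough": ["night cough", "evening cough", "wakes up at night coughing"],
}

def fuzzy_in(text: str, phrase: str, threshold: float = 0.85) -> bool:
    """Exact substring match, falling back to fuzzy matching over word windows."""
    if phrase in text:
        return True
    words, n = text.split(), len(phrase.split())
    return any(
        difflib.SequenceMatcher(None, " ".join(words[i:i + n]), phrase).ratio() >= threshold
        for i in range(len(words) - n + 1)
    )

def detect_key_essential(section_text: str, key_phrase: str) -> bool:
    """Try the key essential phrase first, then each of its lexical variants."""
    text = section_text.lower()
    candidates = [key_phrase] + VARIANTS.get(key_phrase, [])
    return any(fuzzy_in(text, candidate.lower()) for candidate in candidates)

print(detect_key_essential("c/o wakes up at night coughing, denies fever", "nocturnal cough"))  # True (variant match)
```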

Figure 2: Example mapping of phrases in a patient note (left) to the corresponding key essential phrases in the case blueprint (right).

This lexical variant dictionary is further augmented during a quality-control check of the system. During this check, DG notes with high scores where a specific key essential phrase is not detected are examined to see if the concept is truly missing or if another lexical variant needs to be added to the dictionary. In the latter case, these lexical variants are reviewed by a subject-matter expert and then added to the dictionary. Similarly, DG notes with low scores where a specific key essential phrase was detected by INCITE are examined for potential errors in the annotations. This process is iterated until INCITE detects the key essential concepts at roughly the same rate at which they were identified in the physician-rater-annotated PNs.
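
A simple illustration of how such a quality-control pass might flag notes for expert review follows; the score cutoffs and data layout are assumptions made purely for demonstration.

```python
# Each record pairs a physician-rater DG score with the system's detection flag
# for one key essential concept (illustrative data layout and cutoffs).
notes = [
    {"note_id": 1, "dg_score": 9.0, "concept_detected": False},  # high score, concept missed: possible new variant
    {"note_id": 2, "dg_score": 2.0, "concept_detected": True},   # low score, concept "found": possible annotation error
    {"note_id": 3, "dg_score": 8.5, "concept_detected": True},
]

HIGH, LOW = 8.0, 3.0  # illustrative cutoffs

# High-scoring notes where the concept went undetected: candidates for new lexical variants.
possible_missing_variants = [n for n in notes if n["dg_score"] >= HIGH and not n["concept_detected"]]

# Low-scoring notes where the concept was detected: candidates for annotation errors.
possible_annotation_errors = [n for n in notes if n["dg_score"] <= LOW and n["concept_detected"]]

print([n["note_id"] for n in possible_missing_variants])   # [1] -> route to subject-matter expert
print([n["note_id"] for n in possible_annotation_errors])  # [2] -> review annotations
```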

Data gathering

The result of the INCITE program for the DG section is a set of features indicating the presence or absence of each key essential for a given clinical scenario. (Further detail on the performance of the system in detecting clinical concepts is forthcoming [23].) This set of features is then paired with the physician-rater score for that note so that the relationship between the features and the score can be modeled. The choice of statistical model in machine learning is important: there is a wide spectrum of interpretable and ‘black box’ models that result in differing degrees of accuracy [24].

For USMLE Step 2 CS, we chose to use linear regression to model the relationship between the features and human rating. This choice was made to maximize score transparency and interpretability. Linear regression produces a regression coefficient for each of the included features; this is the amount by which an examinee’s score should be increased if the associated key essential is detected in the DG section. When two PNs contain all the same key essential concepts except for one, the difference in the two scores would be the value of the regression coefficient for the missing key essential. It was important for us to be able to trace differences in scores to differences in the specific concepts detected in the patient notes.
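
The sketch below illustrates this modeling step with scikit-learn; the feature matrix, concept names, and ratings are toy values intended only to show how each fitted coefficient becomes a per-concept score increment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows: previously scored patient notes. Columns: presence (1) / absence (0)
# of each key essential concept as detected by the NLP step (toy data).
X = np.array([
    [1, 1, 1],   # all three key essentials documented
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
])
y = np.array([9.0, 7.5, 5.0, 2.5, 7.0])  # physician-rater DG scores (toy values)

model = LinearRegression().fit(X, y)

# Each coefficient is the score increment attributable to documenting that key
# essential, so any difference between two predicted scores can be traced back
# to the specific concepts detected in (or missing from) each note.
concept_names = ["nocturnal cough", "no fever", "wheezing on exam"]  # hypothetical labels
print(dict(zip(concept_names, model.coef_.round(2))))
print(model.predict(np.array([[1, 1, 1], [1, 1, 0]])))  # scores differ by the third coefficient
```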

In contrast, alternative approaches can potentially improve score-prediction accuracy but may be less interpretable. Deep learning prediction models are effective at finding nuanced relationships among input features that optimize prediction accuracy, but the way such a model arrives at its final prediction can be opaque. In that case we are no longer able to describe why two examinees received different scores on their PNs. However, this might be an acceptable scoring model for an assessment with lower stakes or a different purpose [25].

Data interpretation

The INCITE program works similarly for the DI portion of the PN. The diagnosis entries are evaluated in the same way as the key essential phrases in the DG portion of the PN. Our investigations show that INCITE is more accurate at identifying diagnosis terms than the DG key essentials, as the lexical variants for each diagnosis are usually limited to a few synonyms and medical shorthand.

The scoring of the DI section is guided by a series of rules that depend on the prioritization (order) and categorization of the diagnoses provided by the examinee. The three broad categories of diagnoses derived from the physician-rater PN scoring guidelines are expected, plausible, and implausible. The expected diagnoses are those listed in the key essentials for the case portrayed by the standardized patient. With multiple expected diagnoses, there is often an expected order of these diagnoses in the examinee’s differential, such that one expected diagnosis is “Lead-Expected” and the second is “Second-Expected.” Other cases allow expected diagnoses to be interchangeable in their position on the differential. The plausible category is given to recorded diagnoses that are plausible given the information extracted in the data gathering portion of the note and have some clinical value in their pursuit, such that pursuing them could potentially lead to one of the expected differential diagnoses. Implausible diagnoses are those that are not supported by the information in the data gathering section and/or could relate to a harmful misdiagnosis in a patient with the presenting clinical conditions. As part of the annotation process, physician-raters were provided with a list of the diagnoses observed for a given case and asked to provide the “Lead-Expected,” “Second-Expected,” etc., classification.

The data interpretation scoring guidelines for human raters designate the scores examinees should receive based on the order and category of the diagnoses presented. For example, if a plausible diagnosis is listed higher than an expected diagnosis in an examinee’s differential – an “order error” – this is associated with a specific point penalty. Similarly, if an examinee lists an implausible diagnosis in any position, this also incurs a specific point penalty. Such penalties may be combined, as when an examinee lists an implausible diagnosis in the first position ahead of the expected leading diagnosis, yielding an “Implausible-Lead-Expected” response pattern.

For computer-assisted scoring, student responses are grouped by their observed differential response pattern, with the response pattern conveying information about the order and category of the diagnoses. The physician-rater scores for all PNs containing a given response pattern are averaged, and that average is assigned to every PN with that response pattern. This is equivalent to a linear regression model with all interaction terms among the three possible positions included. Consequently, the score penalties are reflected in the score without needing to be explicitly programmed.
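
A minimal sketch of this grouping-and-averaging step is shown below using pandas; the response-pattern labels and scores are toy values chosen for illustration.

```python
import pandas as pd

# Previously scored notes: each row pairs a differential response pattern
# (order + category of the listed diagnoses) with its physician-rater DI score (toy values).
rated = pd.DataFrame({
    "pattern": [
        "Lead-Expected|Second-Expected",
        "Lead-Expected|Second-Expected",
        "Plausible|Lead-Expected",      # order error: plausible listed above an expected diagnosis
        "Implausible|Lead-Expected",
        "Plausible|Lead-Expected",
    ],
    "di_score": [10.0, 9.0, 6.0, 4.0, 6.5],
})

# The computer-assisted DI score for a pattern is the mean physician-rater score of all
# notes sharing that pattern; order and implausibility penalties are therefore absorbed
# into the averages without being explicitly programmed.
pattern_scores = rated.groupby("pattern")["di_score"].mean()

new_note_pattern = "Plausible|Lead-Expected"
print(pattern_scores[new_note_pattern])  # 6.25
```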

Once the DG and DI scores were calculated for a given case, they were combined to arrive at a case score. These case-level scores were then combined across all cases to arrive at an overall PN score for that examinee. Internal research was conducted to identify a PN score threshold that identified highly performing examinees whose probability of failing, given the measurement error, was zero. If an examinee's PN score was above the threshold, their notes were considered a “Pass” and did not require human-rater review.
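
A sketch of the final aggregation and routing step appears below; the simple summation and the threshold value are illustrative assumptions rather than the operational scoring rules.

```python
# Per-case DG and DI scores for one examinee (toy values; a full exam had 12 cases).
case_scores = [
    {"dg": 8.5, "di": 9.0},
    {"dg": 7.0, "di": 8.0},
    {"dg": 9.0, "di": 9.5},
    # ... remaining cases omitted for brevity
]

PASS_THRESHOLD = 24.0  # illustrative threshold identifying clearly high-performing examinees

# Combine DG and DI into a case score, then aggregate case scores into an overall PN score.
overall_pn_score = sum(case["dg"] + case["di"] for case in case_scores)

if overall_pn_score >= PASS_THRESHOLD:
    decision = "Pass (no human review required)"
else:
    decision = "Route patient notes to physician-raters for review"

print(overall_pn_score, decision)
```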

The INCITE system was developed to err on the side of false negatives (a key essential concept present in the note but undetected) rather than false positives (a key essential concept deemed present by INCITE but absent from the note). This was aligned with the purpose of the computer-assisted scoring system: identifying high-scoring examinees. It was more desirable for a proficient examinee's PNs to be reviewed by a physician-rater than for an examinee who had not demonstrated the requisite level of proficiency to pass without their notes being scored by human raters. Our simulations showed that the NLP-based computer-assisted scoring system was accurate enough to pass nearly 75% of examinees with zero classification error.

Discussion

The NLP-based computer-assisted scoring system presented in this manuscript was developed specifically for a pointed use: to identify students with very high passing scores in a summative, high-stakes medical licensure assessment. Decisions made regarding feature engineering and subsequent score modeling were consistent with this purpose. However, if the system were to be used for a different purpose, changes to the model would have to be made accordingly. For example, INCITE and the computer-assisted scoring system could be redesigned to provide formative feedback to students about what key features were not documented in their PN or how their differential compared to the case blueprint differential, which could then be reviewed with peers or with an instructor. Similarly, this student-specific information could be aggregated to provide feedback to an instructor about class-wide learning opportunities, or even reported at an institutional level to inform program evaluation and curriculum planning and development.

Work on the NLP-based computer-assisted scoring system is ongoing, as the ability to correctly identify medical concepts in constructed responses has general utility in assessment. To that end, NBME recently sponsored a Kaggle data science competition to generate additional methods for identifying clinical phrases in text [26]. Many data scientists who performed well in the competition have made their solutions freely available, with heavily annotated computer code included so that others may reproduce their results (e.g., Ref. [27]); more solutions can be found in the right-hand column of the competition leaderboard [28]. In addition, interested researchers can apply for access to the full dataset used in the Kaggle competition by submitting a proposal to NBME's Data Sharing portal [29]. More detail about the data is available elsewhere [30]. Together, these data and the freely available solutions can give institutions a jump-start in building their own NLP-based computer-assisted scoring systems.


Corresponding author: Christopher R. Runyon, PhD, Senior Measurement Scientist, Growth and Innovation, National Board of Medical Examiners, 3750 Market St., Philadelphia, PA 19106, USA, E-mail:

  1. Research funding: None declared.

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Competing interests: Authors state no conflict of interest.

  4. Informed consent: Not applicable.

  5. Ethical approval: The American Institutes for Research (AIR) has determined that all research stemming from operational activities (licensure testing) is exempt from further human subjects review.

References

1. Young, M, Thomas, A, Lubarsky, S, Ballard, T, Gordon, D, Gruppen, LD, et al. Drawing boundaries: the difficulty in defining clinical reasoning. Acad Med 2018;93:990–5. https://doi.org/10.1097/acm.0000000000002142.

2. Bowen, JL. Educational strategies to promote clinical diagnostic reasoning. N Engl J Med 2006;355:2217–25. https://doi.org/10.1056/nejmra054782.

3. Cook, DA, Durning, SJ, Sherbino, J, Gruppen, LD. Management reasoning: implications for health professions educators and a research agenda. Acad Med 2019;94:1310–6. https://doi.org/10.1097/acm.0000000000002768.

4. Abdoler, EA, O’Brien, BC, Schwartz, BS. Following the script: an exploratory study of the therapeutic reasoning underlying physicians’ choice of antimicrobial therapy. Acad Med 2020;95:1238–47. https://doi.org/10.1097/acm.0000000000003498.

5. Daniel, M, Rencic, J, Durning, SJ, Holmboe, E, Santen, SA, Lang, V, et al. Clinical reasoning assessment methods: a scoping review and practical guidance. Acad Med 2019;94:902–12. https://doi.org/10.1097/acm.0000000000002618.

6. Association for Computational Linguistics. Conference acceptance rates [Internet]. Association for Computational Linguistics; 2021. Available from: https://aclweb.org/aclwiki/Conference_acceptance_rates [Accessed 8 May 2022].

7. Harik, P, Clauser, BE, Grabovsky, I, Nungester, RJ, Swanson, D, Nandakumar, R. An examination of rater drift within a generalizability theory framework. J Educ Meas 2009;46:43–58. https://doi.org/10.1111/j.1745-3984.2009.01068.x.

8. Gingerich, A, Regehr, G, Eva, KW. Rater-based assessments as social judgments: rethinking the etiology of rater errors. Acad Med 2011;86:S1–7. https://doi.org/10.1097/acm.0b013e31822a6cf8.

9. Kogan, JR, Conforti, LN, Iobst, WF, Holmboe, ES. Reconceptualizing variable rater assessments as both an educational and clinical care problem. Acad Med 2014;89:721–7. https://doi.org/10.1097/acm.0000000000000221.

10. Yan, D, Rupp, AA, Foltz, PW, editors. Handbook of automated scoring: theory into practice. Boca Raton, FL: CRC Press; 2020. https://doi.org/10.1201/9781351264808.

11. Attali, Y, Bridgeman, B, Trapani, C. Performance of a generic approach in automated essay scoring. J Technol Learn Assess 2010;10:1–16. https://ejournals.bc.edu/index.php/jtla/article/view/1603.

12. Williamson, DM, Xi, X, Breyer, FJ. A framework for evaluation and use of automated scoring. Educ Meas 2012;31:2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x.

13. Salt, J, Harik, P, Barone, MA. Leveraging natural language processing: toward computer-assisted scoring of patient notes in the USMLE step 2 clinical skills exam. Acad Med 2019;94:314–6. https://doi.org/10.1097/acm.0000000000002558.

14. Mitkov, R, editor. The Oxford handbook of computational linguistics. Oxford: Oxford University Press; 2004.

15. Burstein, J, Tetreault, J, Madnani, N. The e-rater automated essay scoring system. In: Shermis, M, Burstein, J, editors. Handbook of automated essay evaluation: current applications and new directions. Oxfordshire: Routledge; 2013. https://doi.org/10.4324/9780203122761.ch4.

16. Lee, NT. Detecting racial bias in algorithms and machine learning. J Inf Commun Ethics Soc 2018;16:252–60. https://doi.org/10.1108/jices-06-2018-0056.

17. Jonson, JL, Geisinger, KF, editors. Fairness in educational and psychological testing: examining theoretical, research, practice, and policy implications of the 2014 standards. Washington, DC: American Educational Research Association; 2022. https://doi.org/10.3102/9780935302967.

18. Johnson, MS, Liu, X, McCaffrey, DF. Psychometric methods to evaluate measurement and algorithmic bias in automated scoring. J Educ Meas 2022;59:338–61. https://doi.org/10.1111/jedm.12335.

19. Leacock, C, Chodorow, M. C-rater: automated scoring of short-answer questions. Comput Humanit 2003;37:389–405. https://doi.org/10.1023/a:1025779619903.

20. Lippi, M, Torroni, P. Argumentation mining: state of the art and emerging trends. ACM Trans Internet Technol 2016;16:1–25. https://doi.org/10.1145/2850417.

21. Baldwin, SG, Harik, P, Keller, LA, Clauser, BE, Baldwin, P, Rebbecchi, TA. Assessing the impact of modifications to the documentation component’s scoring rubric and rater training on USMLE integrated clinical encounter scores. Acad Med 2009;84:S97–100. https://doi.org/10.1097/acm.0b013e3181b361d4.

22. Sarker, A, Klein, AZ, Mee, J, Harik, P, Gonzalez-Hernandez, G. An interpretable natural language processing system for written medical examination assessment. J Biomed Inform 2019;98:103268. https://doi.org/10.1016/j.jbi.2019.103268.

23. Harik, P, Mee, J, Runyon, C, Clauser, B. Assessment of clinical skills: a case study in constructing an NLP-based scoring system for patient notes. In: Yaneva, V, von Davier, M, editors. Advancing natural language processing in educational assessments. London: Taylor & Francis; 2023. https://doi.org/10.4324/9781003278658-5.

24. Marcinkevičs, R, Vogt, JE. Interpretability and explainability: a machine learning zoo mini-tour. arXiv preprint arXiv:2012.01805; 2020.

25. Lottridge, S, Ormerod, C, Jafari, A. Psychometric considerations when using deep learning for automated scoring. In: Yaneva, V, von Davier, M, editors. Advancing natural language processing in educational assessments. London: Taylor & Francis; 2023. https://doi.org/10.4324/9781003278658-3.

26. Kaggle. NBME – score clinical patient notes [Internet]. Kaggle. Available from: https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/overview [Accessed 8 May 2022].

27. CPMP. #2 solution [Internet]. Kaggle. Available from: https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/323085 [Accessed 8 May 2022].

28. Kaggle. NBME – score clinical patient notes [Internet]. Kaggle. Available from: https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/leaderboard [Accessed 15 Aug 2022].

29. NBME. Data sharing [Internet]. Philadelphia (PA): NBME. Available from: https://www.nbme.org/services/data-sharing [Accessed 8 May 2022].

30. Yaneva, V, Mee, J, Ha, LA, Harik, P, Jodoin, M, Mechaber, A. The USMLE® Step 2 Clinical Skills patient note corpus. In: Proceedings of the 2022 conference of the North American Chapter of the Association for Computational Linguistics: human language technologies. Seattle, United States: Association for Computational Linguistics; 2022:2880–6. https://doi.org/10.18653/v1/2022.naacl-main.208.

Received: 2022-05-10
Accepted: 2022-10-10
Published Online: 2022-11-21

© 2022 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
