Article Open Access

A validity study of COMLEX-USA Level 3 with the new test design

  • Xia Mao, John R. Boulet, Jeanne M. Sandella, Michael F. Oliverio and Larissa Smith
Published/Copyright: March 19, 2024

Abstract

Context

The National Board of Osteopathic Medical Examiners (NBOME) administers the Comprehensive Osteopathic Medical Licensing Examination of the United States (COMLEX-USA), a three-level examination designed for licensure for the practice of osteopathic medicine. The examination design for COMLEX-USA Level 3 (L3) was changed in September 2018 to a two-day computer-based examination with two components: a multiple-choice question (MCQ) component with single-best-answer questions and a clinical decision-making (CDM) case component with extended multiple-choice (EMC) and short answer (SA) questions. Continued validation of the L3 examination, especially with the new design, is essential for the appropriate interpretation and use of the test scores.

Objectives

The purpose of this study is to gather evidence to support the validity of the L3 examination scores under the new design utilizing sources of evidence based on Kane’s validity framework.

Methods

Kane’s validity framework contains four components of evidence to support the validity argument: Scoring, Generalization, Extrapolation, and Implication/Decision. In this study, we gathered data from various sources and conducted analyses to provide evidence that the L3 examination is validly measuring what it is supposed to measure. These include reviewing content coverage of the L3 examination, documenting scoring and reporting processes, estimating the reliability and decision accuracy/consistency of the scores, quantifying associations between the scores from the MCQ and CDM components and between scores from different competency domains of the L3 examination, exploring the relationships between L3 scores and scores from a performance-based assessment that measures related constructs, performing subgroup comparisons, and describing and justifying the criterion-referenced standard setting process. The analysis data contains first-attempt test scores for 8,366 candidates who took the L3 examination between September 2018 and December 2019. The performance-based assessment utilized as a criterion measure in this study is the COMLEX-USA Level 2 Performance Evaluation (L2-PE).

Results

All assessment forms were built through the automated test assembly (ATA) procedure to maximize parallelism in terms of content coverage and statistical properties across the forms. Scoring and reporting follow industry-standard quality-control procedures. The inter-rater reliability of SA rating, decision accuracy, and decision consistency for pass/fail classifications are all very high. There is a statistically significant positive association between the MCQ and the CDM components of the L3 examination. The patterns of associations, both within the L3 subscores and with L2-PE domain scores, fit with what is being measured. The subgroup comparisons by gender, race, and first language showed expected small differences in mean scores between the subgroups within each category and yielded findings that are consistent with those described in the literature. The L3 pass/fail standard was established through implementation of a defensible criterion-referenced procedure.

Conclusions

This study provides some additional validity evidence for the L3 examination based on Kane’s validity framework. The validity of any measurement must be established through ongoing evaluation of the related evidence. The NBOME will continue to collect evidence to support validity arguments for the COMLEX-USA examination series.

The National Board of Osteopathic Medical Examiners (NBOME) administers the Comprehensive Osteopathic Medical Licensing Examination of the United States (COMLEX-USA), a three-level examination designed for licensure for the practice of osteopathic medicine. COMLEX-USA Level 3 (L3) assesses competence in the foundational-competency domains required for generalist physicians to deliver safe and effective osteopathic medical care and promote health in unsupervised clinical settings [1]. The examination design was changed in September 2018. It is now a 2-day computer-based examination with two components: a multiple-choice question (MCQ) component with single-best-answer questions and a clinical decision-making (CDM) case component with extended multiple-choice (EMC) questions and short answer (SA) questions.

The validation of licensure examination scores, or any inferences we wish to make based on the examination scores, is paramount. While there are several frameworks that one can utilize to classify validity evidence [2, 3], the model proposed by Kane is often employed [4, 5]. Validity evidence can be gathered based on four components: Scoring, Generalization, Extrapolation, and Implication/Decision. The Scoring inference applies to the phase of scoring of a single observation, such as a multiple-choice examination question, skill station, or clinical observation. Examples of validity evidence under this category include item and response option performance, standardization and equating across forms, applicability of scoring rubrics, rater selection and training, data security, and quality control (QC). The Generalization inference applies to the phase of utilizing the score(s) from the observation to generate an overall test score to represent performance in the test setting. Examples of validity evidence include reliability or generalizability of item scores and rater judgments. The Extrapolation component applies to the phase of drawing an inference from the test score to what it might imply for real-life performance. Examples of validity evidence under this category include associations with other measures and evaluations of differences in test performance among subgroups. The Implication/Decision inference applies to the phase of interpreting the test score and making a decision. Examples of validity evidence under this category include the appropriateness of the pass/fail standard, effectiveness of actions based on assessment results, and intended or unintended consequences of testing. It should be noted that the process of validation is never complete. One can always gather additional evidence to support the validity argument.

The validation of test scores, or inferences we make based on the test scores, is important in the health professions, especially when practitioners are required to make sound decisions regarding patient/client care [6], [7], [8]. In the medical licensure arena, several studies have reported evidence to support the validity of the COMLEX-USA licensure examination series. These include investigations of the relationships between scores (and subscores) from different examinations including preparation tests, studies of the relationship of COMLEX performance with later outcomes (e.g., residency performance, board certification, disciplinary actions), and published justifications for the design and content [9], [10], [11], [12], [13]. Although there have been some investigations of COMLEX-USA L3 [14], [15], [16], gathering additional validity evidence is certainly warranted. More importantly, with the introduction of the new L3 design in 2018, which included CDM items, previous validity evidence, some related to the administration of CDM items on other examinations [17], could be augmented. The continued validation of the L3 examination under the new design is vital to the appropriate interpretation and use of the test scores. The purpose of this investigation is to provide additional evidence to support the validity of COMLEX-USA L3 scores and associated pass/fail decisions.

Methods

Measures

Level 3 examination

All examinations in the COMLEX-USA series share the master blueprint based on the same two dimensions: competency domain and clinical presentation [18]. The evidence-based blueprint for the L3 examination, with the minimum percentage of items for each competency domain and clinical presentation aligned with the practice of osteopathic medicine, is well-documented [1, 13].

The L3 examination includes six sections of MCQ items and two sections of CDM items. The MCQ portion of the examination consists of a total of 420 items, each having a single best answer and scored as either correct or incorrect. The CDM portion of the examination contains 26 cases, each with a series of clinical scenarios followed by two to four questions. The CDM cases assess “key features”: critical decisions or challenges related to patient care that are routinely faced by osteopathic generalist physicians. There are two item types within the CDM cases. One is the EMC item, in which the candidate may select one or more answers from a list of options. The other is the SA item, in which candidates type their answers, subject to a word limit, into the box(es) provided. Some EMC and SA items are scored as 1 (correct) or 0 (incorrect), whereas others receive partial credit between 0 and 1 based on scoring rubrics defined by subject matter experts (SMEs).

A single score is reported for the L3 examination by aggregating the MCQ and the CDM components. Both the MCQ portion and the CDM portion contain pretest items or key features that do not contribute to the score that the candidate receives. Each operational MCQ question or CDM key feature contributes equally to the candidate’s score; therefore, the weighting of the MCQ and CDM components in scoring is determined by the total number of questions/key features in each component, which reflects the test blueprint and test specifications. The NBOME utilizes a psychometric model to convert each candidate’s total raw score (i.e., the number correct on multiple-choice questions and key features) to a standard score [19]. The score report provided to an L3 candidate contains the standard score, a pass or fail result, and a graphical representation of the performance profile that summarizes strengths and weaknesses in relation to the two examination blueprint dimensions and clinical disciplines.
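To make the raw-to-standard conversion concrete, the sketch below inverts a test characteristic curve under a simple Rasch model. The NBOME's actual psychometric model and scale constants [19] are not public, so the simulated item difficulties, the 500 scored units, and the 500 + 100 × theta scaling here are purely illustrative.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
b = rng.normal(0.0, 1.0, 500)  # difficulties of the scored units (illustrative)

def expected_raw(theta):
    """Expected number-correct score under a Rasch model (the TCC)."""
    return float(np.sum(1.0 / (1.0 + np.exp(-(theta - b)))))

def theta_from_raw(raw):
    """Invert the test characteristic curve to recover ability from a raw score."""
    return brentq(lambda t: expected_raw(t) - raw, -6.0, 6.0)

raw_score = 300
standard_score = 500 + 100 * theta_from_raw(raw_score)  # hypothetical scale constants
print(round(standard_score))
```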

Level 2-PE examination

The COMLEX-USA Level 2 Performance Evaluation (L2-PE) is a patient-presentation-based assessment of fundamental osteopathic clinical skills [20]. It was initially administered in 2004 and last administered in February 2020 (due to the COVID-19 pandemic). Several studies support the validity of the L2-PE scores [9, 21, 22]. It measures, albeit in a different format, some constructs related to L3 (e.g., application of biomedical knowledge) and some that are not (e.g., communication skills with a patient). For L2-PE, each candidate must demonstrate their clinical skills when they are presented with 12 standardized patient encounters, 11 of which are scored. A pass/fail score is reported for two domains: the humanistic domain (HM), which measures physician–patient communication, interpersonal skills, and professionalism; and the biomedical/biomechanical domain (BM), which measures history and physical examination for data gathering (DG), documentation skills (subjective-objective-assessment-plan [SOAP] notes), and the performance of osteopathic manipulative treatment (OMT). Candidates must pass both domains on the same administration to pass the L2-PE. The three subdomains under BM are compensatory: low performance on one subdomain can be offset by higher performance on one or more of the others.

Study sample

The analysis data for this study contains first-attempt test scores for 8,366 candidates who took the L3 examination between September 2018 and December 2019. The first-attempt test scores of the L2-PE for the same group of candidates were also utilized to study the association between the scores and subscores from the two examinations. Demographic data, including gender, race, and first language, were self-reported by the candidates during registration; as such, demographic data are missing for candidates who chose not to answer. Among the 8,366 candidates, 8,087 (96.7 %) reported demographic information for gender, race, and first language.

Analysis

Multiple potential sources of validity evidence were explored for the newly designed L3 examination scores utilizing Kane’s validity framework. For Scoring, this includes content coverage of the L3 examination, standardization across forms, and QC in the scoring and reporting processes. Here, the validity evidence is based primarily on the documentation of test assembly and scoring processes. For Generalization, we estimated the reliability of the L3 examination scores, rater consistency in the scoring of the CDM SA items, and decision accuracy and decision consistency of pass/fail classifications. For Extrapolation, we summarized the association between the scores from the MCQ component and CDM component and between scores from different competency domains of the L3 examination, the association of L3 scores with scores from a performance-based assessment that measures related and unrelated constructs (L2-PE), and subgroup comparisons for candidates taking the L3 examination. We expected that the MCQ and CDM scores would be moderately related. We also hypothesized that, based on the measured domains, specific components of the L3 and L2-PE examinations would be more highly related than others (e.g., interpersonal and communication skills and professionalism as measured in L3 and L2-PE; the L3 CDM component and the ability to write a SOAP note in L2-PE). Finally, based on the prevailing literature, we postulated that certain groups of candidates (e.g., those whose first language was not English) would not perform as well, on average, on L3. For the Implication/Decision inference, we documented and justified the pass/fail standard. Table 1 shows how these sources of evidence fit into Kane’s validity framework and the analysis methodology for each piece of evidence.

Table 1:

Validity evidence and methodology for data collection and analysis.

| Validity component | Sources of evidence | Methodology |
|---|---|---|
| Scoring | Content representation and standardization across forms | Description of the form assembly process |
| | Scoring and reporting process | Description of quality-control procedures in the scoring and reporting process |
| Generalization | Reliability of the examination | Stratified alpha (with MCQ and CDM as strata) |
| | Rater consistency of the CDM short answer items | Percentage of agreement between the first and second raters |
| | Decision accuracy and decision consistency | Expected classification accuracy (Rudner [27]) |
| Extrapolation | Association between the scores from the MCQ and CDM components and between scores from different competency domains | Pearson correlations between MCQ, CDM, and subdomain scores |
| | Association with the L2-PE examination | Pearson correlations corrected for measurement error (disattenuated) between L3 MCQ scores, CDM scores, and L2-PE domain and subdomain scores |
| | Subgroup comparison based on gender, race, and first language | Mean differences and effect sizes (Cohen’s d) of standard scores between the subgroups in three demographic categories |
| Implication | Standard set through a defensible and properly implemented procedure | Documentation of the standard-setting process |

  1. CDM, clinical decision-making; MCQ, multiple-choice question; L2-PE, COMLEX-USA Level 2 Performance Evaluation.
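As a concrete illustration of the Table 1 methodology, the sketch below computes stratified alpha [26] with the MCQ and CDM components as strata. The simulated component scores and the per-stratum alphas (loosely echoing the MCQ and CDM reliabilities later reported in Table 2) are illustrative only.

```python
import numpy as np

def stratified_alpha(strata_scores, strata_alphas):
    """Stratified alpha: 1 - sum_i var(X_i) * (1 - alpha_i) / var(total)."""
    total = np.sum(strata_scores, axis=0)
    error_var = sum(s.var(ddof=1) * (1 - a)
                    for s, a in zip(strata_scores, strata_alphas))
    return 1 - error_var / total.var(ddof=1)

rng = np.random.default_rng(1)
n = 1000
ability = rng.normal(0.0, 1.0, n)
mcq = 300 + 40 * ability + rng.normal(0, 15, n)  # simulated MCQ component scores
cdm = 50 + 8 * ability + rng.normal(0, 6, n)     # simulated CDM component scores

print(round(stratified_alpha([mcq, cdm], [0.87, 0.50]), 3))
```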

The investigation was deemed exempt from review by the Institutional Review Board (IRB) of the NBOME. Only de-identified data were utilized for the analyses.

Results

The results are summarized and grouped utilizing Kane’s validity framework.

Scoring

The blueprint for the L3 examination was constructed by osteopathic physicians in the context of clinical problem-solving [23]. To control item exposure and ensure test security, 16 parallel forms were administered in this cycle. Before the forms were assembled, SMEs reviewed the item pools utilized in the current cycle for accuracy and applicability with respect to current medical practice; only items deemed eligible were kept in the pools for form assembly. The forms were built through the automated test assembly (ATA) procedure to maximize parallelism in terms of content coverage and statistical properties [24]. All 16 forms in this study met the blueprint specifications and were reviewed by SMEs through a standardized procedure to confirm content coverage. Items were replaced and reviewed again by SMEs if they were found to have content overlap with other items on the same form. Through the ATA procedure, the forms were parallel in average item difficulty and item discrimination, and test information was maximized at the cut score points to ensure the highest precision [16].
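ATA is typically formulated as an integer-programming problem. The sketch below assembles a single form from a hypothetical item bank, enforcing blueprint minimums and holding mean difficulty near a shared target; the NBOME's actual SAS-based ATA engine [24] builds 16 parallel forms under far richer constraints, so this is only a schematic of the technique.

```python
import random
import pulp

random.seed(0)
bank = [{"id": i,
         "domain": random.choice(["CD1", "CD2", "CD3"]),
         "difficulty": random.uniform(-2.0, 2.0)} for i in range(300)]

FORM_LEN = 60            # items per form (hypothetical)
MIN_PER_DOMAIN = 15      # blueprint minimum per competency domain (hypothetical)
TARGET = 0.0             # target mean difficulty, shared across parallel forms

prob = pulp.LpProblem("ata_one_form", pulp.LpMinimize)
x = {item["id"]: pulp.LpVariable(f"x_{item['id']}", cat="Binary") for item in bank}
dev = pulp.LpVariable("dev", lowBound=0)

prob += dev  # objective: minimize deviation from the difficulty target
prob += pulp.lpSum(x.values()) == FORM_LEN
for d in ("CD1", "CD2", "CD3"):  # blueprint (content) constraints
    prob += pulp.lpSum(x[i["id"]] for i in bank if i["domain"] == d) >= MIN_PER_DOMAIN
total_difficulty = pulp.lpSum(i["difficulty"] * x[i["id"]] for i in bank)
prob += total_difficulty - FORM_LEN * TARGET <= dev   # |sum - target| <= dev
prob += FORM_LEN * TARGET - total_difficulty <= dev

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = [i for i in bank if x[i["id"]].value() == 1]
print(len(chosen), sum(i["difficulty"] for i in chosen) / FORM_LEN)
```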

Slight differences in form difficulty were adjusted through an item response theory (IRT)-based concurrent calibration procedure [25]. The passing scores for each examination form were adjusted accordingly to account for any differences in form difficulty. QC procedures were implemented in the scoring and reporting process to ensure the accuracy of the computed scores and score reports delivered to the candidates. These include independent rating of SA items, monitoring of rating consistency, key validation (ensuring the administered items are psychometrically sound), and independent replication of item calibration, equating, and scoring.
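One hedged way to picture the passing-score adjustment: once concurrent calibration places all forms on a common theta scale, the passing standard can be fixed on that scale and mapped through each form's test characteristic curve, yielding a slightly different raw cut per form. The Rasch model and numbers below are illustrative, not the NBOME's actual procedure.

```python
import numpy as np

def raw_cut_for_form(b_form, theta_cut):
    """Expected number-correct at the passing theta under a Rasch model."""
    return float(np.sum(1.0 / (1.0 + np.exp(-(theta_cut - b_form)))))

rng = np.random.default_rng(3)
theta_cut = -1.0                        # passing standard fixed on the theta scale
form_a = rng.normal(0.0, 1.0, 450)      # illustrative item difficulties, form A
form_b = rng.normal(0.1, 1.0, 450)      # form B slightly harder on average
print(round(raw_cut_for_form(form_a, theta_cut), 1),
      round(raw_cut_for_form(form_b, theta_cut), 1))  # harder form, lower raw cut
```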

The SA items are scored by the scoring engine through automatic matching if the responses correspond to the prescored responses in the engine, or by human scorers if the responses differ. Currently, all unmatched responses are scored independently by at least two human scorers. In cases of disagreement between the two scorers, a third scorer reviews the responses and scores from the first two scorers and provides an additional score. The scoring engine also conducts a consistency check and flags inconsistent scores for similar responses. Following this process, a QC scorer reviews all scoring results from the third scorer and those flagged as inconsistent by the scoring engine and makes a final determination of the score. The SA scorers are board-certified osteopathic physicians who have gone through a rigorous training process. They receive instructions on the scoring rubrics and the scoring engine and then complete a practice assignment. Scorers are provided feedback on their scores for the practice assignment, and additional practice is assigned if they do not meet certification criteria. Once new scorers are certified, they first score pretest cases, paired with experienced scorers. After they maintain an inter-rater reliability above 0.95 for three batches, they are promoted to primary scorer and are permitted to score any CDM case. Scorers at any level receive additional training and feedback if their scores consistently differ from the final scores.
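The adjudication flow described above can be summarized schematically as follows; the function names, toy answer key, and stand-in raters are illustrative, not the scoring engine's actual interface.

```python
def score_sa_response(response, prescored, rater1, rater2, rater3, qc):
    """Schematic adjudication flow for one short-answer (SA) response."""
    if response in prescored:                    # engine auto-match
        return prescored[response]
    s1, s2 = rater1(response), rater2(response)  # two independent human ratings
    if s1 == s2:
        return s1
    s3 = rater3(response, s1, s2)                # third rater reviews the disagreement
    return qc(response, s1, s2, s3)              # QC scorer makes the final call

# Toy usage with stand-in raters and a one-entry answer key:
final = score_sa_response(
    "beta blocker", {"metoprolol": 1.0},
    rater1=lambda r: 0.5, rater2=lambda r: 1.0,
    rater3=lambda r, a, b: 1.0,
    qc=lambda r, a, b, c: c)
print(final)  # 1.0
```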

All items counted toward the candidates’ total scores have undergone pretesting and validation. Item analyses are conducted to evaluate the quality of these pretest items. Item statistics, including item difficulty, item discrimination, and distractor information, are utilized to flag items for content review. Additionally, based on candidate comments on specific items, a content review could also be initiated. During the content review, SMEs review the flagged items, validate the correct answers, and determine whether the items can be utilized in future forms or need to be revised and re-pretested.

All psychometric processes that lead to the final scores for the candidates, including item calibration, equating, and scoring, are independently replicated by at least two analysts to ensure scoring accuracy. In addition, candidates may request score confirmation for individual scores after the scores have been reported. In this case, additional score validation will be conducted.

Generalization

The estimated reliabilities of the L3 examination total scores, as measured by stratified alpha [26], ranged from 0.86 to 0.89 across the 16 forms with an average of 0.88. For CDM SA items, the inter-rater consistency between the primary and secondary raters was above 95 %. Decision accuracy and decision consistency indices indicate the accuracy and consistency of classification of the examinees into the pass/fail categories. Utilizing Rudner’s method [27], the overall decision accuracy estimate of the L3 examination is 0.98 and the overall decision consistency estimate is 0.97.
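For readers unfamiliar with Rudner's method [27], the sketch below estimates decision accuracy and decision consistency under a classical model: true scores are assumed normal, and observed scores equal true scores plus error with SD equal to the standard error of measurement. The score scale, cut score, and distribution parameters are illustrative; only the 0.88 reliability comes from this study.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 500.0, 100.0   # illustrative score scale
reliability = 0.88         # average stratified alpha reported in this study
cut = 350.0                # hypothetical passing score

sem = sigma * np.sqrt(1 - reliability)     # standard error of measurement
sigma_true = sigma * np.sqrt(reliability)  # SD of true scores

theta = np.linspace(mu - 5 * sigma_true, mu + 5 * sigma_true, 2001)
w = norm.pdf(theta, mu, sigma_true)
w /= w.sum()                               # discretized true-score distribution

p_pass = 1 - norm.cdf(cut, theta, sem)     # P(observed >= cut | true score)
truly_passing = theta >= cut

accuracy = np.sum(w * np.where(truly_passing, p_pass, 1 - p_pass))
consistency = np.sum(w * (p_pass**2 + (1 - p_pass)**2))
print(round(accuracy, 3), round(consistency, 3))
```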

Extrapolation

Table 2 shows the Pearson correlations between different L3 competency domains and between the MCQ component and the CDM component; reliability estimates are also provided. Pearson correlations between each competency domain and the L3 total score ranged from 0.41 to 0.92 and were all statistically significant (p<0.01). The L3 total score was most strongly correlated with osteopathic patient care and procedural skills (0.92) and application of knowledge (0.82) and least strongly correlated with interpersonal and communication skills (0.41) and professionalism (0.43).

Table 2:

The Pearson correlations between MCQ, CDM, and competency domains of the Level 3 examination.

| | Total | CD1 | CD2 | CD3 | CD4 | CD5 | CD6 | CD7 | MCQ | CDM |
|---|---|---|---|---|---|---|---|---|---|---|
| Total | 0.88 | | | | | | | | | |
| Osteopathic principles, practice, and manipulative treatment (CD1) | 0.53 | 0.59 | | | | | | | | |
| Osteopathic patient care and procedural skills (CD2) | 0.92 | 0.36 | 0.79 | | | | | | | |
| Application of knowledge for osteopathic medical practice (CD3) | 0.82 | 0.35 | 0.69 | 0.71 | | | | | | |
| Practice-based learning and improvement in osteopathic medical practice (CD4) | 0.69 | 0.28 | 0.56 | 0.49 | 0.43 | | | | | |
| Interpersonal and communication skills in the practice of osteopathic medicine (CD5) | 0.41 | 0.15 | 0.31 | 0.26 | 0.25 | 0.20 | | | | |
| Professionalism in the practice of osteopathic medicine (CD6) | 0.43 | 0.15 | 0.31 | 0.28 | 0.25 | 0.17 | 0.32 | | | |
| Systems-based practice in osteopathic medicine (CD7) | 0.59 | 0.25 | 0.46 | 0.41 | 0.35 | 0.21 | 0.26 | 0.25 | | |
| MCQ | 0.99 | 0.54 | 0.90 | 0.83 | 0.66 | 0.38 | 0.43 | 0.56 | 0.87 | |
| CDM | 0.63 | 0.26 | 0.60 | 0.41 | 0.50 | 0.37 | 0.24 | 0.45 | 0.51 | 0.50 |

  1. Diagonal entries (bold in the original) are reliabilities of the total score and subdomain scores. CDM, clinical decision-making; MCQ, multiple-choice question.

Pearson correlations between the competency domain subscores for L3 ranged from 0.15 to 0.69, with the strongest of those correlations between osteopathic patient care and procedural skills and application of knowledge (r=0.69) and the weakest between interpersonal and communication skills and osteopathic principles, practice, and manipulative treatment and between professionalism and osteopathic principles, practice, and manipulative treatment (both 0.15). In general, interpersonal and communication skills and professionalism had weaker correlations with other competency domains. There was a moderate association between the MCQ component and the CDM component, with a Pearson correlation of 0.51. Correlations between the MCQ component score and the competency domain scores were generally higher than those between the CDM component scores and the competency domain scores.

Table 3 shows the disattenuated correlations (corrected for measurement error) between the L3 scores and L2-PE scores. The L3 total score and subscores had higher disattenuated correlations with the BM domain score than with the HM domain score. The disattenuated correlations between the L3 total score or subscores and the BM domain score ranged from 0.22 to 0.44, with that between professionalism and the BM domain score the lowest (0.22). The disattenuated correlations between the L3 total score and subscores and the HM domain score ranged from 0.13 to 0.31. As expected, interpersonal and communication skills and professionalism had the highest disattenuated correlations with HM among all the L3 subscores. Because systems-based practice had relatively higher correlations with interpersonal and communication skills and professionalism, it also had a relatively higher correlation with HM. The CDM component also had higher disattenuated correlations with HM than the MCQ component did.

Table 3:

The disattenuated correlation of L3 competency domain scores with L2-PE domain or subdomain scores.

| L3 score | Humanistic | Biomedical/biomechanical | Data gathering | SOAP | Osteopathic manipulative treatment |
|---|---|---|---|---|---|
| L3 Total | 0.20 | 0.40 | 0.36 | 0.35 | 0.25 |
| Osteopathic principles, practice, and manipulative treatment | 0.13 | 0.27 | 0.27 | 0.19 | 0.23 |
| Osteopathic patient care and procedural skills | 0.17 | 0.39 | 0.35 | 0.35 | 0.22 |
| Application of knowledge for osteopathic medical practice | 0.16 | 0.37 | 0.35 | 0.32 | 0.16 |
| Practice-based learning and improvement in osteopathic medical practice | 0.16 | 0.38 | 0.33 | 0.35 | 0.23 |
| Interpersonal and communication skills in the practice of osteopathic medicine | 0.31 | 0.44 | 0.35 | 0.41 | 0.37 |
| Professionalism in the practice of osteopathic medicine | 0.25 | 0.22 | 0.15 | 0.24 | 0.22 |
| Systems-based practice in osteopathic medicine | 0.25 | 0.38 | 0.32 | 0.33 | 0.33 |
| MCQ | 0.18 | 0.39 | 0.36 | 0.34 | 0.23 |
| CDM | 0.24 | 0.39 | 0.31 | 0.36 | 0.34 |
  1. CDM, clinical decision-making; L2-PE, COMLEX-USA Level 2 Performance Evaluation; MCQ, multiple-choice question; SOAP, subjective-objective-assessment-plan.

The disattenuated correlations between the L3 total score and subscores and the subscores of BM ranged from 0.15 to 0.41. Interpersonal and communication skills tended to have higher correlations with the BM subscores than the other L3 competency domains. The CDM component, designed to assess higher-order clinical application skills, also had higher correlations with SOAP and OMT than the MCQ component.
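For reference, the disattenuation used in Table 3 is Spearman's correction: the observed correlation divided by the square root of the product of the two scores' reliabilities. The input values in this sketch are illustrative, not taken from the study data.

```python
def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's correction: r_true = r_observed / sqrt(rel_x * rel_y)."""
    return r_xy / (rel_x * rel_y) ** 0.5

# e.g., an observed correlation of 0.30 between scores with reliabilities
# 0.88 and 0.70 disattenuates to roughly 0.38 (all values illustrative):
print(round(disattenuate(0.30, 0.88, 0.70), 2))
```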

Table 4 shows the average standard scores, standard deviations of standard scores, effect sizes of the standard score differences, and passing rates by subgroup for the L3 examination. In general, male candidates, white candidates, and candidates with English as their first language had higher average scores than their counterparts in the same category, and their passing rates were slightly higher. The standard score difference [28] is larger in the race (effect size=0.44) and first language (effect size=0.31) categories and smaller in the gender category (effect size=0.15); a quick computational check of the gender effect size follows Table 4.

Table 4:

The average standard scores, standard deviations of standard scores, effect sizes, and passing rates for L3 by subgroups.

| Category | Subgroup | n | Mean | Standard deviation | Effect size | Passing rate |
|---|---|---|---|---|---|---|
| Gender | Male | 4,497 | 581 | 126 | 0.15 | 97.6 % |
| | Female | 3,590 | 562 | 120 | | 97.2 % |
| Race | White | 5,087 | 592 | 124 | 0.44 | 98.3 % |
| | Other racesa | 3,000 | 539 | 116 | | 96.0 % |
| First language | English | 7,777 | 574 | 124 | 0.31 | 97.5 % |
| | Other languagesb | 310 | 537 | 115 | | 95.2 % |
  1. aOther races include African, Asian, Black, Hispanic, Native Hawaiian or Pacific Islander, Native, Indigenous, American Indian, and/or Alaska Native, and those who prefer not to answer. bOther languages include candidates who reported that their first language was not English.
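As a quick check on Table 4, the snippet below recomputes the gender effect size as Cohen's d with a pooled standard deviation (assuming that is the pooling the authors used); tiny discrepancies for the other categories would reflect rounding in the reported means and SDs.

```python
import math

# Gender subgroups from Table 4: n, mean, SD of standard scores
n1, m1, s1 = 4497, 581, 126   # male
n2, m2, s2 = 3590, 562, 120   # female

pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / pooled_sd
print(round(d, 2))  # ~0.15, matching the reported effect size
```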

Implication/Decision

The pass/fail standard is criterion-based for the L3 examination and is based solely on a candidate’s performance on the total examination. For the L3 examination, a passing score means that the candidate has demonstrated at least a minimal level of competency required for the safe and effective unsupervised, independent practice of osteopathic medicine. The NBOME follows industry-standard best practices to determine minimum pass/fail standards for its COMLEX-USA examinations [2, 29]. This includes gathering data from stakeholders concerning expected passing rates and convening a standard-setting panel to participate in a modified Angoff standard-setting study. The NBOME also periodically reviews and updates the blueprint (in 2-year cycles) and the minimum pass/fail standards (every 3–5 years) to ensure that the tested knowledge and skills on the examination and the standards reflect current osteopathic medical education and practice required for licensure.

Discussion

The validation of any assessment must be established through ongoing collection and evaluation of the related evidence. This is even more important for licensure examinations, where failing performance may restrict access to the profession [17]. The NBOME has conducted numerous studies to collect data to support claims that the examinations utilized for licensure measure what they are supposed to measure. This study provides some additional validity evidence for L3 with the new design that includes CDM items.

For the present investigation, different types of evidence based on Kane’s validity framework were collected [30]. From a scoring perspective, the L3 examination was administered under standardized conditions, with equivalent scoring algorithms applied to multiple parallel examination forms, all constructed with reference to a master blueprint. These steps to ensure that scores represent the domains of interest and are accurate, including form creation, detailed item review, defined QC processes, and measures to ensure scoring consistency, provide evidence to support the validity of the L3 examination scores.

With respect to generalization, the test forms are constructed utilizing a systematic process. The total L3 score is relatively free from measurement error and, for the CDM write-in component, the inter-rater reliability is quite high. Decision accuracy and decision consistency for classification decisions (pass/fail) are both high. The L3 examination, consisting of both MCQ and CDM items, yields reproducible scores and pass/fail decisions.

From an extrapolation viewpoint, the patterns of associations, both within the L3 subscores and with other licensure examination scores measuring somewhat different constructs, fit with what is being measured. The moderate correlation between L3 MCQ scores and L3 CDM scores indicates that the CDM portion measures related but slightly different skills from the MCQ portion. To some extent, this supports the added value of the CDM component in the new design of the L3 examination. The lower domain-total correlations for interpersonal and communication skills and professionalism are likely attributable to the difficulty of measuring these domains utilizing an MCQ or CDM item format. Alternatively, they may indicate that some domains, such as interpersonal and communication skills, would be more validly measured utilizing a performance-based assessment format.

From a criterion perspective, there were small to moderate correlations between the L3 total scores and the two L2-PE domain scores, indicating that the L3 examination measures related but distinct constructs from the L2-PE. Correlations between similar competency domains measured by the two examinations (e.g., interpersonal and communication skills vs. the HM domain) tend to be higher than those between dissimilar competency domains (e.g., osteopathic patient care and procedural skills vs. the HM domain). These findings are consistent with those from previous studies on the relationship between the computer-based assessment scores and performance-based assessment scores of the COMLEX-USA examination series [9]. In addition, the L3 CDM and MCQ components tend to correlate more highly with the biomedical/biomechanical parts of the L2-PE. In practice, physicians must interpret clinical data to make clinical decisions. The combination of applying clinical knowledge (MCQ items) and working through CDM cases in L3 is similar, at least with respect to the competencies required, to gathering data (patient history, physical examination) and summarizing this information in the form of a clinical note (L2-PE). The internal structure of the L3 examination, combined with its relationships with other examinations, provides additional evidence to support the construct validity of the scores.

The performance comparisons between subsets of test takers yielded findings consistent with those already discussed in the literature. For example, score differences across racial groups with larger effect sizes have been found on other high-stakes medical examinations such as the Medical College Admission Test (MCAT) and the United States Medical Licensing Examination (USMLE) [31, 32]. These subgroup differences are likely related to the socioeconomic disparities that exist among the racial groups or to prior academic differences among the subgroups. We also found that, on average, male candidates outperformed female candidates on the L3. This performance difference, albeit small, has been documented on other examinations measuring application of knowledge [33]. Our findings related to performance by first language also support the validity argument. Candidates whose first language is not English may have lower English reading proficiency, potentially affecting their overall performance [31].

From an Implication/Decision perspective, we documented that the L3 pass/fail standard was established through implementation of a defensible and properly applied procedure, a fundamental requirement for any valid inference concerning competence [34]. As with the other components of Kane’s validity framework, this provides additional evidence to support the validity of the L3 scores and associated pass/fail decisions. Going forward, as part of ongoing validation efforts, it would also be appropriate to document whether examinees who fail the examination improve to meet the standard [34].

Although we provided some evidence to support the validity of L3 scores and associated pass/fail decisions, the present investigation is not without limitations. The validation of any assessment must be established through the ongoing collection and evaluation of the related evidence [35]. For this study, we looked at the performance data of L3 candidates who first took the examination under the new format. Given the ongoing changes in medical practice and associated alterations to test content, additional analyses of the performance of more recent testing cohorts are certainly warranted. Likewise, from an extrapolation perspective, our reported validity evidence is far from complete. Investigation of the relationship of L3 performance with residency assessments, board certification scores, disciplinary actions, and patient outcomes would help strengthen the extrapolation argument. Structural aspects of the examination (e.g., administration over 2 days) that may interfere with the assessment of the proficiencies of interest should also be investigated. A more thorough exploration of potential construct-irrelevant sources of variability in examinee scores is needed [36]. With the addition of the CDM component to L3, it would also be valuable to explore the dimensional structure of the examination scores. While the moderate correlation between the MCQ and CDM component scores suggests that overlapping constructs are being measured, exploratory and confirmatory factor analyses could help provide additional evidence to support construct validity [37]. Finally, the set of criterion measures that we employed was fairly limited. Additional analyses of performance differences by specialty, examination timing, and Accreditation Council for Graduate Medical Education (ACGME) milestone ratings [38], to name a few, could help bolster the validity argument.

Conclusions

We collected multiple sources of validity evidence within the scoring, generalization, extrapolation, and implication components of Kane’s validity framework. Most of the evidence supported the validity of the L3 examination scores for their primary and intended use in osteopathic physician licensure. Although the NBOME continues to collect validity evidence to support the intended use and interpretation of the L3 scores, other stakeholders are also responsible for evaluating the validity of the scores or other assessment results for purposes other than those specified by the examination program.


Corresponding author: Xia Mao, PhD, National Board of Osteopathic Medical Examiners, Corporate Offices and National Center for Clinical Skills Testing, 8765 W. Higgins Road, Suite 200, Chicago, IL 60631-4174, USA

Acknowledgments

The authors would like to thank NBOME staff, Dr. John R. Gimpel, DO, MEd, who provided comments on a prior version of this manuscript, and Mingye Zhao, MSF, who aided with part of the data extraction for the analysis in this study.

  1. Research ethics: This study was reviewed by the IRB of the NBOME and deemed exempt.

  2. Informed consent: This study was deemed exempt.

  3. Author contributions: Xia Mao contributed to the conception and design, preparation, analysis and interpretation of data in the study, and the drafting of the article; John R. Boulet drafted a portion of the literature review and discussion in the article, revised it critically for important intellectual content, and assisted with the preparation of the manuscript for publication; Jeanne M. Sandella, Michael F. Oliverio and Larissa Smith provided information or consultation on important content within the article, reviewed and revised previous versions of the manuscript, and gave final approval of the version of the article to be published; and all authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

  4. Competing interests: The authors state no conflict of interest.

  5. Research funding: None declared.

  6. Data availability: The raw data may be obtained on request from the corresponding author per NBOME approval.

References

1. National Board of Osteopathic Medical Examiners. COMLEX-USA Level 3. https://www.nbome.org/assessments/comlex-usa/comlex-usa-level-3/ [Accessed 20 Nov 2022].

2. American Educational Research Association. Standards for educational and psychological testing; 2014. http://www.aera.net/Publications/Books/Standards-for-Educational-Psychological-Testing-2014-Edition [Accessed 6 July 2018].

3. Messick, S. Meaning and values in test validation: the science and ethics of assessment. Educ Res 1989;18:5–11. https://doi.org/10.3102/0013189X018002005.

4. Kane, MT. The argument-based approach to validation. Sch Psychol Rev 2013;42:448–57. https://doi.org/10.1080/02796015.2013.12087465.

5. Kane, MT. Current concerns in validity theory. J Educ Meas 2001;38:319–42. https://doi.org/10.1111/j.1745-3984.2001.tb01130.x.

6. Backhouse, S, Chiavaroli, NG, Schmid, KL, McKenzie, T, Cochrane, AL, Phillips, G, et al. Assessing professional competence in optometry – a review of the development and validity of the written component of the competency in optometry examination (COE). BMC Med Educ 2021;21:11. https://doi.org/10.1186/s12909-020-02417-6.

7. Callahan, JL, Bell, DJ, Davila, J, Johnson, SL, Strauman, TJ, Yee, CM. The enhanced examination for professional practice in psychology: a viable approach? Am Psychol 2020;75:52–65. https://doi.org/10.1037/amp0000586.

8. Shin, S, Kim, GS, Song, JA, Lee, I. Development of examination objectives based on nursing competency for the Korean nursing licensing examination: a validity study. J Educ Eval Health Prof 2022;19:19. https://doi.org/10.3352/jeehp.2022.19.19.

9. Craig, B, Wang, X, Sandella, J, Tsai, TH, Kuo, D, Finch, C. Examining concurrent validity between COMLEX-USA level 2-cognitive evaluation and COMLEX-USA level 2-performance evaluation. J Osteopath Med 2021;121:687–91. https://doi.org/10.1515/jom-2021-0007.

10. Maholtz, DE, Erickson, MJ, Cymet, T. Comprehensive osteopathic medical licensing examination-USA level 1 and level 2-cognitive evaluation preparation and outcomes. J Am Osteopath Assoc 2015;115:232–5. https://doi.org/10.7556/jaoa.2015.046.

11. Hudson, KM, Tsai, T-HH, Finch, C, Dickerman, JL, Liu, S, Shen, L. A validity study of COMLEX-USA level 2-CE and COMAT clinical subjects: concurrent and predictive evidence. J Grad Med Educ 2019;11:521–6. https://doi.org/10.4300/JGME-D-19-00157.1.

12. Roberts, WL, Gross, GA, Gimpel, JR, Smith, LL, Arnhart, K, Pei, X, et al. An investigation of the relationship between COMLEX-USA licensure examination performance and state licensing board disciplinary actions. Acad Med 2020;95:925–30. https://doi.org/10.1097/ACM.0000000000003046.

13. Gimpel, JR, Horber, D, Sandella, JM, Knebl, JA, Thornburg, JE. Evidence-based redesign of the COMLEX-USA series. J Am Osteopath Assoc 2017;117:253–61. https://doi.org/10.7556/jaoa.2017.043.

14. Hudson, KM, Feinberg, G, Hempstead, L, Zipp, C, Gimpel, JR, Wang, Y. Association between performance on COMLEX-USA and the American College of Osteopathic Family Physicians in-service examination. J Grad Med Educ 2018;10:543–7. https://doi.org/10.4300/JGME-D-17-00997.1.

15. O’Neill, TR, Peabody, MR, Song, H. The predictive validity of the National Board of Osteopathic Medical Examiners’ COMLEX-USA examinations with regard to outcomes on American Board of Family Medicine examinations. Acad Med 2016;91:1568–75. https://doi.org/10.1097/ACM.0000000000001254.

16. Li, F, Arenson, E, Song, H, Bates, BP, Ludwin, F. Relationship between COMLEX-USA scores and performance on the American Osteopathic Board of Emergency Medicine Part I certifying examination. J Am Osteopath Assoc 2014;114:260–6. https://doi.org/10.7556/jaoa.2014.051.

17. Wenghofer, E, Boulet, J. Medical Council of Canada qualifying examinations and performance in future practice. Can Med Educ J 2022;13:53–61. https://doi.org/10.36834/cmej.73770.

18. National Board of Osteopathic Medical Examiners. COMLEX-USA master blueprint. https://www.nbome.org/assessments/comlex-usa/master-blueprint/ [Accessed 6 Dec 2022].

19. De Champlain, AF. A primer on classical test theory and item response theory for assessments in medical education. Med Educ 2010;44:109–17. https://doi.org/10.1111/j.1365-2923.2009.03425.x.

20. National Board of Osteopathic Medical Examiners. COMLEX-USA Level 2-PE. https://www.nbome.org/assessments/comlex-usa/comlex-usa-level-2-pe/ [Accessed 30 Nov 2022].

21. Sandella, JM, Boulet, JR, Langenau, EE. An evaluation of cost and appropriateness of care as recommended by candidates on a national clinical skills examination. Teach Learn Med 2012;24:303–8. https://doi.org/10.1080/10401334.2012.715259.

22. Smith, LL, Xu, W, Sandella, JM, Dowling, DJ. Osteopathic manipulative treatment technique scores on the COMLEX-USA level 2-PE: an analysis of the skills assessed. J Am Osteopath Assoc 2016;116:392–7. https://doi.org/10.7556/jaoa.2016.080.

23. National Board of Osteopathic Medical Examiners. COMLEX-USA Level 3 blueprint. https://www.nbome.org/assessments/comlex-usa/comlex-usa-level-3/blueprint/ [Accessed 20 Nov 2022].

24. Shao, C, Liu, S, Yang, H, Tsai, TH. Automated test assembly using SAS operations research software in a medical licensing examination. Appl Psychol Meas 2020;44:219. https://doi.org/10.1177/0146621619847169.

25. Kolen, MJ. Linking assessments: concept and history. Appl Psychol Meas 2004;28:219–26. https://doi.org/10.1177/0146621604265030.

26. Webb, NM, Shavelson, RJ, Haertel, EH. Reliability coefficients and generalizability theory. Handb Stat 2006;26:81–124. https://doi.org/10.1016/S0169-7161(06)26004-8.

27. Rudner, LM. Expected classification accuracy. Practical Assess Res Eval 2005;10:1–4. https://doi.org/10.7275/56a5-6b14.

28. Prentice, DA, Miller, DT. When small effects are impressive. In: Kazdin, AE, editor. Methodological issues and strategies in clinical research, 4th ed. American Psychological Association; 2016:99–105 pp. https://doi.org/10.1037/14805-006.

29. De Champlain, A. Standard setting methods in medical education. In: Swanwick, T, Forrest, K, O’Brien, C, editors. Understanding medical education. Wiley; 2014:305–16 pp. https://doi.org/10.1002/9781118472361.ch22.

30. Cook, DA, Hatala, R. Validation of educational assessments: a primer for simulation and beyond. Adv Simul 2016;1:1–12. https://doi.org/10.1186/s41077-016-0033-y.

31. Rubright, JD, Jodoin, M, Barone, MA. Examining demographics, prior academic performance, and United States medical licensing examination scores. Acad Med 2019;94:364–70. https://doi.org/10.1097/ACM.0000000000002366.

32. Davis, D, Dorsey, JK, Franks, RD, Sackett, PR, Searcy, CA, Zhao, X. Do racial and ethnic group differences in performance on the MCAT exam reflect test bias? Acad Med 2013;88:593–602. https://doi.org/10.1097/ACM.0b013e318286803a.

33. Balart, P, Oosterveen, M. Females show more sustained performance during test-taking than males. Nat Commun 2019;10:1–11. https://doi.org/10.1038/s41467-019-11691-y.

34. Clauser, BE, Margolis, MJ, Swanson, DB. Issues of validity and reliability for assessments in medical education. In: Holmboe, ES, Durning, SJ, Hawkins, RE, editors. Practical guide to the evaluation of clinical competence, 2nd ed. Philadelphia: Elsevier; 2018:22–36 pp.

35. Pugh, DM, Wood, TJ, Boulet, JR. Assessing procedural competence: validity considerations. Simulat Healthc J Soc Med Simulat 2015;10:288–94. https://doi.org/10.1097/SIH.0000000000000101.

36. Downing, SM. Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv Health Sci Educ Theory Pract 2002;7:235–41. https://doi.org/10.1023/A:1021112514626.

37. Strickland, OL. Using factor analysis for validity assessment: practical considerations. J Nurs Meas 2003;11:203–5. https://doi.org/10.1891/jnum.11.3.203.61274.

38. Woodworth, GE, Goldstein, ZT, Ambardekar, AP, Arthur, ME, Bailey, CF, Booth, GJ, et al. Development and pilot testing of a programmatic system for competency assessment in US anesthesiology residency training. Anesth Analg 2023. https://doi.org/10.1213/ANE.0000000000006667.

Received: 2023-01-10
Accepted: 2024-02-14
Published Online: 2024-03-19

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
