How ABIM Scores Assessments

July 30, 2024  |  Posted by ABIM  |  ABIM Process

ABIM treats the scoring of an assessment as a critical process because it carries significant consequences for examinees. Scoring is thorough and rigorous, involving several steps that must occur before the actual scoring even begins: careful development of assessment items (questions) by practicing physicians, followed by close scrutiny of each item by an Approval Committee of physician experts practicing in the field.

Next comes intensive statistical evaluation of the items and responses by psychometricians. Psychometrics, much like medicine, is a science of assessment governed by a set of guiding principles. Where medicine relies on consensus-driven guidelines, psychometrics follows the Standards for Educational and Psychological Testing, which outline best practices in assessment, including those employed at ABIM.

Every aspect of the testing standards is applied in scoring an ABIM assessment, from item analysis and automatic test assembly to equating and standard-setting. These steps ensure that assessments measure knowledge accurately and fairly, that questions have only one correct answer, and that each physician's score is adjusted for the difficulty of the test form, so the experience for examinees is similar across administrations. The ultimate goal of all ABIM assessments is to be fair and reliable and to support valid score interpretations.

Item analysis begins once examinee responses are in: each question is evaluated statistically using those responses. The analysis involves two metrics: item difficulty and item discrimination. Item difficulty refers to how hard or easy an item is for examinees to answer correctly. Item discrimination is the correlation between item performance (correct or incorrect responses) and overall exam performance (an examinee's total score).

This analysis ensures that questions accurately measure physician knowledge. Any question with skewed results (for example, one that examinees scoring high on the overall exam answer incorrectly while those scoring low answer correctly) is flagged, reviewed by the experts, and potentially removed from scoring. This step supports the validity and reliability of the assessment.
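The skewed pattern described above shows up statistically as a negative item discrimination. A minimal, hypothetical flagging rule might look like this; the threshold and function name are illustrative, not ABIM's:

```python
def flag_items(discriminations, threshold=0.0):
    """Return indices of items whose discrimination is at or below threshold.

    A value at or below zero means higher-scoring examinees did no better
    (or worse) on the item than lower-scoring examinees -- a signal that
    the item needs expert review before it can count toward scores.
    """
    return [i for i, d in enumerate(discriminations) if d <= threshold]

# Example: only the third item shows the problematic pattern.
flagged = flag_items([0.45, 0.30, -0.22, 0.51])
```

In practice, flagged items would go to subject-matter experts for review rather than being removed automatically.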

“The process ensures every assessment is fair and reflects current medical knowledge and practice,” said Rebecca Lipner, Ph.D., Senior Vice President of Assessment and Research. “Medicine changes at a rapid rate, and we want to make sure our assessments take that into account. Even more importantly, we want to make sure we are measuring what a group of practicing clinical experts in the specialty has determined is essential for physicians to know to practice in the field.”

Once initial processing and item analysis are complete and the team is confident that all questions are free from any known flaws and have a single best answer, the process of calibration and scoring begins.

ABIM uses Item Response Theory (IRT), widely considered best practice in the assessment and measurement field, to score its exams. IRT allows psychometricians to produce scores that are comparable across years and exam forms.
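One common form of IRT is the two-parameter logistic (2PL) model, sketched below. This is a minimal illustration of the general technique, not ABIM's actual scoring model, and the parameter values are invented. Because difficulty and discrimination are properties of each item rather than of the test form, abilities estimated from different forms land on a common scale, which is what makes scores comparable across administrations.

```python
import math

def prob_correct(theta, a, b):
    """2PL model: probability an examinee of ability theta answers correctly.

    theta = examinee ability, a = item discrimination, b = item difficulty,
    all on the IRT logit scale. An examinee whose ability equals the item's
    difficulty has a 50% chance of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

For instance, `prob_correct(0.0, 1.0, 0.0)` is 0.5, and the probability rises as ability exceeds the item's difficulty.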

Medicine can change rapidly as new research and guidelines are published. To ensure the keyed answers remain in line with current best practices, psychometricians and exam developers review the item analysis results along with examinees' comments and flag any items with potential issues.

Information on potentially problematic questions is sent to the chair of the specialty's Approval Committee for review. If the chair determines that the question lacks a single best answer or is otherwise flawed, the item is removed from scoring.

“We do our best to make sure each assessment is comparable no matter when it is taken,” said Dr. Lipner. “The automatic test assembly and equating processes make sure that scores and standards are comparable across administrations. Scoring an assessment is a multi-step, thorough process that ensures that the results are fair, reliable and valid.”