
Comparison of 3-Option and 4-Option Multiple Choice Questions and Related Issues and Practices

Single-best-answer multiple choice questions (MCQs) are the most common question type in objective testing. While there are numerous variations of this question type, the most frequently used MCQs consist of a stem or scenario followed by three to five options, with the four-option MCQ being the most common. The number of options often depends on factors such as the content being assessed, the cognitive level required to answer the question, the resources available for test development and prior item statistics. Drawing on the empirical studies reviewed and conducted by the authors, this article presents practical methods and considerations for determining the optimal number of options when developing MCQs.

Although three-option MCQs are generally less resource-intensive to develop, test developers tend to favor four- or five-option MCQs out of concern that fewer options compromise psychometric properties (Baghaei et al., 2011; Haladyna, 1993; Owen, 1987; Vyas et al., 2008). On the other hand, the psychometric literature suggests that the statistical performance of an MCQ is not necessarily a function of the number of options but may depend more on the quality of the options, including both the keyed response and the distractors (i.e., the incorrect choices). For this reason, it is critical to examine how each option functions before deciding whether a three-option MCQ can effectively assess test takers’ ability and knowledge in a specific context. In general, three options are sufficient for an MCQ if the two distractors are functional (i.e., they have reasonable selection rates, and their correlations with the total score are lower than the correlation between the correct answer and the total score). In other words, having more options is not necessarily more effective if one or more distractors are non-functional (e.g., a distractor with a low selection rate or a high correlation with the total score).

The first section of this article summarizes common methods for identifying non-functioning distractors that could be removed from an existing MCQ with known statistics. The second section uses two case studies to show how to compare the effectiveness of three- and four-option MCQs by analyzing the psychometric properties of otherwise identical questions in the two formats.

Methods to Identify Non-functioning Distractors

Three methods are commonly used to identify non-functioning distractors.

  1. Low Selection Rate: Non-functioning distractors (NFDs) are commonly identified as distractors selected by 5% or fewer of the total testing population (Tarrant et al., 2009). Items are then categorized by the number of NFDs they contain (i.e., one, two or three NFDs per item). For four-option MCQs, items with one NFD are, in effect, unintentionally performing as three-option questions. Items with one NFD can then be assessed to determine whether they meet the item characteristic requirements for use on a test (e.g., difficulty, discrimination).

    Effect sizes can also be calculated from selection frequencies, where an option’s selection frequency is the number of examinees who choose it. The two least frequently selected distractors (i.e., the worst and second-worst performing) are compared to determine whether the difference between them is significant. Items for which the difference is not significant are then examined further to determine whether they still meet the item characteristic requirements to be considered functional.

  2. Choice Means: Choice means are used as a secondary method of effect size comparison. According to Haladyna and Rodriguez (2013), the mean total score of respondents who choose a distractor should be lower than the mean total score of respondents who choose the correct answer; when those two values are similar, the distractor requires revision or removal. The distractor whose choice mean is closest to that of the correct answer is isolated, and an effect size is calculated between the two options. For pairings that show non-significant effect sizes, the distractor is assessed against the aforementioned item characteristics to determine functionality. A computational sketch of the selection-rate and choice-mean checks follows this list.
  3. Trace Lines: Trace lines, or item characteristic curves, can also be analyzed to identify three possible conditions outlined by Haladyna and Rodriguez (2013). The presence of any of these conditions suggests some form of non-functionality in the item and warrants further analytical consideration. Figures 1-3 illustrate the three conditions, and a sketch for producing empirical trace lines follows the figures.
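
To make the selection-rate and choice-mean checks concrete, the following is a minimal sketch assuming examinee responses to a single item are stored as a NumPy array of option labels alongside each examinee’s total score. The function names, the simulated data and the Cohen’s d comparison are illustrative assumptions, not the procedures used in the studies cited here; only the 5% selection-rate cutoff comes from the literature above.

```python
import numpy as np

def distractor_report(item_choices, key, total_scores,
                      options=("A", "B", "C", "D"), nfd_rate=0.05):
    """Option-level statistics for one MCQ: selection rates and choice means.

    item_choices : option label each examinee selected on this item
    key          : the keyed (correct) option
    total_scores : each examinee's total test score
    """
    item_choices = np.asarray(item_choices)
    total_scores = np.asarray(total_scores, dtype=float)
    n = len(item_choices)
    key_mean = total_scores[item_choices == key].mean()  # choice mean of the key

    report = {}
    for opt in options:
        chose = item_choices == opt
        rate = chose.sum() / n                            # selection rate
        choice_mean = total_scores[chose].mean() if chose.any() else np.nan
        report[opt] = {
            "selection_rate": round(rate, 3),
            "choice_mean": round(choice_mean, 2) if chose.any() else np.nan,
            # distractors chosen by 5% or fewer examinees are candidate NFDs
            "low_selection": (opt != key) and rate <= nfd_rate,
            # a distractor whose choice mean approaches or exceeds the key's is suspect
            "choice_mean_issue": (opt != key) and chose.any() and choice_mean >= key_mean,
        }
    return report

def cohens_d(x, y):
    """Standardized mean difference between two groups of total scores."""
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return (x.mean() - y.mean()) / pooled_sd

# Simulated example: 200 examinees, key 'A', distractor 'D' rarely selected.
rng = np.random.default_rng(0)
choices = rng.choice(["A", "B", "C", "D"], size=200, p=[0.60, 0.20, 0.17, 0.03])
totals = rng.normal(70, 8, size=200) + 5 * (choices == "A")
print(distractor_report(choices, "A", totals))
print(cohens_d(totals[choices == "A"], totals[choices == "B"]))  # key vs. a distractor
```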

Figure 1: Trace Line Condition 1 – Equal selection of distractors across skill levels

Figure 2: Trace Line Condition 2 – Low frequency of selection across skill levels

Figure 3: Trace Line Condition 3 – Nonmonotonic increase on a distractor, pulling middle performers
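
The trace lines in Figures 1-3 can be approximated empirically by grouping examinees into ability bands and computing the proportion selecting each option within each band. The sketch below is one such approximation using total-score quantiles; the band count, function name and reading guide are illustrative assumptions rather than a prescribed procedure.

```python
import numpy as np

def empirical_trace_lines(item_choices, total_scores,
                          options=("A", "B", "C", "D"), n_bands=5):
    """Proportion of examinees selecting each option within each ability band."""
    item_choices = np.asarray(item_choices)
    total_scores = np.asarray(total_scores, dtype=float)
    # Split examinees into ability bands using total-score quantiles.
    interior_edges = np.quantile(total_scores, np.linspace(0, 1, n_bands + 1))[1:-1]
    bands = np.digitize(total_scores, interior_edges)   # band index 0 .. n_bands - 1
    lines = {opt: np.full(n_bands, np.nan) for opt in options}
    for b in range(n_bands):
        in_band = bands == b
        if in_band.any():
            for opt in options:
                lines[opt][b] = np.mean(item_choices[in_band] == opt)
    return lines  # plot each option's array against band index to inspect its trace line

# Reading the output: a distractor whose proportions are flat across bands (Condition 1),
# uniformly near zero (Condition 2) or peaking in the middle bands (Condition 3)
# warrants further review.
```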

Comparison of Psychometric Properties Between 3- and 4-Option MCQs

Once an NFD or a least-effective distractor is identified, a four-option MCQ can be converted to a three-option MCQ. Various psychometric properties can then be examined to determine whether the new three-option MCQ is psychometrically sound and defensible. Commonly analyzed properties include test-taker performance, response time, item statistics (such as difficulty, discrimination and, if Item Response Theory (IRT) is used in scoring and test modeling, fit statistics) and exam reliability. Involving subject matter experts in the statistical review stage is recommended. Two conversion case studies are summarized below to demonstrate the feasibility of the process and to share lessons learned, one based primarily on classical test theory and the other on IRT.

Case Study From the American Board of Anesthesiology (ABA) 

This study (Chen et al., 2023) was conducted by the American Board of Anesthesiology (ABA), which piloted three-option MCQs in 2020 for its subspecialty in-training examinations in Critical Care Medicine (ITE-CCM) and Pediatric Anesthesiology (ITE-PA). The three-option MCQs were derived from four-option MCQs using two approaches:

  1. ABA staff editors removed the distractor that no examinees chose in the four-option format administered in 2019.
  2. Exam committee members used their best judgment to determine which distractor to remove based on the distractor analysis of the four-option items.

The purpose of the analysis was to compare the 2019 four-option MCQs with the 2020 three-option MCQs in terms of physician performance, response time, and item and exam characteristics.

Physician performance on the exam was measured by individual percent-correct scores. Response time was the number of seconds spent per item. Item difficulty was defined as the percent-correct score for each item (i.e., the p-value), and item discrimination was measured by the corrected point-biserial correlation (cRpb) between item correctness and the total score computed from the remaining items on the exam form. Exam reliability was calculated using the Kuder-Richardson Formula 20 (KR-20).
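
For illustration, the classical statistics named above can be computed from a scored (0/1) persons-by-items matrix roughly as follows. This is a generic sketch with simulated data, not the ABA’s analysis code, and the function and variable names are assumptions.

```python
import numpy as np

def classical_item_stats(scored):
    """Classical test theory statistics from a persons-by-items 0/1 matrix."""
    scored = np.asarray(scored, dtype=float)
    n_persons, n_items = scored.shape
    total = scored.sum(axis=1)

    p_values = scored.mean(axis=0)          # item difficulty (proportion correct)

    # Corrected point-biserial: correlate each item with the total of the *other* items.
    crpb = np.empty(n_items)
    for i in range(n_items):
        rest = total - scored[:, i]
        crpb[i] = np.corrcoef(scored[:, i], rest)[0, 1]

    # KR-20: (k / (k - 1)) * (1 - sum(p * q) / variance of total scores)
    k = n_items
    pq = p_values * (1 - p_values)
    kr20 = (k / (k - 1)) * (1 - pq.sum() / total.var(ddof=1))

    return p_values, crpb, kr20

# Example with simulated responses: 150 examinees, 30 items of varying difficulty.
rng = np.random.default_rng(1)
ability = rng.normal(size=(150, 1))
difficulty = np.linspace(-1.5, 1.5, 30)
scored = (rng.random((150, 30)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)
p_vals, crpb, kr20 = classical_item_stats(scored)
print(p_vals.round(2), crpb.round(2), round(kr20, 2))
```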

For the ITE-CCM exam, 152 CCM fellows attempted the 2019 four-option items and 123 CCM fellows attempted the 2020 three-option items. Physicians scored 2 percentage points higher on the three-option CCM MCQs than on their four-option counterparts (68% vs. 66%). Assuming similar abilities between the two adjacent cohorts of CCM fellows in-training, this may indicate that the three-option CCM MCQs were slightly easier. For the ITE-PA, 113 PA fellows attempted the 2019 four-option items and 132 PA fellows attempted the 2020 three-option items; there were no statistically significant differences in physician performance or item difficulty (both 72%). Physicians answered three-option MCQs faster than four-option MCQs (3.4 and 1.3 fewer seconds per item for the ITE-CCM and ITE-PA, respectively). There was no difference in item discrimination between the three- and four-option formats for either the ITE-CCM (averages of 0.12 and 0.13, respectively) or the ITE-PA (averages of 0.09 and 0.08, respectively). The reliability of the ITE-CCM was comparable across formats (0.74 for three-option and 0.75 for four-option), and the reliability of the ITE-PA was slightly higher for the three-option (0.67) than the four-option (0.62) exams1.

In summary, there were minimal changes in physician performance and psychometric properties when changing MCQs from four- to three-option. Based on these pilot results, the ABA transitioned its written examinations from four-option to three-option in phases by 2023.

Case Study From a Professional Exam

The data from this study came from a U.S.-based licensure program that owns and maintains multiple computer-based licensing examinations. The examinations were administered to candidates in the United States and Canada. Each examination targeted a different level of competency, ranging from basic to advanced practice.

Conversion of four-option MCQs was proposed based on NFD observations. Three examinations were included in the conversion process: basic, intermediate and advanced. First, the item bank for each examination went through a statistical review. MCQs with at least one NFD were flagged for content review. Next, content experts reviewed each flagged question to ensure the appropriateness of converting the item to a three-option format. Items that passed the content review were converted from four options to three options by removing the non-functioning distractor. No other changes were made to the converted questions.

The study investigated the impact of four- to three-option conversion on the psychometric properties including item difficulty, discrimination and Rasch model fit statistics. For each of the three examinations, new data was collected on 72 converted items (216 items across all three examinations). Data included 688 candidates for the basic exam, 5,405 candidates for intermediate and 5,188 candidates for advanced. Based on these data, new statistics were computed for each converted item and compared to the item statistics of the original four-option items.

Changes in item difficulty were measured by an IRT-based “drift” analysis, which assessed whether an item’s difficulty drifted to a statistically significant degree, becoming either significantly easier or significantly more difficult after the conversion. Drift is measured on the logit scale, and a change in difficulty of more than 0.5 logits was classified as significant drift. Of the 216 converted items, 24 (11%) were flagged with significant difficulty drift; of these, 11 (5%) became more difficult and 13 (6%) became easier. Note that some amount of drift is typical even among items that have undergone no content changes, as certain content areas may become less relevant or less emphasized in educational programs over time. The average difficulty drift over all 216 items was near zero, suggesting that the converted items as a whole showed no systematic increase or decrease in difficulty.
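
As a rough sketch of this kind of drift check, the code below assumes that Rasch difficulty estimates in logits, placed on a common scale, are already available for the original and converted versions of each item (for example, from separate calibrations with common anchoring). The values and names shown are hypothetical; only the 0.5-logit threshold comes from the study described above.

```python
import numpy as np

# Hypothetical calibrated item difficulties (logits) from the original four-option
# administration and the new three-option administration, placed on a common scale.
b_original = np.array([-0.8, 0.2, 1.1, -0.3, 0.6])
b_converted = np.array([-0.7, 0.9, 1.0, -0.2, 0.1])

drift = b_converted - b_original          # positive = item became more difficult
flagged = np.abs(drift) > 0.5             # significant-drift threshold used in the study

print("mean drift:", drift.mean().round(2))
for i, (d, f) in enumerate(zip(drift, flagged)):
    direction = "harder" if d > 0 else "easier"
    print(f"item {i}: drift {d:+.2f} logits{' -> flag (' + direction + ')' if f else ''}")
```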

Changes in item discrimination were also investigated. Discrimination was measured using the point-biserial correlation. Of the 216 converted items, only four (2%) had point-biserial correlations that dropped below 0.1. Of these, none were negative. The average difference in point-biserial correlation between the converted (three-option) and original (four-option) versions was -0.02. These findings suggest that the impact of conversion on item discrimination was negligible.

Finally, changes in model fit statistics were compared between the converted and original versions. Two fit statistics were used to examine model fit: mean-square infit and mean-square outfit. Both statistics measure how well the data fit the underlying measurement model (here, the IRT-based Rasch model). Values of either infit or outfit greater than 1.5 suggest poor model fit (i.e., “misfit”), indicating an item that is not suitable for measurement. Of the 216 converted items, only two (< 1%) showed misfit on the infit statistic and only four (2%) showed misfit on the outfit statistic. The average difference between the converted and original items (over all 216 items) was 0.00 for infit and 0.01 for outfit. These findings suggest negligible impact of the conversion on model fit.
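
For reference, the sketch below shows a standard way to compute mean-square infit and outfit under the Rasch model, given person ability and item difficulty estimates in logits. It is a generic formulation with assumed inputs, not the program’s calibration code.

```python
import numpy as np

def rasch_fit_statistics(scored, theta, b):
    """Mean-square infit and outfit for each item under the Rasch model.

    scored : persons-by-items 0/1 response matrix
    theta  : person ability estimates (logits)
    b      : item difficulty estimates (logits)
    """
    scored = np.asarray(scored, dtype=float)
    # Model-expected probability of a correct response for each person-item pair.
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    w = p * (1 - p)                           # binomial variance of each response
    z2 = (scored - p) ** 2 / w                # squared standardized residuals

    outfit = z2.mean(axis=0)                                  # unweighted mean square
    infit = ((scored - p) ** 2).sum(axis=0) / w.sum(axis=0)   # information-weighted

    return infit, outfit

# Items with infit or outfit above 1.5 would be flagged as misfitting.
```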

The same comparisons described above were also performed by exam level. The exam-level results are consistent with the overall results, all showing minimal impact. The basic exam showed slightly more poorly performing questions after conversion, but this difference is likely due to its lower candidate volume compared to the intermediate and advanced exams.

Overall, these studies provide empirical evidence that converting four-option MCQs to three options by removing an NFD has minimal impact on the psychometric properties of the questions. Some performance changes were observed, but the changes in item statistics were not outside what is typically observed during item review. These findings are encouraging, but caution should be exercised not to extrapolate the results to other testing programs. Test developers considering similar four- to three-option conversions are advised to carry out comparable analyses to understand the impact within their own testing context, including the complexity of the construct being measured and the ability distribution of their candidate population.

These findings also support the notion that the quality of distractors matters more than their quantity. Regardless of the number of MCQ options, test developers are advised to follow best practices by carefully applying the psychometric procedures described in this article to identify NFDs and to evaluate ways to improve the effectiveness and efficiency of the testing program. Factors such as subject matter experts’ involvement and test takers’ experience should also be considered in the decision making.


References

  • Baghaei, P., & Amrahi, N. (2011). The effects of the number of options on the psychometric characteristics of multiple choice items. Psychological Test and Assessment Modeling, 53, 192-211.
  • Chen, D., Harman, A. E., Sun, H., et al. (2023). A comparison of 3- and 4-option multiple-choice items for medical subspecialty in-training examinations. BMC Medical Education, 23, 286. https://doi.org/10.1186/s12909-023-04277-2
  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge. https://doi.org/10.4324/9780203850381
  • Owen, S. V., & Froman, R. D. (1987). What's wrong with three-option multiple choice items? Educational and Psychological Measurement, 47, 513-522. https://doi.org/10.1177/0013164487472027
  • Rogers, W. T., & Harley, D. (1999). An empirical comparison of three- and four-choice items and tests: Susceptibility to testwiseness and internal consistency reliability. Educational and Psychological Measurement, 59(2), 234-247. https://doi.org/10.1177/00131649921969820
  • Tarrant, M., Ware, J., & Mohammed, A. (2009). An assessment of functioning and nonfunctioning distractors in multiple-choice questions: A descriptive analysis. BMC Medical Education, 9, 40. https://doi.org/10.1186/1472-6920-9-40
  • Vyas, R., & Supe, A. (2008). Multiple choice questions: A literature review on the optimal number of options. The National Medical Journal of India, 21, 130-133.
  • Wolkowitz, A. A., Foley, B. P., Zurn, J., Owens, C., & Mendes, J. (2021). Making the switch from 4-option to 3-option multiple choice items without pretesting: Case study and results. Institute for Credentialing Excellence: Credentialing Insights. https://www.credentialinginsights.org/Article/making-the-switch-from-4-option-to-3-option-multiple-choice-items-without-pretesting-case-study-and-results-1

1 Please note that subspecialty ITEs are designed to help training programs evaluate fellows’ progress as they advance through subspecialty training, with no pass/fail decisions. Subspecialty certification exams include more items, are attempted by a more heterogeneous group of candidates (including retakers who failed previous attempts), and achieve higher reliability than ITEs (>0.80).