As we’ve all been taught, to dive safely into a swimming pool you must first be sure that the water is deep enough to extend past your head and feet. Ensuring the validity of computer adaptive tests (CATs) is no different; the item pool must be deep enough to stretch above and below a student’s entry point.
A well-constructed item pool is an essential part of a CAT such as MAP Growth. The pool needs enough items to build many individualized tests that align to students’ varying ability levels, and enough breadth to cover the scope of the content domain.
CATs adapt to individual student performance. They get harder or easier depending on how a student is performing on the test, which requires a deep item pool from which many different tests can be drawn.
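To make the adaptive idea concrete, here is a minimal Python sketch. It is not the MAP Growth algorithm. It assumes a simple one-parameter (Rasch) response model, a hypothetical pool represented as a plain list of item difficulties, and a made-up shrinking-step rule for updating the ability estimate. The names `rasch_probability` and `run_adaptive_test` are illustrative only.

```python
import math
import random


def rasch_probability(ability, difficulty):
    """Chance of a correct answer under a one-parameter (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def run_adaptive_test(pool, true_ability, test_length=10, start=0.0, seed=0):
    """Run a toy adaptive test.

    pool is a list of item difficulties (an assumption made for this sketch).
    Returns the final ability estimate and the difficulties administered.
    """
    rng = random.Random(seed)
    remaining = list(pool)  # copy so the caller's pool is left untouched
    estimate = start
    step = 1.0
    administered = []

    for _ in range(test_length):
        if not remaining:
            break  # a shallow pool runs dry before the test is finished

        # Pick the unused item whose difficulty is closest to the estimate.
        item = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(item)
        administered.append(item)

        # Simulate the student's response, then move the estimate up after
        # a correct answer and down after an incorrect one.
        correct = rng.random() < rasch_probability(true_ability, item)
        estimate += step if correct else -step

        # Shrink the step so the estimate settles (illustrative rule only).
        step = max(step * 0.7, 0.1)

    return estimate, administered


if __name__ == "__main__":
    deep_pool = [i / 10 for i in range(-30, 31)]
    estimate, items = run_adaptive_test(deep_pool, true_ability=1.0)
    print("final estimate:", round(estimate, 2))
    print("difficulties administered:", items)
```

With a pool of only three items, the same loop runs out of questions after three steps, which is the shallow-pool problem in miniature.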
A student’s grade level is not necessarily that student’s instructional readiness point; therefore, a CAT must adapt to measure on-, above-, and below-grade abilities. An assessment that informs educators about each student’s instructional readiness draws on content that spans grades. A deep item pool can provide this because it is stocked with items that correspond to many different grade levels.
How many items are enough?
The appropriate size of the item pool depends on four main factors.
- Precision is the first factor to consider, as it relates to the “estimate of student achievement that is desired” (Reckase, M.D. (2010). “Designing item pools to optimize the functioning of a computerized adaptive test.” Psychological Test and Assessment Modeling, 52(2), 127-141). The more precision you desire, the larger your item pool needs to be. If you are aiming to get just a rough estimate, you can use a smaller item pool.
- Range is another significant factor. How broad or narrow is the range of achievement to be measured? A very broad assessment requires a larger item pool, since it must include items across a wide range of difficulty. For example, if an assessment is being used to measure students’ performance at multiple depth of knowledge (DOK) levels, it will require a greater range of items than an assessment concerned with only one DOK level.
- Stakes are a third factor that will determine the item pool size requirement. If a CAT is very high stakes, students might be more likely to game the test. Large item pools improve the chance that examinees receive a different set of test items for every test administration, making it much harder to cheat the system.
- The number of times a CAT is administered is the fourth factor. If an assessment is administered to the same students multiple times a year, for instance, the item pool must be large enough to ensure that a student doesn’t see any item more than once.
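
The last two factors lend themselves to quick back-of-the-envelope arithmetic. The Python sketch below is only a rough illustration: it assumes items are drawn at random from the pool, which a real CAT does not do, and the test length (40), number of administrations (3), and pool size (400) are hypothetical values chosen for the example. The function names are illustrative as well.

```python
def min_items_without_repeats(test_length, administrations):
    """Fewest items one student needs so that no item is ever seen twice."""
    return test_length * administrations


def expected_overlap(pool_size, test_length):
    """Expected number of items shared by two tests drawn at random.

    Each item on the first test appears on the second with probability
    test_length / pool_size, so the expected overlap is
    test_length ** 2 / pool_size. Random drawing is a simplifying assumption.
    """
    if test_length > pool_size:
        raise ValueError("test_length cannot exceed pool_size")
    return test_length ** 2 / pool_size


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(min_items_without_repeats(test_length=40, administrations=3))
    print(expected_overlap(pool_size=400, test_length=40))
```

Under these assumptions, a 40-item test given three times a year needs at least 120 suitable items for a single student, and two randomly drawn 40-item tests from a 400-item pool would share 4 items on average. Because a real CAT concentrates on items near each student’s ability level, the practical requirement is higher than this floor.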
The goal is to have enough items in each desired content area to assemble an individual test with the balanced content coverage required by the test (Gu, L. & Reckase, M.D. (2007). “Designing optimal item pools for computerized adaptive tests with Sympson-Hetter exposure control.” In D.J. Weiss (Ed.), Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing. Retrieved 10/14/14 from www.psych.umn.edu/psylabs/CATCentral/).
A deep pool of items isn’t very valuable if the items themselves aren’t high quality. Field testing enables identification of items that are performing atypically. Poorly performing items should be removed from the item pool as soon as they are identified to avoid proficiency estimation errors. Additionally, a rigorous calibration process builds confidence that an item is likely a good measure of the attribute in question. Here, too, depth improves accuracy: the more student responses behind a calibration, the more it can be trusted. This is why we base calibration on more than 1,000 student responses from MAP Growth, a practice that makes ours one of the most stringent calibration processes in the education assessment field.
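
A small Python sketch can show why more responses help. It is a deliberately simplified stand-in for calibration, not the actual MAP Growth procedure: it only computes the standard error of an item’s observed proportion correct, and it assumes the responses are independent and all belong to a single item. The function name and the 0.5 proportion correct are hypothetical.

```python
import math


def proportion_standard_error(p_correct, n_responses):
    """Standard error of an observed proportion correct from n responses."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    if n_responses <= 0:
        raise ValueError("n_responses must be positive")
    return math.sqrt(p_correct * (1.0 - p_correct) / n_responses)


if __name__ == "__main__":
    # Hypothetical item that half of students answer correctly.
    for n in (100, 500, 1000):
        se = proportion_standard_error(0.5, n)
        print(f"{n} responses: standard error about {se:.3f}")
```

Because the error shrinks with the square root of the number of responses, going from 100 to 1,000 responses cuts it by a factor of roughly 3.2.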