Measurement and Standard Error

This morning when I stepped on my bathroom scale and felt that familiar twinge of guilt and disappointment, I quickly reminded myself that bathroom scales are imperfect measuring devices.  In all probability, my true weight falls within some range of possible values, centered roughly on the indicated weight. Quickly, I calculated how much less I wanted it to be.  That value, I decided, must surely be the margin of error for my bathroom scale.

Geeky rationalizations aside, the act of measuring human (or other) attributes is always an imperfect science. Whether we’re trying to measure weight with a bathroom scale, height with a tape measure, or academic achievement using the MAP assessments, there is always some wiggle room in our measurements because there are limits to how precisely these quantities can be measured. Observed MAP scores are always reported with an associated standard error of measurement (SEM). For example, if a student scored a 195 on the MAP Reading test with an SEM of 3 RIT points, then within the limits of our ability to measure, 195 is the student’s most probable score, but the “true” score could be a little bit higher or a little bit lower. The standard error tells us just how much higher or lower. Smaller standard errors mean more precise measurements.
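To make the 195-with-an-SEM-of-3 example concrete, here is a minimal Python sketch that turns an observed score and its SEM into a range of plausible scores. The function name `score_band` and the choice of one standard error on either side are illustrative assumptions, not part of any MAP report.

```python
def score_band(observed, sem, n_sems=1.0):
    """Return the (low, high) range within n_sems standard errors of the observed score.

    The default width of one SEM on each side is an illustrative assumption.
    """
    return observed - n_sems * sem, observed + n_sems * sem


low, high = score_band(195, 3)
print(f"Observed 195, SEM 3: plausible range {low:g} to {high:g}")
# prints: Observed 195, SEM 3: plausible range 192 to 198
```

A smaller SEM narrows the range, which is what "more precise" means here.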

In general, the precision of observed MAP scores can be boosted (i.e., SEMs decreased) in two ways: by increasing the number of items within a test event, and by including only items whose difficulty is reasonably close to the student’s current achievement level. More items within a test mean more opportunities to observe the student’s achievement, and consequently greater precision. At the same time, incorrect responses to items that were far too difficult, and correct responses to items that were far too easy, provide little information about a student’s current achievement. Test length is why reading tests, with about 42 items each, tend to have slightly larger SEMs than math tests, which have about 50 items each. Item targeting explains why adaptive tests tend to be more precise than fixed form tests of similar length, since adaptive tests select harder items when a student does well and easier items when the student does poorly. For a fixed form test to measure students of all achievement levels with equal precision, it would have to be far too long to be practical.
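Both effects can be illustrated with a small Python sketch. It assumes a one-parameter logistic (Rasch) item response model, which the text above does not specify, and it works in logits rather than RIT points. The item difficulties are made up for illustration, and the function names are hypothetical.

```python
import math


def rasch_information(ability, difficulty):
    """Fisher information of one Rasch item at the given ability (both in logits)."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))  # probability of a correct response
    return p * (1.0 - p)  # largest (0.25) when difficulty equals ability


def sem(ability, difficulties):
    """Standard error of the ability estimate: 1 / sqrt(total test information)."""
    total = sum(rasch_information(ability, b) for b in difficulties)
    return 1.0 / math.sqrt(total)


ability = 0.0
well_targeted_42 = [0.0] * 42    # 42 items matched to the student
well_targeted_50 = [0.0] * 50    # 50 items matched to the student
poorly_targeted_50 = [3.0] * 50  # 50 items that are far too hard

print(f"42 well-targeted items:   SEM = {sem(ability, well_targeted_42):.3f} logits")
print(f"50 well-targeted items:   SEM = {sem(ability, well_targeted_50):.3f} logits")
print(f"50 poorly targeted items: SEM = {sem(ability, poorly_targeted_50):.3f} logits")
```

Under these assumptions the 50-item test has a somewhat smaller SEM than the 42-item test, while 50 items that are far too hard give a much larger SEM than either.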

Standard errors combine when we’re trying to measure an individual over time. For example, if I want to know how much growth has taken place for a student over time, and administer MAP in the fall and again in the spring, I need to consider the standard errors of measurement at both time points in order to make a realistic assessment of how much growth has occurred. If the reading student from the example above were measured a second time, and scored a 212 with a standard error of 3, then the observed growth would be 17 RIT points. The standard error of the change score would be 4.24, which is simply the square root of the sum of the squared individual standard errors (here, the square root of 9 + 9 = 18). In this example, the change from fall to spring (17 points) is about four times the standard error of the change score (4.24), so we can be very comfortable in concluding that real growth has taken place. However, if our hypothetical student had only scored a 199 (with a standard error of 3) on the second test administration, our conclusions would be much less certain. In this second hypothetical, the observed growth is only 4 points, and the standard error of growth is still 4.24. In other words, the observed growth is smaller than its standard error, so we cannot conclude with any certainty that real improvement has occurred.
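The arithmetic for both hypothetical spring scores can be reproduced with a short Python sketch. The function name `growth_with_error` is illustrative, and combining the errors this way assumes the fall and spring measurement errors are independent.

```python
import math


def growth_with_error(score1, sem1, score2, sem2):
    """Return observed growth and the standard error of that change score.

    Assumes the two measurement errors are independent, so their squares add.
    """
    growth = score2 - score1
    se_growth = math.sqrt(sem1 ** 2 + sem2 ** 2)
    return growth, se_growth


for spring in (212, 199):
    growth, se = growth_with_error(195, 3, spring, 3)
    print(f"spring={spring}: growth={growth}, SE={se:.2f}, growth/SE={growth / se:.2f}")
# prints:
# spring=212: growth=17, SE=4.24, growth/SE=4.01
# spring=199: growth=4, SE=4.24, growth/SE=0.94
```

The ratio of growth to its standard error is a quick way to see how the two cases differ: about 4 in the first, and less than 1 in the second.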

While standard errors can sometimes be troublesome for interpreting individual scores, they are less so when examining groups. The reason for this is that under most circumstances those measurement errors are random. Sometimes an observed score is a little bit high, sometimes a little bit low. And for the most part, when you look at the group, the errors tend to balance each other out. This is why a group’s mean can be measured much more precisely (that is, with a much lower standard error) than an individual’s score. In other words, even when individuals show little growth over time, group level growth can be measured with much greater precision and certainty than individual level growth.
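The way random errors balance out can be sketched in Python. This minimal example assumes every student has the same SEM of 3 and that measurement errors are independent across students. It covers only the measurement-error part of the group mean's uncertainty and ignores sampling variation between students. The function name is hypothetical.

```python
import math


def se_of_group_mean(sems):
    """Measurement-error standard error of a group's mean score.

    Assumes each student's measurement error is independent of the others.
    """
    n = len(sems)
    return math.sqrt(sum(s ** 2 for s in sems)) / n


for n in (1, 25, 100):
    print(n, round(se_of_group_mean([3.0] * n), 2))
# prints:
# 1 3.0
# 25 0.6
# 100 0.3
```

With equal SEMs this reduces to the individual SEM divided by the square root of the group size, so a class of 25 has one fifth the measurement error of a single student.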

All achievement tests contain some amount of measurement error. But because MAP adapts to a student’s current achievement level, MAP scores are as precise as they can be, and far more precise than fixed form tests of similar length. Understanding students’ observed scores, and what the standard errors tell us about those scores, can help us to set more reasonable goals and draw more valid conclusions about students’ performance and growth in achievement over time.