I mean it. I really do. Let me explain why as quickly as I can.
First, what do I mean by error?
Think about a typical state test – it’s a paper-and-pencil test designed to measure proficiency at a grade level. The questions ask about the content in the state standards, with their difficulty pitched at that particular grade level. Are we good so far? Good.
For kids who are on or very close to grade level, there are some questions on the test that they will get right and some that they will get wrong. Because there is some variation in the rightness and wrongness of their answers (if those are even words), the test has a decent amount of information to work with to determine what a student knows and can do. The more information, the smaller the error.
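Here’s a rough sketch of the information-and-error tradeoff, using the classical test theory formula for the standard error of measurement (SEM). The numbers are purely illustrative, not from any actual state test:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability).

    More informative items -> higher reliability -> smaller error band
    around a student's observed score."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical test scaled with a standard deviation of 15 score points.
low_info = standard_error_of_measurement(15, 0.80)   # fewer informative items
high_info = standard_error_of_measurement(15, 0.95)  # more informative items
print(round(low_info, 1), round(high_info, 1))       # 6.7 vs 3.4 score points
```

Same test scale, same kid – but the error band around the score shrinks by half when the items carry more information about that kid’s level.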
Let me ask you something. How many kids in a classroom are on grade level? Isn’t it like this picture where a decent percentage of kids are either above or below grade level?
For the error to be small, the questions asked of a kid need to be close to their level of knowledge. This takes a big question bank, and a test that can present any item from that bank to any child – just sticking to grade-level questions doesn’t do it.
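To make the item-bank point concrete, here is a deliberately simplified sketch of adaptive item selection – my own toy illustration, not any state’s actual testing engine. Each question is chosen near the student’s current estimated level, which only works if the bank reaches well below and above grade level:

```python
# Item bank spanning difficulties from -3.0 (far below grade level)
# to +3.0 (far above). All values here are hypothetical.
bank = [d / 10 for d in range(-30, 31)]

def next_item(estimate, used):
    """Pick the unused item whose difficulty is closest to the estimate."""
    return min((d for d in bank if d not in used),
               key=lambda d: abs(d - estimate))

def administer(true_ability, n_items=8):
    """Deterministic toy: the student answers correctly iff the item is at
    or below their true level; the estimate steps toward the evidence."""
    estimate, used = 0.0, set()
    for _ in range(n_items):
        d = next_item(estimate, used)
        used.add(d)
        estimate += 0.5 if true_ability >= d else -0.5
    return estimate

# The estimate homes in on each student's actual level -- but only because
# the bank contains off-grade-level items for it to reach.
print(administer(1.5), administer(-2.0))
```

Restrict that bank to difficulties near 0 (grade level only) and the loop has nothing useful to ask a kid sitting at +1.5 or −2.0 – which is exactly where the error balloons.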
Keep this phrase in the back of your mind . . . garbage in, garbage out. Here are examples from two hot topics: using assessment data in evaluating teachers, and AYP determinations.
The Kingsbury Center published a study (check it out if you want) in November 2011 about selecting assessments for use in evaluating teachers. A few selected excerpts:
- This policy brief will discuss why state proficiency exams as they currently exist are not an appropriate foundation for computing value-added measures . . .
- For students whose true performance lies at the ends of the normal distribution, the measurement error of minimum proficiency assessments can be shockingly poor.
- At the 25 student [single class simulation] level, the Value Added Model (VAM) based on the Texas Assessment of Knowledge and Skills (TAKS) misidentifies 35% of all teachers. A misidentified teacher was one who appeared to have growth which was incorrect by more than one-half a year (less than .5 years or more than 1.5 years).
Do you think that these 35% will think this process is fair?
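You can see how student-level measurement error snowballs into teacher-level misidentification with a toy Monte Carlo simulation. This is my own back-of-the-envelope sketch, not the Kingsbury Center’s model, and the noise level is an assumption chosen just for illustration:

```python
import random
import statistics

random.seed(42)

def simulate(n_teachers=10_000, class_size=25, student_sd=3.0):
    """Toy model: every teacher truly produces exactly 1.0 year of growth,
    but each of 25 students' measured growth carries noise (student_sd is
    a made-up error magnitude, in years). A teacher is 'misidentified'
    when the class average lands below 0.5 or above 1.5 years."""
    misidentified = 0
    for _ in range(n_teachers):
        growths = [random.gauss(1.0, student_sd) for _ in range(class_size)]
        if not 0.5 <= statistics.mean(growths) <= 1.5:
            misidentified += 1
    return misidentified / n_teachers

print(simulate())  # a substantial fraction of identical teachers flagged
```

Every simulated teacher is identical – the only thing separating the “good” ones from the “bad” ones is noise. With enough student-level error, rates in the neighborhood of the quoted 35% fall out of nothing but the measurement itself.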
A May 2004 Delaware Policy Brief called “Testing: Not an exact science” identified similar issues that have AYP implications:
- In 2001, 77% of third-grade students were accurately classified [into one of five performance levels] in reading. Consequently, in 2001 some 23% of Delaware’s third-grade students were misclassified.
- It was found that 75% of 8th grade students who took the math DSTP in 2003 were accurately classified, leaving 25% suffering from the “inevitable consequence” of imperfect measurement.
In Delaware, changes in an individual student’s performance level contribute to the growth model AYP determination. Do you care if your school is classified by NCLB as “Under Improvement” based on data that is wrong 25% of the time?
Now it’s your turn
So what do you think? If a large amount of measurement error puts the data portion of teacher evaluations on shaky ground and makes AYP determinations less than solid, can error change your life?