Many common misconceptions and barriers, including assumptions about the value of subscores, are holding states back from innovating and improving summative assessments. As they're used right now, assessment subscores prevent states from adopting smarter, faster testing systems, but there are alternative visions for helping families and educators diagnose student learning needs.
Why we rely on subscores in the first place
The Every Student Succeeds Act (ESSA) requires states to provide diagnostic information for individual students. The intention behind providing such information makes sense: it's important to understand how students are doing in nuanced ways and to evaluate how well instructional practices and programs in schools are working. Diagnostic information should, by definition, reveal the cause or nature of a problem.
Assessment subscores are currently the primary approach states use to provide diagnostic information intended to inform how well students are doing in specific areas of learning, like algebraic thinking in math and informational reading in English language arts. There are serious drawbacks to how the ESSA policy is currently implemented, and there are challenges to ensuring the policy has a stronger impact.
The trouble with looking only at subscores
The problem with limiting diagnostic information to assessment subscores is that the number of items needed to provide reliable information about student subdomain knowledge is far greater than what is typically or reasonably included on one test, at least when using traditional psychometric methods.
Assessments used for accountability are currently developed so that scores can be compared across students and across time. One way test developers have historically ensured those comparisons are sound is by building assessments that follow a similar structure. As with a house, this structure is sketched out as a blueprint.
A test blueprint ensures each test has a similar structure by defining approximately how many items will be on the test in total and approximately how many of those items will measure each area. In this metaphor, the total item count is the size of the house, and the areas measured by subscores are its rooms.
Say we are building a mathematics test that measures the overall domain of mathematics. The blueprint will tell us what will be included in that test, such as questions on numbers and operations, fractions and decimals, algebraic operations, geometry, and data. The number of subdomains and how many questions address each determine the overall size of the test. What's missing, however, is a more detailed view. We know how big our house is and how many rooms are in it, but not how many doors and windows there are, for example.
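The blueprint idea can be sketched as a simple data structure. The subdomains and item counts below are hypothetical illustrations, not an actual state blueprint:

```python
# A hypothetical mathematics test blueprint: subdomains (the "rooms")
# mapped to the approximate number of items measuring each.
blueprint = {
    "numbers and operations": 12,
    "fractions and decimals": 10,
    "algebraic operations": 8,
    "geometry": 8,
    "data": 6,
}

# The total test length (the "size of the house") is just the sum.
total_items = sum(blueprint.values())
print(total_items)  # 44

# What the blueprint does NOT capture: the "doors and windows" --
# which specific skills within each subdomain the items address.
```

Notice that the blueprint constrains only the counts per subdomain; everything finer-grained than that is invisible at this level.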
Building assessments with this kind of comparability, at least in terms of face validity through common blueprints, has been the tradition. Naturally, there is a desire to interpret student performance in the subdomains for more diagnostic information. The challenge, then, is getting sufficient information about each subdomain without making tests longer. The more we observe a student's performance, the better we know what they do or don't know; yet most assessments built with traditional methods include only five or six items per subdomain. While such a small number is appropriate for ensuring a balanced representation of targeted subdomains for comparability, no one would argue that a five-item test is reliable or valid enough to produce a score or inform important decisions. Summative tests simply aren't always long enough to really provide useful diagnostic information from subscores.
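A standard psychometric result, the Spearman-Brown prophecy formula, makes the trade-off concrete: it predicts how a test's reliability changes as the test is lengthened. The numbers below are hypothetical, chosen only to illustrate why a five-item subscore falls short:

```python
def spearman_brown(rho: float, factor: float) -> float:
    """Predicted reliability when a test is lengthened by `factor`
    (Spearman-Brown prophecy formula)."""
    return factor * rho / (1 + (factor - 1) * rho)

# Illustrative (hypothetical) starting point: suppose a 5-item
# subscore has reliability 0.55 -- too low to support decisions.
five_item = 0.55

# Doubling to 10 items per subdomain helps, but only so much:
print(round(spearman_brown(five_item, 2), 2))  # 0.71

# Reaching a commonly cited ~0.85 would take roughly five times
# the items (25 per subdomain) -- and with several subdomains,
# the test quickly becomes impractically long.
print(round(spearman_brown(five_item, 5), 2))  # 0.86
```

This is why simply adding items is not a realistic path to reliable subscores on a summative test.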
How to get a more complete picture
ESSA doesn’t require diagnostic information to come solely from assessment subscores. In fact, many in the field have rightly warned against using subscores to extract instructionally useful information.
Rather than simply adding more questions and making tests overly lengthy, states have other options:
- Include other kinds of data sources, such as performance-based assessments, portfolios, and teacher-provided student evaluations. These, of course, would require extensive professional development and standardized processes, at minimum.
- Try adaptive assessment, like the state summative assessments in Nebraska and Alaska. Their first priority in adapting is to ensure each student receives a test aligned to an overall blueprint for comparability and to provide a defensible overall score. The assessments then also find out more about student knowledge in subdomains and produce a more reliable subscore. By using a constraint-based assessment engine, the tests have the potential to fully personalize a student’s assessment experience by adapting even more diagnostically.
- Extract more information from assessment items. Items are developed to determine what a student knows, but they also include valuable information we can use to infer what a student doesn’t know. Multiple-choice-item distractors are developed to model common mistakes, misunderstandings, and misconceptions, for example. Extended-response rubrics also highlight what a student doesn’t know in the lower score points of a rubric.
- Review and calibrate items to each state’s detailed achievement level descriptors, or ALDs. Teachers can see how a student’s overall score relates to detailed ALDs and explore what’s expected for getting to the next level or what concepts they need to review in prior levels, even across grades. Teachers can also see the standard and achievement level for each item each student received. This level of diagnostic information allows teachers to look at the data through the lens of what students know based on standards, achievement expectations, and what teachers have taught.
Change is possible
Providing meaningful information about how students are doing on state assessments is an important goal. If we truly want to make progress in this area, it’s vital we look at current policies and practices related to assessment subscores, consider advances in item development and assessment design, and even leverage information outside a singular test event. We believe we can do better.
What are your ideas on how we can improve diagnostic information from assessments? Let us know. We’re @NWEAPolicy on Twitter.
Thomas Christie contributed to this post. He is the senior director of learning and assessment engineering at NWEA, and his work focuses on maximizing the usefulness of educational data for students and teachers in the classroom.