One of the hallmarks, and frustrations, of the field of education is the imprecision of its language. In education we call the same thing by many names, or we use the same term to mean many different things. One example of the former is the scoring of student writing by a computer algorithm: automated essay scoring, artificial-intelligence scoring, automated essay grading and machine scoring are just a sample of the terms in use. Two of these terms pop up in a recent EdWeek Curriculum Matters blog post. The post is entitled “English Teachers Group Opposes Machine-Scored Writing,” and Catherine Gewertz concludes it with this assessment:
The viability of artificial-intelligence scoring on the common assessments is a powerful cost manager for the two groups of states that are designing tests for the common standards. If they decide that humans must score the essays, the expense of the tests soars. And cost is, of course, high on states’ radars as they weigh their continued participation in the two groups.
The blog title echoes the National Council of Teachers of English's phrase “machine scoring” from its position paper, “Machine Scoring Fails the Test.” “Machine scoring” sounds mechanical and mindless, which captures the NCTE view nicely, while Gewertz’s phrase “artificial intelligence scoring” suggests the process may in fact be intelligent and clever. It seems NCTE views automated essay scoring (AES) as very similar to computer estimations of readability (e.g., Lexile, Flesch-Kincaid), which look at only a couple of text features, one concerning vocabulary and one concerning sentence length. In fact, AES algorithms are much more comparable to Coh-Metrix (http://cohmetrix.com), the work on text cohesion done at the University of Memphis. Grounded in linguistics, Coh-Metrix estimates text cohesion by looking at over 80 features of text. Similarly based in linguistics, AES algorithms look at multiple text features.
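To make the contrast concrete, here is a minimal sketch of how little a readability formula looks at. It uses the published Flesch-Kincaid grade-level formula, which combines just two ratios: words per sentence and syllables per word. The function names and the vowel-group syllable counter are my own assumptions for illustration; real readability tools count syllables more carefully, so treat the numbers as rough.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels.

    This heuristic is an assumption for illustration only; it will
    miscount words with silent vowels (for example a trailing "e").
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level from two features of the text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)   # sentence-length feature
    syllables_per_word = syllables / len(words)        # vocabulary feature
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = ("The administration subsequently reconsidered its "
             "controversial educational assessment recommendations.")
    print(round(flesch_kincaid_grade(simple), 2))
    print(round(flesch_kincaid_grade(dense), 2))
```

Nothing in this calculation knows whether the sentences connect to one another or make an argument, which is the gap that cohesion measures and AES algorithms try to address with many more features.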
Gewertz’s post treats the question of whether the consortia will use AES as an open one, but everything I have read, including the test blueprints recently released by PARCC, indicates that student writing will be “hand scored,” another odd term, which means humans will read the writing and assign a score point based on a rubric. Now, if the consortia had chosen to use AES to score writing, I would not be wringing my hands, though I think the ideal use of AES is in conjunction with human readers. It is this use of AES that I would recommend to curriculum developers, principals and ELA supervisors (having been in all those roles myself), both as a way to manage the increased writing demands of CCSS implementation and as a way to assure better-quality scoring of student writing.
Consider first a summative purpose for a writing assignment, that is, an assessment of student proficiency. Here AES can help overcome some of the weaknesses associated with human scoring. When humans score only for a summative purpose, for instance essays written for a final exam, they score quickly and often focus on superficial features that may be proxies for quality. Using AES can generate a second score for each essay when it is not practical to ask a second teacher to rate the essays. Two data points are better than one, and a second score can help avoid many issues with teacher scoring. One that is well documented is drift: as a teacher scores a set of essays, the scoring tends to shift over time, so a paper at the end of a set gets a different score than it might have received at the beginning. Good teachers often go back and review the first couple of essays scored and compare them with the last couple scored to make sure their ratings have remained consistent.
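For readers who like to see the idea in numbers, here is a rough sketch of how a second set of scores could be used to check for drift. It assumes a teacher's scores are listed in the order the essays were scored, alongside AES scores for the same essays on the same rubric scale. The function names and the sample scores are hypothetical, and comparing the first half of the set with the second half is only one simple way to look for a shift, not an established procedure.

```python
from statistics import mean


def agreement_rate(human, machine, tolerance=0):
    """Share of essays where the two scores differ by at most `tolerance` points."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(abs(h - m) <= tolerance for h, m in zip(human, machine))
    return matches / len(human)


def drift_estimate(human, machine):
    """Change in the average (human - machine) gap from the first half
    of the scoring order to the second half.

    A value near zero suggests the teacher stayed consistent relative
    to the second score; a negative value suggests the teacher grew
    harsher as scoring went on, and a positive value more lenient.
    """
    if len(human) != len(machine) or len(human) < 2:
        raise ValueError("need at least two essays and equal-length lists")
    gaps = [h - m for h, m in zip(human, machine)]
    midpoint = len(gaps) // 2
    return mean(gaps[midpoint:]) - mean(gaps[:midpoint])


if __name__ == "__main__":
    # Hypothetical scores on a 1-4 rubric, in the order the teacher scored them.
    teacher = [4, 4, 3, 4, 3, 3, 2, 3]
    aes = [4, 4, 3, 4, 4, 4, 3, 4]
    print(agreement_rate(teacher, aes))               # exact agreement
    print(agreement_rate(teacher, aes, tolerance=1))  # within one point
    print(drift_estimate(teacher, aes))
```

A shift like the one in this made-up set would be a prompt to go back and reread the early and late essays, which is what good teachers already do by hand.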
In addition to providing a second view of essays, coupling AES scores with human ratings has other advantages, including supporting teachers in the humanities with their essay scoring. Allowing the AES to focus on what it does well allows the content-area teacher to focus on what she does best: evaluating content. These uses of AES do not replace the human rater; they provide an extra data point that helps produce more reliable scores.