Finding 5: machines require artificial essays finished within very short time frames (20-45 minutes) on topics of which student writers have no prior knowledge (Bridgeman, Trapani, & Yigal, 2012; Cindy, 2007; Jones, 2006; Perelman, 2012b; Streeter, Psotka, Laham, & MacCuish, 2002; Wang & Brown, 2008; Wohlpart, Lindsey, & Rademacher, 2008)
Fifth report (just search for the title on Google): The primary conclusion of this report is: “The automated grading software performed as well as the better instructors in both trials, and well enough to be usefully applied to military instruction. The lower reliabilities observed in these essay sets reflect different instructors applying different criteria in grading these assignments. The statistical models that are created to do automated grading are also limited by the variability in the human grades.”
This report studied the correlation between the machines and human raters and concluded, contrary to all previous studies, that there is no significant correlation between them.
While this was an interesting conclusion, I had a few questions that I could not resolve, as those references couldn’t be found online (for free). Here are my questions.
1. I think all these automated scoring systems work well within their own framework. For example, GRE exam grading typically works for GRE-like essays, but not for, say, scoring what I write on my blog. But they are all customizable to such target-specific responses (the NYT article on EdX, which started this series of blog posts, makes that clear: put simply, it needs 100 sample essays, and it will learn how to score essays from the 101st on). So, is it fair to test Intellimetric, or whatever system, on essays that are about something else and come from another domain?
2. As far as I remember, all those other reports used Pearson’s correlation. This report used Spearman’s correlation. Can we compare the numbers directly? Why can’t both be included?
- So far, except for the 4th one, the other reports seem to be more about how well the automated scores approximate human judgements (going by the abstracts, which are freely available to read!)
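To make the second question concrete: Pearson's correlation measures linear association on the raw scores, while Spearman's is Pearson's correlation computed on the ranks, so the two numbers need not match even on the same data. Here is a minimal Python sketch with made-up scores (the `human` and `machine` lists and the helper names are my own assumptions, not data or code from any of the reports).

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r: linear association between the raw values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores: the machine orders the essays exactly like the
# human does, but its scores are not a linear function of the human's.
human = [1, 2, 3, 4, 5, 6]
machine = [1.0, 1.1, 1.3, 1.8, 3.0, 6.0]

r = pearson(human, machine)
rho = spearman(human, machine)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

On data like this, Spearman's rho is perfect because the ordering agrees, while Pearson's r is lower because the relationship is not linear. So a Spearman figure from one report and a Pearson figure from another are not directly comparable unless both are computed on the same data.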
Finding 6: in these short trivial essays, mere length becomes a major determinant of score by both human and machine graders (Chodorow & Burstein, 2004; Perelman, 2012b)
I actually found this report to be a very interesting read, for several reasons. Some of them are mentioned below:
“Since February 1999, ETS has used e-rater as one of the two initial readers for the GMAT writing assessments, and, in this capacity, it has scored more than 1 million essays. E-rater’s scores either match or are within one point of the human reader scores about 96% of the time”
- I think this is amazing performance (especially with such a high percentage even after grading 1 million essays or more!). I also like the part where they compare machine-human agreement with human-human agreement, which is why I again think that no single entity should be the sole scorer (there should be either 2 humans, 2 differently built machines or machine-human combinations).
While I am not completely comfortable with an automated system being the sole scorer in low-stakes high school exams etc., I have the same feeling about human evaluators too. I think bias can exist in both a human and a machine (in a machine, because its intelligence is limited: it learns only what it sees). I guess at that level, you always have the option to go back and demand a re-evaluation?
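The "match or are within one point" figure in the quote is what is usually called exact plus adjacent agreement. A minimal sketch of how such rates could be computed, assuming integer scores on the same scale (the function name and the score lists are hypothetical, not taken from the ETS report):

```python
def agreement(scores_a, scores_b):
    """Return (exact, adjacent) agreement rates for two raters.

    exact    = share of essays where both scores are identical
    adjacent = share of essays where the scores differ by at most 1
               (so exact matches are counted here as well)
    """
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

# Made-up scores for ten essays on a 1-6 scale.
human = [4, 3, 5, 2, 6, 4, 3, 5, 1, 4]
machine = [4, 4, 5, 2, 5, 4, 1, 5, 2, 3]

exact, adjacent = agreement(human, machine)
print(f"exact = {exact:.0%}, within one point = {adjacent:.0%}")
```

The same function works for a human-human pair, which is what makes the machine-human versus human-human comparison possible.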
The primary conclusion of this report was: “In practical terms, e-rater01 differs from human readers by only a very small amount in exact agreement, and it is indistinguishable from human readers in adjacent agreement. But despite these similarities, human readers and e-rater are not the same. When length is removed, human readers share more variance than e-rater01 shares with HR.”
- I think this is what humanreaders.org used to arrive at this finding. I do not know how to interpret this particular result, although it disappoints me a bit. Still, the ETS report made for very interesting reading.
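One standard way to "remove" length from a correlation is a first-order partial correlation: correlate the two sets of scores after accounting for how much each of them correlates with essay length. The sketch below assumes that formula; I cannot confirm it is exactly the procedure used in the report, and all names and numbers here are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """Correlation of x and y with z partialed out (first-order)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data: essay lengths in words, and the scores that a
# human reader and a machine gave to the same eight essays.
length = [120, 300, 180, 410, 250, 520, 340, 460]
human = [2, 3, 3, 5, 3, 6, 4, 5]
machine = [2, 4, 2, 5, 3, 6, 4, 6]

raw = pearson(human, machine)
partial = partial_corr(human, machine, length)
print(f"raw r = {raw:.3f}, with length partialed out = {partial:.3f}")
```

If both raters lean heavily on length, the raw correlation can look high while the partial correlation drops. Squaring a correlation gives the shared variance, which is the quantity the quote talks about.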
A second conclusion that this “finding” does not focus on, but which I found interesting, was the effect of the test taker’s native language on the overall scores. “There is a main effect for language in the data of the current study. The mixed and Spanish groups have higher scores on average than the Arabic and Japanese. These differences remain even when length is removed. For most of the prompts, e-rater01 shows the same pattern of differences across native language groups as HR, even with length differences partialed out. Future work should include additional language groups to see if these results generalize.”
– After the recent Native Language Identification shared task, I got really curious about this very topic: the interaction between the native language and the scores that the test takers get. I think this is something that I might need to study further, irrespective of the disappointments during these readings!
Again, only the first report (from ETS) is freely accessible. If the purpose of humanreaders.org was to actually convey their message to a “curious” commoner, I think they are repeatedly failing at this by providing such inaccessible references. :-(
Finding 7: machines are not able to approximate human scores for essays that do fit real-world writing conditions; instead, machines fail badly in rating essays written in these situations (Bridgeman, Trapani, & Yigal, 2012; Cindy, 2007; Condon, 2013; Elliot, Deess, Rudniy, & Joshi, 2012; Jones, 2006; Perelman, 2012b; Powers, Burstein, Chodorow, Fowles, & Kukich, 2002; Streeter, Psotka, Laham, & MacCuish, 2002; Wang & Brown, 2008; Wohlpart, Lindsey, & Rademacher, 2008)
The first, second, third, fifth, sixth, seventh and tenth reports are not freely accessible (although, looking at the abstracts, I did not get the impression that these reports are against automated scoring in any way!).
A version of the 4th report is available. Even this did not give me the impression that it has anything against automated scoring as such, although it contains a critical analysis of where automated scoring fails and ends with a call for better systems.
(8th and 9th reports are already discussed in Finding 5)
- Again, I understand that this is an issue, but I don’t understand how just avoiding automated scoring will solve it.
Finding 8: high correlations between human scores and machine scores reported by testing firms are achieved, in part, when the testing firms train the humans to read like the machine, for instance, by directing the humans to disregard the truth or accuracy of assertions (Perelman, 2012b), and by requiring both machines and humans to use scoring scales of extreme simplicity
– Like I said before, I could not find a freely accessible page for (Perelman, 2012b). It would have been great if some freely accessible report had been provided as additional reading here, because this claim/finding is among the most intriguing ones. First, they say that the machines are bad. Now, they say even the humans are bad! Should we be evaluated at all or not? 😛
Conclusion for this part:
… And so, although it’s getting more and more disappointing, I decided to continue with these reports because, even if the analysis and the claims (of humanreaders.org) are flawed in my view, I find those “findings” (or whatever they are) worth pondering. There may not be a real solution. But some of these issues need some thinking from people who are capable of doing it better (e.g., teachers and researchers in education, researchers in automated scoring).
(to be continued...)