(Continued from part-1)
So, I started to read their research findings.
(It would have been nice if they also gave online links to those references. Nevertheless, it was not very difficult to google most of them).
Finding 1: computer algorithms cannot recognize the most important qualities of good writing, such as truthfulness, tone, complex organization, logical thinking, or ideas new and germane to the topic (Byrne, Tang, Truduc, & Tang, 2010)
(The report referred to here can be read here.)
In this report, after a brief discussion of the development of eGrader, an automatic grading system, the authors explain their decision not to use it for automatic grading and conclude with the following words:
“The machine reader appears to penalize those students we want to nurture, those who think and write in original or different ways. For us, the subjective element which was as important as the objective aspects of the essays, proved too complex to measure.”
-The statement above is indeed a matter of concern… but I wondered whether there really was sufficient evidence (at least in this report) to extrapolate from an observation on a small sample of 33 students to Finding 1.
Ten students in this report went back to the human evaluators for a regrading and got better grades afterwards. But won't this sort of disagreement also happen with human evaluators instead of machines? Can we be sure that two humans will always give the same score to a document? What do we do when two human scores disagree? Maybe we call a third evaluator? Isn't that what they do with machines too now?
Also, I guess automated scoring, wherever it's used in a high-stakes assessment scenario (like grading in an entrance exam or some competitive proficiency test), is not used as the sole decider. It seems to be coupled with at least one human judge (both GRE and GMAT automatic assessment have at least one human judge, and a second one is called if there is a disagreement between the human and machine scores).
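The human-plus-machine protocol described above can be sketched in a few lines. This is a hypothetical illustration, not the actual GRE/GMAT procedure; the function name, threshold, and averaging rule are all assumptions:

```python
# Hypothetical sketch of a human+machine adjudication scheme:
# one human score and one machine score; if they disagree by more
# than a threshold, a second human rater resolves the disagreement.
def adjudicate(human_score: float, machine_score: float,
               second_human: float = None, threshold: float = 1.0) -> float:
    """Return a final score under a simple human+machine protocol."""
    if abs(human_score - machine_score) <= threshold:
        # Scores agree closely enough: average the two.
        return (human_score + machine_score) / 2
    if second_human is None:
        raise ValueError("Disagreement: a second human rater is needed")
    # On disagreement, the machine score is dropped and the two
    # human scores are averaged instead.
    return (human_score + second_human) / 2
```

The point of the sketch is simply that the machine is never the sole decider: any human-machine disagreement is escalated to another human, just as a human-human disagreement would be.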
So, what I understood: "Finding 1" might still be true… but the document cited there does not really support that statement. It's more like a "null hypothesis" than a "finding".
Finding 2: to measure important writing skills, machines use algorithms that are so reductive as to be absurd: sophistication of vocabulary is reduced to the average length or relative infrequency of words, or development of ideas is reduced to average sentences per paragraph (Perelman, 2012b; Quinlan, Higgins, & Wolff, 2009)
(I could not trace a freely accessible link to the first reference; the second one is here.)
It's amazing that the same system that uses so many different measures to estimate various aspects of an essay's language quality (see the last four pages of the PDF above) relies on such relatively "surface" measures of lexical sophistication and development. (No, I don't have any better way to measure them. Don't ask me such questions!!) However, I think the "prompt-specific vocabulary usage" description at the end actually handles the sophistication-of-vocabulary part to some extent. And there is always at least one human grader to identify things like "out of the box" thinking or novel word usages that are relevant and competent. These automatic assessment systems don't seem to be the sole decision makers anyway!
So, I don’t understand the issue, again. Further, I am amazed that so many other positive things from this second report were completely ignored and directly a “finding 2” was finalized by skipping all those!
Finding 3: machines over-emphasize grammatical and stylistic errors (Cheville, 2004) yet miss or misidentify such errors at intolerable rates (Herrington & Moran, 2012)
(The first reference does not seem to be freely available.)
-It was clear from the beginning that the authors of the second paper have a strong bias against automated scoring. That is entirely acceptable… everyone can have an opinion. But then we can't expect the paper to be objective, can we? At least I did not find it objective. I thought there would be some form of well-defined scientific study; instead, all they discussed was one single student essay, and they generalized from it to the performance of the automated system as a whole. (I wonder why it is not equally possible to do the reverse: find an essay where the system works and conclude just as confidently that automated assessment is the best! :P)

Further, I felt no analysis was performed of why the automated system flagged those errors. A major criticism about spelling was that the machine identified words like "texting" and "i.e." as spelling errors. But if "i.e." was supposed to be written as "i.e.," and "texting" is not a word in, say, an English dictionary, I think this should be expected. In fact, I would guess that a very conservative human evaluator might point these things out too. So, based on this single-essay analysis (some of which is debatable), it is concluded that the tool should be banished from classrooms… (This is where I start getting disillusioned. Should I really continue these readings?? They seem so biased, and the analysis looks more like over-generalization than science anyway!)
The only interesting point in this report was about the machine's bias towards standard American English; I am curious to know more about this aspect. Actually, I did find the starting premise (that machines over-emphasize stylistic errors) interesting… but the report did not live up to its own promise in terms of the quality of analysis provided.
Finding 4: machines cannot score writing tasks long and complex enough to represent levels of writing proficiency or performance acceptable in school, college, or the workplace (Bennett, 2006; Condon, 2013; McCurry, 2010; Perelman, 2012a)
Report 1 – Bennett, 2006: There are two major claims in this report: 1) "Larger differences in computer familiarity between students with the same paper writing proficiency would be associated with correspondingly bigger discrepancies in computer writing scores" and 2) "The weighting of text features derived by an automated scoring system may not be the same as the one that would result from the judgments of writing expert".
-Actually, although 1) seems rather obvious (since students need to type their essays on a computer), the report proposes no real solution. As for 2), of course the weightings differ between humans and machines. Machines don't learn like humans, and humans don't learn like machines! But as long as there is not much discrepancy between their scores, and as long as the users are satisfied, I guess this is something we can live with. In any case, the paper did not suggest a better alternative.
Reports 2 and 3 (Condon, 2013; McCurry, 2010) are not freely accessible.
Report 4 – Perelman, 2012a: The conclusion of this report is that automated essay scoring systems do not evaluate language constructs the way actual writing teachers do. Although this is an important point to address, the other person in me is always ready to say: what we learn and what the machine learns need not be the same. Our routes to the same conclusion might not cross at all!
Concluding this part:
In a sense, the whole thing is really amazing. On one hand, they talk about a shortage of human evaluators to grade student scripts. On the other hand, they want a ban on automated assessment. Even after reading so many reports, I wonder what exactly the point of humanreaders.org is! I don't yet understand their solution, if there is one.
The other claims and their associated reports might help (I hope!)
(To be continued)