Automated Grading and Counter Arguments-4 (last)

(Continued from Part-3)

Finding 9: machine scoring shows a bias against second-language writers (Chen & Cheng, 2008) and minority writers such as Hispanics and African Americans (Elliot, Deess, Rudniy, & Joshi, 2012)

Report-1: The best part about this report is the stance it takes. It immediately got me interested in it.

“Given the fact that many AWE programs have already been in use and involve multiple stakeholders, a blanket rejection of these products may not be a viable, practical stand.

A more pressing question, accordingly, is probably not whether AWE should be used but how this new technology can be used to achieve more desirable learning outcomes while avoiding potential harms that may result from limitations inherent in the technology.”

-This has been exactly my problem with these statements on the website.

The study primarily supports the idea of integrating human and machine assessments, by taking advantage of good things in both, as mentioned below:

“The AWE implementation was viewed comparatively more favorably when the program was used to facilitate students’ early drafting and revising process, and when the teacher made a policy of asking students to meet a preliminary required standard and subsequently provided human feedback. The integration of automated assessment and human assessment for formative learning offers three advantages….”

At least I did not find a direct mention about a “bias” against second language writers in this report! We need to stretch our imagination a bit to reach that conclusion!

Report-2: – The second report was already mentioned in Finding 7. Like before, I did not find a direct relevance of these results to this “finding”. However, I see the point in raising this issue. But, what I don’t understand is that this is just like some American coming and correcting Indian English 😛 So, this kind of “bias” can exist in humans as well. What really is a way to handle this, manually or automatically? This does not make the case favourable to human assessment (IMHO).

Finding 10: for all these reasons, machine scores predict future academic success abysmally (Mattern & Packman, 2009; Matzen & Hoyt, 2004; Ramineni & Williamson, 2013 – not freely accessible)

– I actually did not go through these references beyond their intro and conclusion sections as I felt that “finding” is too much of a blanket statement to be connected to these findings. Skimming through the two freely accessible reports among these confirmed my suspicion. These reports focus more on doing a critical analysis of automated systems and suggesting ways to improve them and combine them with some other things… and not say “machine scores predict future academic success abysmally”.

Part-2 of these findings were focused on the statement: “machine scoring does not measure, and therefore does not promote, authentic acts of writing”

While I intend to stop here (partly because it’s so time-consuming to go through so many reports and partly because of a growing feeling of irritation with these claims), some of these findings in part 2 made me think, some made me smile and some made me fire a question back…

Those that made me think:
“students who know that they are writing only for a machine may be tempted to turn their writing into a game, trying to fool the machine into producing a higher score, which is easily done ”
-This is something that recurs in my thoughts each time I think of automated assessment.. and I know it is not super-difficult to fool the machine in some cases.

“as a result, the machine grading of high-stakes writing assessments seriously degrades instruction in writing (Perelman, 2012a), since teachers have strong incentives to train students in the writing of long verbose prose, the memorization of lists of lengthy and rarely used words, the fabrication rather than the researching of supporting information, in short, to dumb down student writing.”
– This part is actually the main issue. Rather than just making blanket claims and pushing everything aside, such initiatives should understand the inevitability of automated assessment and focus on how to combine it better with human evaluation and other means!

Those that rather made me smile, although I can’t brush aside these things as impossible

“students are subjected to a high-stakes response to their writing by a device that, in fact, cannot read, as even testing firms admit.”

“in machine-scored testing, often students falsely assume that their writing samples will be read by humans with a human’s insightful understanding”

“teachers are coerced into teaching the writing traits that they know the machine will count .. .. and into not teaching the major traits of successful writing.. .. ”

– No I don’t have any specific comments on these. But, in parts, this is very imaginative…and in parts, it is not entirely impossible.

Those that gave new questions:
“conversely, students who knowingly write for a machine are placed in a bind since they cannot know what qualities of writing the machine will react to positively or negatively, the specific algorithms being closely guarded secrets of the testing firms (Frank, 1992; Rubin & O’Looney, 1990)—a bind made worse when their essay will be rated by both a human and a machine”
-as if we know what humans expect! I thought I wrote a very good essay on Sri Sri for my SSC exam’s question “my favourite poet”… and I got fewer marks (compared to my past performances) in Telugu, of all things. Apparently, teachers in school and those external examiners did not think alike! (or probably the examiner was a Sri Sri hater!)

“machines also cannot measure authentic audience awareness”
– Who can? Can humans do that with fellow humans? I don’t think so. I know there are people who think I am dumb. There are also those who think I am smart. There are also those who think I am mediocre. Who is right?

Conclusion of this series:

Although I did not do a real research-like reading and this is not some peer-reviewed article series, I spent some time doing this and it has been an enriching experience in terms of the insights it provided on the field of automated assessment and its criticisms.

What I learnt are two things:
* Automated assessment is necessary and cannot be avoided in the future (among several reasons, because of the sheer number of students compared to the number of trained evaluators)
* Overcoming the flaws of automated assessment and finding efficient ways to combine it with trained human evaluators is more important, realistic and challenging than just branding everything as rubbish.

Although the above two are rather obvious, these “findings” and the way such “findings” immediately grab media attention convinced me about the above two things more than before! 🙂

Published on May 4, 2013 at 12:40 pm

Automated Grading and Counter Arguments-3

(Continued from part-2. All parts can be seen here)

Finding 5:
machines require artificial essays finished within very short time frames (20-45 minutes) on topics of which student writers have no prior knowledge (Bridgeman, Trapani, & Yigal, 2012; Cindy, 2007; Jones, 2006; Perelman, 2012b; Streeter, Psotka, Laham, & MacCuish, 2002; Wang, & Brown, 2008; Wohlpart, Lindsey, & Rademacher, 2008)

The first, second, third, fourth and seventh references were not freely accessible.

Fifth Report (just search with the title in google): The primary conclusion of this report is: “The automated grading software performed as well as the better instructors in both trials, and well enough to be usefully applied to military instruction. The lower reliabilities observed in these essay sets reflect different instructors applying different criteria in grading these assignments. The statistical models that are created to do automated grading are also limited by the variability in the human grades.”

6th Report
This report studied the correlation between the machines and human raters and concluded, contrary to all previous studies, that there is no significant correlation between them.

While this was an interesting conclusion, I had a few questions which I could not clarify better as those references couldn’t be found online (for free). Here are the questions I have.

1. I think all these automated scoring systems work well within their framework. For example, GRE essay grading works for GRE-like essays but not for, say, scoring what I write in my blog. But they are all customizable to such target-specific responses. (The NYT article on EdX, which began this series of blog posts, makes that clear: to put it simply, it needs 100 sample essays and it will learn how to score essays from the 101st on.) So, is it fair to test Intellimetric, or whatever system, on essays that are about something else and from another domain?

2. As far as I remember, all those other reports used Pearson’s correlation. This report used Spearman’s correlation. Can we directly compare the numbers we get? Why can’t both be included?
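To make the second question concrete, here is a minimal sketch that computes both coefficients on the same scores. The score lists are made up for illustration and are not taken from any of the reports. Pearson's coefficient measures linear association on the raw values, while Spearman's is simply Pearson's coefficient computed on ranks, so the two numbers can differ on the same data and are not directly interchangeable.

```python
from statistics import mean


def pearson(x, y):
    """Pearson's r: linear association between the raw values."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human and machine scores for six essays (illustration only).
human = [1, 2, 3, 4, 5, 6]
machine = [1, 1, 2, 2, 3, 6]
print("Pearson: ", round(pearson(human, machine), 3))
print("Spearman:", round(spearman(human, machine), 3))
```

Reporting both side by side, as the question suggests, would cost a study almost nothing.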

-so far except the 4th one, the others seemed to be more about how well the automated scores approximate human judgements (looking at all the abstracts, which are freely available to read!!)

Finding 6: in these short trivial essays, mere length becomes a major determinant of score by both human and machine graders (Chodorow & Burstein, 2004; Perelman, 2012b)

First Report:
I actually found this report to be a very interesting read, for several reasons. Some of them are mentioned below:

“Since February 1999, ETS has used e-rater as one of the two initial readers for the GMAT writing assessments, and, in this capacity, it has scored more than 1 million essays. E-rater’s scores either match or are within one point of the human reader scores about 96% of the time”
-I thought this is an amazing performance (especially with such a high percentage even after grading 1 million essays or more!). I also like this part where they compare the machine-human agreement with human-human agreement, which is why I again think that no single entity should be the sole-scorer (there should be either 2 humans, 2 differently built machines or machine-human combinations).
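The two agreement measures in that quote are easy to state in code. This is a minimal sketch with made-up scores: "exact" agreement means the two readers gave the same score, and "adjacent" agreement (as I read the quote) means the scores match or are within one point of each other. The function name and the sample data are my own, not from the ETS report.

```python
def agreement_rates(scores_a, scores_b):
    """Return (exact, adjacent) agreement as fractions of all essays.

    exact:    both readers gave the same score
    adjacent: the scores are identical or differ by at most one point
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores_a)
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(scores_a, scores_b)) / n
    return exact, adjacent


# Hypothetical scores on a six-point scale (illustration only).
human_reader = [4, 3, 5, 2, 6, 4]
machine_reader = [4, 4, 3, 2, 5, 4]
exact, adjacent = agreement_rates(human_reader, machine_reader)
print(f"exact: {exact:.2f}, adjacent: {adjacent:.2f}")
```

The same function works for human-human pairs, which is what makes the comparison in the report possible.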

While I am not completely comfortable with an automated system being the sole scorer in the low-stakes high school exams etc, I have the same feeling about human evaluators too. I think bias can exist both in a human and in a machine (in a machine, because its intelligence is limited. It learns only what it sees). I guess at that level, you always have an option to go back and demand a re-evaluation?

Primary conclusion of this report was that: “In practical terms, e-rater01 differs from human readers by only a very small amount in exact agreement, and it is indistinguishable from human readers in adjacent agreement. But despite these similarities, human readers and e-rater are not the same. When length is removed, human readers share more variance than e-rater01 shares with HR.”
-I think this is what they used to come up with this finding. I do not know how to interpret this particular finding, although it disappoints me a bit. Yet, the ETS report made a very interesting read.
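One common way to "remove" a variable such as essay length from a correlation is a first-order partial correlation. The sketch below assumes that reading; I have not checked that the report used exactly this procedure, and the data is invented. It correlates two sets of scores after taking out the part of each that is linearly explained by length.

```python
from statistics import mean


def pearson(x, y):
    """Pearson's r between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(x, y, z):
    """Correlation of x and y with the linear effect of z removed.

    Uses the standard first-order formula
        r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    and is undefined when x or y is perfectly correlated with z.
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


# Hypothetical data: two readers' scores and essay length in words.
reader_1 = [2, 3, 3, 4, 5, 6]
reader_2 = [2, 2, 4, 4, 5, 5]
length = [150, 210, 260, 300, 380, 450]
print("raw correlation:    ", round(pearson(reader_1, reader_2), 3))
print("with length removed:", round(partial_correlation(reader_1, reader_2, length), 3))
```

If two readers agree mostly because both reward longer essays, the second number drops well below the first.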

A second conclusion that this “finding” does not focus on, but I found interesting, was the effect of the test taker’s native language on the overall scores. “There is a main effect for language in the data of the current study. The mixed and Spanish groups have higher scores on average than the Arabic and Japanese. These differences remain even when length is removed. For most of the prompts, e-rater01 shows the same pattern of differences across native language groups as HR, even with length differences partialed out. Future work should include additional language groups to see if these results generalize.”
– After the recent Native Language Identification shared task, I got really curious on this very topic – the interaction between the native language and the scores that the test takers get. I think this is something that I might need to study further, irrespective of the disappointments during these readings!

Again, only the first report (from ETS) is freely accessible. If their purpose was to actually convey their message to a “curious” commoner, I think they are repeatedly failing in this by providing such inaccessible references. :-(….

Finding 7: machines are not able to approximate human scores for essays that do fit real-world writing conditions; instead, machines fail badly in rating essays written in these situations (Bridgeman, Trapani, & Yigal, 2012; Cindy, 2007; Condon, 2013; Elliot, Deess, Rudniy, & Joshi, 2012; Jones, 2006; Perelman, 2012b; Powers, Burstein, Chodorow, Fowles, & Kukich, 2002; Streeter, Psotka, Laham, & MacCuish, 2002; Wang & Brown, 2008; Wohlpart, Lindsey, & Rademacher, 2008)

First report, second, third, fifth, sixth, seventh, tenth reports are not freely accessible (although looking at the abstracts, I did not get an impression that these reports are against automated scoring in any way!).

A version of 4th report is available. Even this did not give me an impression that it has anything against automated scoring as such.. although there is a critical analysis of where automated scoring fails and it ended with a call for better systems.

(8th and 9th reports are already discussed in Finding 5)

-Again, I understand that this is an issue … but I don’t understand how just avoiding automated scoring will solve it.

Finding 8: high correlations between human scores and machine scores reported by testing firms are achieved, in part, when the testing firms train the humans to read like the machine, for instance, by directing the humans to disregard the truth or accuracy of assertions (Perelman, 2012b), and by requiring both machines and humans to use scoring scales of extreme simplicity

– Like I said before, I could not find a freely accessible page for (Perelman, 2012b). It would have been great if some freely accessible report was provided as additional reading here… because this claim/finding is among the most intriguing ones. First, they say that the machines are bad. Now, they say even the humans are bad! Should we be evaluated at all or not? 😛

Conclusion for this part:
… And so, although it’s getting more and more disappointing… I decided to continue with these reports further because, even if the analysis and those claims are flawed according to me, I find those “findings” (or whatever they are) worth pondering about. There may not be a real solution. But, some of these issues need some thinking on the part of people who are capable of doing it better (e.g., teachers and researchers in education, researchers in automated scoring).

(to be continued..)

Published on April 27, 2013 at 1:31 pm

Automated grading and counter arguments-2

(Continued from part-1)

So, I started to read their research findings.
(It would have been nice if they also gave online links to those references. Nevertheless, it was not very difficult to google most of them).

Finding 1: computer algorithms cannot recognize the most important qualities of good writing, such as truthfulness, tone, complex organization, logical thinking, or ideas new and germane to the topic (Byrne, Tang, Truduc, & Tang, 2010)

(The report referred to here can be read here)
In this, after a brief discussion on the development of eGrader, an automatic grading system, they conclude with the following words, after deciding to not use it for automatic grading.
“The machine reader appears to penalize those students we want to nurture, those who think and write in original or different ways. For us, the subjective element which was as important as the objective aspects of the essays, proved too complex to measure.”

-It is indeed a matter of concern (this above statement) …. but, I wondered if there really was sufficient evidence (at least in this referred report) to extrapolate this observation with a small sample of 33 students to phrase the Finding 1.

10 students in this report came back to the human evaluators for a regrading and got better grades after that. But, won’t this sort of disagreement also happen if there are human evaluators instead of machines? Can we be sure that two humans always give the same score to a document without disagreement? What will we do when there is a disagreement between two human scores? Maybe we call a third evaluator? Isn’t that what they do with machines too now?

Also, I guess automated scoring, wherever it’s used in a high-stakes assessment scenario (like grading in an entrance exam or in some competitive proficiency test), is not used as the sole decider. It seems to be coupled with at least one human judge (both GRE and GMAT automatic assessment have at least one human judge, and a second one will be called if there is a disagreement between human and machine scores).

So, what I understood: “Finding 1” might still be true … but the document that is referred to there is not related to that statement. It’s more like a “null hypothesis” than a “finding”.

Finding 2. to measure important writing skills, machines use algorithms that are so reductive as to be absurd: sophistication of vocabulary is reduced to the average length or relative infrequency of words, or development of ideas is reduced to average sentences per paragraph (Perelman, 2012b; Quinlan, Higgins, & Wolff, 2009)

(I could not trace a freely accessible link to the first reference. Second one here).

It’s amazing that the same system that uses so many different measures to estimate various aspects of the language quality of an essay (see the last 4 pages of the pdf above) uses a relatively “surface” measure of lexical sophistication and development (No, I don’t have any better ways to measure it. Don’t ask me such questions!!). However, I think the “prompt specific vocabulary usage” description at the end actually handles the sophistication of vocabulary part to some extent. And there is always at least one human grader to identify things like “out of the box” thinking or novel word usages that are relevant and also competent. These automatic assessment systems don’t seem to be the sole decision makers anyway!
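For a sense of what such “surface” measures look like, here is a minimal sketch of two of the features named in Finding 2: average word length and average sentences per paragraph. It is my own toy version with crude regex-based splitting, not how e-rater or any real system computes them. The relative-infrequency-of-words measure is left out because it would need a word frequency list.

```python
import re


def surface_features(essay):
    """Compute two crude surface measures of an essay.

    Paragraphs are assumed to be separated by blank lines, sentences
    by runs of '.', '!' or '?', and words are runs of letters/apostrophes.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    words = re.findall(r"[A-Za-z']+", essay)
    sentences_per_paragraph = [
        len([s for s in re.split(r"[.!?]+", p) if s.strip()])
        for p in paragraphs
    ]
    return {
        "avg_word_length": (
            sum(len(w) for w in words) / len(words) if words else 0.0
        ),
        "avg_sentences_per_paragraph": (
            sum(sentences_per_paragraph) / len(paragraphs) if paragraphs else 0.0
        ),
    }


sample = "Cats sit. Dogs run.\n\nBirds fly."
print(surface_features(sample))
```

Seeing how little code this takes makes it clearer why critics call such measures reductive, and also why a real system combines many more of them.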

So, I don’t understand the issue, again. Further, I am amazed that so many other positive things from this second report were completely ignored and directly a “finding 2” was finalized by skipping all those!

Finding 3: machines over-emphasize grammatical and stylistic errors (Cheville, 2004) yet miss or misidentify such errors at intolerable rates (Herrington & Moran, 2012)

(The first reference does not seem to be freely available.)
-It was clear from the beginning that the authors of the second paper have a strong bias against automated scoring. It’s entirely acceptable .. everyone can have some opinion. But then, we can’t expect that paper to be objective, can we? At least I did not find it objective. I thought there would be some form of well-defined scientific study there. But, all they discussed was one single student essay, and they generalized that to the performance of the automated system as a whole (I wonder why it would not be possible to do it the other way round too: find an essay where it works and conclude equally confidently that the automated system is the best! :P). Further, I felt that an analysis of why the automated system showed those errors was not performed. A major criticism about spelling was that the machine identified words like “texting” and “i.e.” as spelling errors. But, if “i.e.” was supposed to be written as “i.e.,” and “texting” is not a word in, say, an English dictionary, I think this should be expected. In fact, I would guess that a very conservative human evaluator too might point these things out. So, based on this single student essay analysis (some of which is debatable), it’s concluded that the tool should be banished from classrooms… (this is where I start getting disillusioned… should I really continue these readings?? They seem so biased and the analysis seems more like over-generalization than science anyway!)

The only interesting point in this report was about the machine’s bias towards standard American English. I am curious to know more on this aspect. Actually, I did find the starting premise (that the machines overemphasize stylistic errors) interesting… but this report did not live up to its own promise in terms of the quality of analysis provided.

Finding 4: machines cannot score writing tasks long and complex enough to represent levels of writing proficiency or performance acceptable in school, college, or the workplace (Bennett, 2006; Condon, 2013; McCurry, 2010; Perelman, 2012a)

Report 1 – Bennett, 2006: There are two major claims in this. 1) “Larger differences in computer familiarity between students with the same paper writing proficiency would be associated with correspondingly bigger discrepancies in computer writing scores” and 2) “The weighting of text features derived by an automated scoring system may not be the same as the one that would result from the judgments of writing expert”
-Actually, although 1) seems rather obvious (as we need to “type” essays on a computer), there is no real solution that this report proposes. 2) Of course, the weighting would be different between humans and machines. Machines don’t learn like humans and humans don’t learn like machines! But, when there is not so much discrepancy between their scores… and so long as the users are satisfied, I guess this is something that we can live with. Anyway, the paper did not suggest any better alternative.

Condon, 2013 and McCurry, 2010 are not freely accessible.

Report 4 – Perelman, 2012a: The conclusion of this report is that the way automated essay scoring systems evaluate the language constructs is not the way actual writing teachers do it. Although this is an important point to be addressed, the other person in me is always ready to say: what we learn and what the machine learns need not always be the same. Our routes to reach the same conclusion might really not cross at all!

Concluding this part:
In a sense the whole thing is really amazing. On one hand, they talk about the shortage of human evaluators to grade student scripts. On the other hand, they want a ban on automated assessment. I wonder what exactly their point is, even after reading so many reports! I don’t understand their solution, if there is any, yet.

The other claims and their associated reports might help (I hope!)

(To be continued)

Published on April 13, 2013 at 11:11 pm

Automated grading and Counter arguments-1

Take 1: I read this NYT article about EdX’s announcement that it will release its automatic grading software “free on the web, to any institution that wants to use it”. (Article can be read here)

I particularly liked this part of the statement:

“The EdX assessment tool requires human teachers, or graders, to first grade 100 essays or essay questions. The system then uses a variety of machine-learning techniques to train itself to be able to grade any number of essays or answers automatically and almost instantaneously.”
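The workflow in that quote (humans grade a first batch, the system learns from it, then grades the rest) can be illustrated with a deliberately tiny sketch. Everything here is an assumption for illustration: the article does not say which features or learning techniques EdX uses, so I swap in a single feature (word count) and ordinary least squares, and the function names are my own. A real system would use many features and far better models.

```python
def word_count(essay):
    """The only feature in this toy model: number of whitespace-separated words."""
    return len(essay.split())


def fit_grader(graded_essays):
    """Learn grade = intercept + slope * word_count from human-graded essays.

    graded_essays is a list of (essay_text, human_grade) pairs, standing in
    for the batch of essays that teachers grade first. The training essays
    must not all have the same length, or the slope is undefined.
    """
    xs = [word_count(text) for text, _ in graded_essays]
    ys = [grade for _, grade in graded_essays]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x

    def grade(essay):
        # Apply the learned line to a new, ungraded essay.
        return intercept + slope * word_count(essay)

    return grade


# Hypothetical training batch: three "essays" of 100, 200 and 300 words.
training = [("word " * 100, 2), ("word " * 200, 4), ("word " * 300, 6)]
grader = fit_grader(training)
print(grader("word " * 250))
```

A model this simple rewards nothing but length, which is exactly the failure mode the critics worry about, so it also shows why the choice of features matters so much.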

Take 2: There is an organization opposing this. According to the news report, “The group, which calls itself Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment, has collected nearly 2,000 signatures, including some from luminaries like Noam Chomsky.”

Given my recent interest on “evaluation of evaluation”, this particular statement from one of the people from this group caught my attention:

“My first and greatest objection to the research is that they did not have any valid statistical test comparing the software directly to human graders,” said Mr. Perelman, a retired director of writing and a current researcher at M.I.T.”

Take 3: I ended up navigating through the pages of their website and read their reports and conclusions, purely because of the above statement.

Take 4: This post comes out… and seems to become a couple of posts soon.
So, the group’s major claim is:
“We call for schools, colleges, and educational assessment programs to stop using computer scoring of student essays written during high-stakes tests.”

As someone not doing anything directly with automated scoring but having academic interest in it owing to its proximity to what I do, I was naturally curious after seeing such a strong statement.

At this point, I have to state what I think about it. I think automated scoring is a nice complementary system to have, along with human evaluators. This is also why I like the GRE/GMAT-AWA section style scoring model. For example, here is what they say on the ETS website, about GRE essay scoring:

“For the Analytical Writing section, each essay receives a score from at least one trained reader, using a six-point holistic scale. In holistic scoring, readers are trained to assign scores on the basis of the overall quality of an essay in response to the assigned task. The essay score is then reviewed by e-rater, a computerized program developed by ETS, which is used to monitor the human reader. If the e-rater evaluation and the human score agree, the human score is used as the final score. If they disagree by a certain amount, a second human score is obtained, and the final score is the average of the two human scores.”
(Link with more detailed explanation here)
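The procedure described in the ETS quote can be sketched in a few lines. This is only my reading of the quoted paragraph: ETS says a second human is called when the scores “disagree by a certain amount” but does not give that amount here, so `max_gap` is a hypothetical parameter, and the function and argument names are my own.

```python
def final_essay_score(human_score, erater_score, get_second_human_score, max_gap=1.0):
    """Combine one human score with an e-rater check, as in the ETS description.

    human_score:            score from the first trained reader
    erater_score:           e-rater's evaluation, used only to monitor the human
    get_second_human_score: callable that fetches a second human score on demand
    max_gap:                hypothetical threshold; the real "certain amount"
                            is not stated in the quoted text
    """
    if abs(human_score - erater_score) <= max_gap:
        # Human and machine agree closely enough: the human score stands.
        return human_score
    # They disagree: a second human reads the essay and the two human
    # scores are averaged. The machine score never enters the final score.
    second_human_score = get_second_human_score()
    return (human_score + second_human_score) / 2


# Illustrative calls with made-up scores on the six-point scale.
print(final_essay_score(4, 4.5, lambda: 3))  # agreement: first human score is kept
print(final_essay_score(4, 2, lambda: 5))    # disagreement: average of two humans
```

Note that in this model the machine acts purely as a check on the human reader, which is the complementary role I find appealing.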

Also, in the context of MOOCs and the sheer number of students that enrol in them, perhaps it’s a worthwhile idea to explore ways of evaluating them better. Surely, when you are offering courses for free or minimal charges and you have thousands of students, you cannot afford to manually grade each and every student test script. I do like the idea of peer-reviewed essay grading too, though.

Coming back to the topic, the homepage continued:

Independent and industry studies show that by its nature computerized essay rating is

* trivial, rating essays only on surface features such as word size, topic vocabulary, and essay length
* reductive, handling extended prose written only at a grade-school level
* inaccurate, missing much error in student writing and finding much error where it does not exist
* undiagnostic, correlating hardly at all with subsequent writing performance
* unfair, discriminating against minority groups and second-language writers
* secretive, with testing companies blocking independent research into their products

-It is here, that I began feeling … “not true… not true…something is missing”… reasons? While for some of them, I would need a more detailed reading, I was surprised to see some of the other points above:

trivial: I did spend some time in the past few months reading published (and peer-reviewed) research on these kinds of systems (e-rater, for example) and at least I feel that it’s not really “trivial”. We can always argue –
a) “this is not how a human mind does it”
b) “there is much more than what you do now”.
(I am reminded of the Norvig-Chomsky debate as I write these above two lines!)
But, IMHO, we still cannot call the current state-of-the-art “trivial”. If it is so trivial, why is it that so many researchers still spend a major part of their working hours on handling this problem?

unfair: Even if I start believing that it’s true, I don’t understand how we can be so sure that a human evaluator won’t do this too.

secretive: On this part, I partly agree. But, these days, there are so many competitions on automated assessment (eg: 2012 automated essay scoring competition by the Hewlett Foundation, the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge at SemEval 2013, Question Answering for Machine Reading Evaluation at CLEF 2013) and people from the industry also participate in these competitions, as far as I noticed. So, although one might still not be able to see the actual working of the respective company products or their actual student texts (hello, they are companies and like many other companies, they too have some proprietary stuff!), these competitions actually provide a scope for open research, fair play and also a scope to explore various dimensions of automated essay scoring. After just browsing through some of these things, I really can’t call things trivial… secretive, they may be… stupid, they certainly are not!

So… I ended up reading their “research findings” as well. As I started reading some of the references, I understood the power of selective reporting, again! By selectively reporting what we choose to report, we can always turn everything in our favor… and this realization perhaps is making me write these posts. 🙂


PS 1: What qualification do you have? someone might ask me. I have the necessary background to understand research documents on this topic. I did some hobby experiments with this sort of stuff, with a freely accessible exam dataset and have an idea of what works and why it works when it works. I never worked on any real automated scoring system. I have no more interest than this on this topic, at least as of now.

Published on April 6, 2013 at 8:38 pm