Automated grading and Counter arguments-1

Take 1: I read this NYT article about EdX’s announcement that it will release its automatic grading software “free on the web, to any institution that wants to use it”. (The article can be read here.)

I particularly liked this part of the statement:

“The EdX assessment tool requires human teachers, or graders, to first grade 100 essays or essay questions. The system then uses a variety of machine-learning techniques to train itself to be able to grade any number of essays or answers automatically and almost instantaneously.”

Take 2: There is an organization that opposes this. According to the news report, “The group, which calls itself Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment, has collected nearly 2,000 signatures, including some from luminaries like Noam Chomsky.”

Given my recent interest in “evaluation of evaluation”, this particular statement from one of the people in this group caught my attention:

“My first and greatest objection to the research is that they did not have any valid statistical test comparing the software directly to human graders,” said Mr. Perelman, a retired director of writing and a current researcher at M.I.T.

Take 3: I ended up navigating through the pages of the group’s website and reading their reports and conclusions, purely because of the above statement.

Take 4: This post comes out… and it looks like it will become a couple of posts soon.
So, the group’s major claim is:
“We call for schools, colleges, and educational assessment programs to stop using computer scoring of student essays written during high-stakes tests.”

As someone not doing anything directly with automated scoring but having academic interest in it owing to its proximity to what I do, I was naturally curious after seeing such a strong statement.

At this point, I have to state what I think about it. I think automated scoring is a nice complementary system to have, along with human evaluators. This is also why I like the GRE/GMAT-AWA section style scoring model. For example, here is what they say on the ETS website, about GRE essay scoring:

“For the Analytical Writing section, each essay receives a score from at least one trained reader, using a six-point holistic scale. In holistic scoring, readers are trained to assign scores on the basis of the overall quality of an essay in response to the assigned task. The essay score is then reviewed by e-rater, a computerized program developed by ETS, which is used to monitor the human reader. If the e-rater evaluation and the human score agree, the human score is used as the final score. If they disagree by a certain amount, a second human score is obtained, and the final score is the average of the two human scores.”
(Link with more detailed explanation here)
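To make the rule in that quote concrete, here is a minimal sketch of how I read it. This is only my illustration, not ETS's actual implementation: the function name, the `max_gap` threshold and its default value are assumptions of mine, since the quote only says "a certain amount".

```python
from typing import Callable


def final_essay_score(
    human_score: float,
    machine_score: float,
    get_second_human_score: Callable[[], float],
    max_gap: float = 1.0,  # hypothetical threshold; the quote does not state the real one
) -> float:
    """Combine one human score with a machine check, as described in the quote."""
    # The machine score only monitors the human reader; it never enters the final score.
    if abs(human_score - machine_score) <= max_gap:
        return human_score
    # On a large disagreement, ask a second human and average the two human scores.
    second_human_score = get_second_human_score()
    return (human_score + second_human_score) / 2


# Human and machine roughly agree: the human score stands.
print(final_essay_score(4, 4.5, lambda: 5))
# Human and machine disagree: a second reader is consulted and the human scores are averaged.
print(final_essay_score(4, 2, lambda: 5))
```

The point worth noticing is that, in this reading, the machine never contributes to the final number. It only decides whether a second human needs to look at the essay.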

Also, in the context of MOOCs and the sheer number of students that enrol in them, perhaps it’s a worthwhile idea to explore ways of evaluating them better. Surely, when you are offering courses for free or for minimal charges and you have thousands of students, you cannot afford to manually grade each and every student test script. I do like the idea of peer-reviewed essay grading too, though.

Coming back to the topic, the homepage continued:

Independent and industry studies show that by its nature computerized essay rating is

* trivial, rating essays only on surface features such as word size, topic vocabulary, and essay length
* reductive, handling extended prose written only at a grade-school level
* inaccurate, missing much error in student writing and finding much error where it does not exist
* undiagnostic, correlating hardly at all with subsequent writing performance
* unfair, discriminating against minority groups and second-language writers
* secretive, with testing companies blocking independent research into their products

It is here that I began feeling… “not true… not true… something is missing”. The reasons? While some of these points would need a more detailed reading, I was surprised to see some of the others:

trivial: I did spend some time in the past few months reading published (and peer-reviewed) research on these kinds of systems (e-rater, for example), and at least I feel that it’s not really “trivial”. We can always argue:
a) “this is not how a human mind does it”
b) “there is much more than what you do now”.
(I am reminded of the Norvig-Chomsky debate as I write the above two lines!)
But, IMHO, we still cannot call the current state of the art “trivial”. If it is so trivial, why is it that so many researchers still spend a major part of their working hours on this problem?

unfair: Even if I start believing that it’s true, I don’t understand how we can be so sure that a human evaluator won’t do this too.

secretive: On this part, I partly agree. But these days there are so many competitions on automated assessment (e.g., the 2012 automated essay scoring competition by the Hewlett Foundation, the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge at SemEval 2013, and Question Answering for Machine Reading Evaluation at CLEF 2013), and people from the industry also participate in them, as far as I noticed. So, although one might still not be able to see the actual working of the respective company products or their actual student texts (hello, they are companies and, like many other companies, they too have some proprietary stuff!), these competitions provide scope for open research and fair play, and for exploring various dimensions of automated essay scoring. After just browsing through some of these things, I really can’t call things trivial… secretive, they may be… stupid, they certainly are not!

So… I ended up reading their “research findings” as well. As I started reading some of the references, I understood the power of selective reporting, again! By selectively reporting what we choose to report, we can always turn everything in our favor… and this realization perhaps is making me write these posts. 🙂


PS 1: “What qualifications do you have?” someone might ask me. I have the necessary background to understand research documents on this topic. I did some hobby experiments with this sort of stuff, using a freely accessible exam dataset, and have an idea of what works and why it works when it works. I have never worked on any real automated scoring system. I have no more interest than this in the topic, at least as of now.

Published on April 6, 2013 at 8:38 pm. Comments (8)

8 Comments

  1. kindly provide link for download

    • Link for what?

  2. Sowmya, I don’t have any qualifications 🙂 But I dare to express my opinion.
    First, It is very interesting to see your point of view. The topic itself is interesting one to debate endlessly on.
    Forgive me if I seem repetitive. but I’m reminded of Zen and the art of motorcycle maintenance where he has to ‘define’ ‘quality’ and that starts off a roller coaster ride for him. Mind you, he was a strong believer in reason and analytical thinking. He bends over his head to see ‘reason’ and explain it logically. Then he has a flash of some sorts that sets his creativity and that of his students in turn flowing.( Then he gets carried away and falls over the edge of sanity. He is brought back to ‘sanity’ by interventions. Then he struggles to keep past away from himself and finally makes peace with it as part of his whole self. At least that’s how I understood that. That’s besides the point now).
    I could read your post without being judgmental because you seemed not judgmental yourself, though strong in your opinion. Then of course you did not remove the need for ‘human’. I am glad I read through. I liked the ‘unfair’ part of your opinion on the word ‘trivial’. I should read it again.

    • Lalitha,

      Thank you for reading! When I write such posts, I know what people will be thinking (why can’t she post somewhere else? what’s the point? etc.)… but I just write them to clear my own thoughts. So, comments are pleasant surprises in such cases 🙂

      Well, I don’t know if these automatic systems can actually replace humans. I (with my limited knowledge of this topic) don’t think it’s possible, at least for some time to come. I understood from a few readings that there are certain weak points in the way a machine evaluates human performance… which make it impossible to rely on its decision alone. (And I also understand that this humanreaders website focuses completely on these weak points and forgets that a whole world exists beyond them.) So, despite being a supporter of research into this topic, I cannot remove the need for a professional human evaluator.

      Thanks again 🙂

  3. I mean, read your post again.

  4. As a teacher, I view intra course assessment as a process of evaluation for suggestion and encouragement , the terminal assessment is for overall grading.
    Removing the human touch may make it unbiased , but why should the human interface be removed in the first place?

    • C V R Mohan: The question of “why”, I think, is not something that I can answer. It will pose a question within a question within a question and eventually end up in the ultimate question – “why machines?”.

      Also, in the case of MOOCs, there are sometimes hundreds of thousands of people taking a course. Assuming that one course has one professor and a bunch of TAs… and also keeping in mind that in many cases these courses are still free (so it won’t count as a high-stakes assessment) – some form of automated evaluation might be necessary in some cases.

      And in the case of high-stakes assessments, people are not talking about removing the human touch completely (at least not at this moment). No one claims that removing the human touch makes it unbiased. (Actually, the group claims that the machine is biased against second-language writers.)

      More than anything else, in high-stakes exams like the GRE… I don’t think the organizations would be so foolish as to continue if they believed that automated grading is not efficient. Even here (in case you did not read the article fully), there is always at least one human evaluator.

  5. […] from part-1) […]
