Take 1: I read this NYT article about EdX’s announcement that it will release its automatic grading software “free on the web, to any institution that wants to use it”.(Article can be read here)
I particularly also liked this part of the statement:
“The EdX assessment tool requires human teachers, or graders, to first grade 100 essays or essay questions. The system then uses a variety of machine-learning techniques to train itself to be able to grade any number of essays or answers automatically and almost instantaneously.”
Take 2: There is this organization called humanreaders.org. According to the news report, “The group, which calls itself Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment, has collected nearly 2,000 signatures, including some from luminaries like Noam Chomsky.”
Given my recent interest on “evaluation of evaluation”, this particular statement from one of the people from this group caught my attention:
“My first and greatest objection to the research is that they did not have any valid statistical test comparing the software directly to human graders,” said Mr. Perelman, a retired director of writing and a current researcher at M.I.T.”
Take 3: I end up navigating through the pages of humanreaders.org and read their reports and conclusions, purely because of the above statement.
Take 4: This post comes out. … and seems to become a couple of posts soon.
So, the humanreaders.org’s major claim is:
“We call for schools, colleges, and educational assessment programs to stop using computer scoring of student essays written during high-stakes tests.”
As someone not doing anything directly with automated scoring but having academic interest in it owing to its proximity to what I do, I was naturally curious after seeing such a strong statement.
At this point, I have to state what I think about it. I think automated scoring is a nice complimentary system to have, along with human evaluators. This is also why I like the GRE/GMAT-AWA section style scoring model. For example, here is what they say on the ETS website, about GRE essay scoring:
“For the Analytical Writing section, each essay receives a score from at least one trained reader, using a six-point holistic scale. In holistic scoring, readers are trained to assign scores on the basis of the overall quality of an essay in response to the assigned task. The essay score is then reviewed by e-rater, a computerized program developed by ETS, which is used to monitor the human reader. If the e-rater evaluation and the human score agree, the human score is used as the final score. If they disagree by a certain amount, a second human score is obtained, and the final score is the average of the two human scores.”
(Link with more detailed explanation here)
Also, in the context of MOOCs and the sheer number of students that enrol in them, perhaps, its a worthwhile idea to explore ways of evaluating them better. Surely, when you are offering courses for free or minimal charges and you have thousands of students, you cannot afford to manually grade each and every student test script. I do like the idea of peer-reviewed essay grading too, though.
Coming back to the topic, the humanwriters.org homepage continued:
Independent and industry studies show that by its nature computerized essay rating is
* trivial, rating essays only on surface features such as word size, topic vocabulary, and essay length
* reductive, handling extended prose written only at a grade-school level
* inaccurate, missing much error in student writing and finding much error where it does not exist
* undiagnostic, correlating hardly at all with subsequent writing performance
* unfair, discriminating against minority groups and second-language writers
* secretive, with testing companies blocking independent research into their products
-It is here, that I began feeling … “not true… not true…something is missing”… reasons? While for some of them, I would need a more detailed reading, I was surprised to see some of the other points above:
trivial: I did spend some time in the past few months, reading published (and peer-reviewed) research on this kind of systems (e-rater, for example) and at least I feel that its not really “trivial”. We can always argue –
a) “this is not how a human mind does it”
b) “there is much more than what you do now”.
(I am reminded of the Norvig-Chomsky debate as I write these above two lines!)
But, IMHO, we still cannot call the current state-of-the-art “trivial”. If it is so trivial, why is it that so many researchers still spend major part of their working hours on handling this problem?
unfair: Even if I start believing that its true, I don’t understand how we can be so sure that a human evaluator too won’t do this?
secretive: On this part, I partly agree. But, these days, there are so many competitions on automated assessment (eg: 2012 automated essay scoring competition by the Hewlett Foundation, the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge at SemEval 2013, Question Answering for Machine Reading Evaluation at CLEF 2013) and people from the industry also participate in these competitions, as far as I noticed. So, although one might still not be able to see the actual working of the respective company products or their actual student texts (hello, they are companies and like many other companies, they too have some proprietary stuff!), these competitions actually provide a scope for open research, fair play and also a scope to explore various dimensions of automated essay scoring. After just browsing through some of these things, I really can’t call things trivial… secretive, they may be… stupid, they certainly are not!
So… I ended up reading their “research findings” as well. As I started reading some of the references, I understood the power of selective reporting, again! By selectively reporting what we choose to report, we can always turn everything in our favor… and this realization perhaps is making me write these posts.🙂
PS 1: What qualification do you have? someone might ask me. I have the necessary background to understand research documents on this topic. I did some hobby experiments with this sort of stuff, with a freely accessible exam dataset and have an idea of what works and why it works when it works. I never worked on any real automated scoring system. I have no more interest than this on this topic, at least as of now.