Comments on the Editorial of “Machine Learning For Science and Society” issue

For whatever reason, I am more fascinated by the applied aspects of any research, and Machine Learning (ML) is no exception. While I use machine learning approaches in my work and studied the basics during my master's (and, on and off, during my PhD now), I never found much information on what happens to all the hundreds of new algorithms proposed every year. How many of them actually get used by non-ML researchers working on some other problem? How many of them get used by others who want to solve some real-world problems?

I attended the Machine Learning Summer School in 2013, where, for two weeks, I was fortunate enough to listen to some of the best researchers in the field speak about ML in general and their work in particular. However, I got the feeling that the community is not so keen on a reality check about the applicability of these algorithms. So, basically, the questions remained.

“Machine learning that matters” (Kiri Wagstaff, 2012) is an article I keep thinking about whenever this sort of discussion comes up with fellow grad students. (My thoughts on it here.) In the past few days, there have been a lot of short online/offline discussions about how an effort to do more evaluation on real-world scenarios/datasets is perceived by reviewers at various academic conferences (disclaimer: these discussions are not exclusively about ML, but some of the people in these discussions happen to be grad students working in ML).
We, with our own shortcomings and limitations, drew some conclusions (which are perhaps not of interest to anyone), and I was reminded of another inspiring article that I have thought about several times in the past few months.

The Article: Machine learning for science and society (Editorial)
Authors: Cynthia Rudin and Kiri L. Wagstaff
Details: Machine Learning (2014) 95:1–9
Url here

This article is an editorial for a special issue of the Machine Learning journal called “Machine Learning For Science and Society”. The issue is a collection of research papers that tackle real-life problems, ranging from water pipe condition assessment to online advertising, through ML-based approaches. While I have not yet gone through all the papers in this issue, I think the editorial is worth a read for anyone with even a remote curiosity about the phrase “Machine Learning”.

It discusses the issues that arise when you decide to study the real-life impact of ML: What exactly counts as evaluation from an applied perspective? How much does this evaluation differ based on the application domain? How do domain experts see ML: do they look for a great model, or a good model that is interpretable? How does the ML community see such research? What is ML good for? What is the need for this special focused issue at all? And so on.

I will not go on and on like this, but I would like to quote a few things from the paper, hoping it's not a copyright violation.

The abstract:

“The special issue on “Machine Learning for Science and Society” showcases machine learning work with influence on our current and future society. These papers address several key problems such as how we perform repairs on critical infrastructure, how we predict severe weather and aviation turbulence, how we conduct tax audits, whether we can detect privacy breaches in access to healthcare data, and how we link individuals across census data sets for new insights into population changes. In this introduction, we discuss the need for such a special issue within the context of our field and its relationship to the broader world. In the era of “big data,” there is a need for machine learning to address important large-scale applied problems, yet it is difficult to find top venues in machine learning where such work is encouraged. We discuss the ramifications of this contradictory situation and encourage further discussion on the best strategy that we as a field may adopt. We also summarize key lessons learned from individual papers in the special issue so that the community as a whole can benefit.”

Then, the four points starting from: “If applied research is not considered publishable in top ML venues, our field faces the following disadvantages:”

1. “We lose the flow of applied problems necessary for stimulating relevant theoretical work ….”
2. “We further exacerbate the gap between theoretical work and practice. …”
3. “We may prevent truly new applications of ML to be published in top venues at all (ML or not). …”
4. “We strongly discourage applied research by machine learning professionals. … “

(Read the relevant section in the paper for details.)

The paragraph that followed, where examples of a few applications of ML were mentioned:

“The editors of this special issue have worked on both theoretical and applied topics, where the applied topics between us include criminology (Wang et al. 2013), crop yield prediction (Wagstaff et al. 2008), the energy grid (Rudin et al. 2010, 2012), healthcare (Letham et al. 2013b; McCormick et al. 2012), information retrieval (Letham et al. 2013a), interpretable models (Letham et al. 2013b; McCormick et al. 2012; Ustun et al. 2013), robotic space exploration (Castano et al. 2007; Wagstaff and Bornstein 2009; Wagstaff et al. 2013b), and scientific discovery (Wagstaff et al. 2013a).”

Last, but not least, the comments on inter-disciplinary research rang so true that I put the quote up in my room, and a few others did the same at the inter-disciplinary grad school I am a part of. :-)

“..for a true interdisciplinary collaboration, both sides need to understand each other’s specialized terminology and together develop the definition of success for the project. We ourselves must be willing to acquire at least apprentice-level expertise in the domain at hand to develop the data and knowledge discovery process necessary for achieving success. ”

This has been one of those articles that I have thought about again and again… and kept recommending to people working in areas as diverse as psychology, sociology, and computer science, and even to people who are not into academic research at all! :-) (I wonder what these people think of me for sending them this “seemingly unrelated” article to read, though.)

P.S.: It so happens that an ML article inspired me to write this post. But, on a personal front, the questions posed in the first paragraph remain the same even for my own field of research, Computational Linguistics, and perhaps for any other field too.

P.S. 2: This does not mean I have some fantastic solution to the dilemmas of all senior researchers and grad students who are into inter-disciplinary and/or applied research and at the same time don't want to perish because they can't publish in the conferences/journals of their main field.

Published on July 8, 2014 at 3:15 pm

Notes from EACL2014

(This is a note-taking post. It may not be of particular interest to anyone.)


I was at EACL 2014 this week, in Gothenburg, Sweden. I am yet to give a detailed reading to most of the papers that interested me, but I thought it's a good idea to list things down.

I attended the PITR workshop and noticed that there were more interested people, both among the authors and the audience, compared to last year. Despite the inconclusive panel discussion, I found the whole event interesting and stimulating, primarily because of the diversity of topics presented. There seems to be an increasing interest in performing eye-tracking experiments for this task. Some papers that particularly interested me:

One Step Closer to Automatic Evaluation of Text Simplification Systems by Sanja Štajner, Ruslan Mitkov and Horacio Saggion

An eye-tracking evaluation of some parser complexity metrics – Matthew J. Green

Syntactic Sentence Simplification for French – Laetitia Brouwers, Delphine Bernhard, Anne-Laure Ligozat and Thomas Francois

An Open Corpus of Everyday Documents for Simplification Tasks – David Pellow and Maxine Eskenazi

An evaluation of syntactic simplification rules for people with autism – Richard Evans, Constantin Orasan and Iustin Dornescu

(If anyone has read this far and is interested in any of these papers, they are all open access and can be found online by searching for the title.)


Moving on to the main conference papers, I am listing here everything that piqued my interest, right from papers I know only by title for the moment to those for which I heard the authors talk about the work.

Parsing, Machine Translation, etc.

* Is Machine Translation Getting Better over Time? – Yvette Graham; Timothy Baldwin; Alistair Moffat; Justin Zobel

* Improving Dependency Parsers using Combinatory Categorial Grammar-Bharat Ram Ambati; Tejaswini Deoskar; Mark Steedman

* Generalizing a Strongly Lexicalized Parser using Unlabeled Data- Tejaswini Deoskar; Christos Christodoulopoulos; Alexandra Birch; Mark Steedman

* Special Techniques for Constituent Parsing of Morphologically Rich Languages – Zsolt Szántó; Richárd Farkas

* The New Thot Toolkit for Fully-Automatic and Interactive Statistical Machine Translation- Daniel Ortiz-Martínez; Francisco Casacuberta

* Joint Morphological and Syntactic Analysis for Richly Inflected Languages – Bernd Bohnet, Joakim Nivre, Igor Bogulavsky, Richard Farkas, Filip Ginter and Jan Hajic

* Fast and Accurate Unlexicalized parsing via Structural Annotations – Maximilian Schlund, Michael Luttenberger and Javier Esparza

Information Retrieval, Extraction stuff:

* Temporal Text Ranking and Automatic Dating of Text – Vlad Niculae; Marcos Zampieri; Liviu Dinu; Alina Maria Ciobanu

* Easy Web Search Results Clustering: When Baselines Can Reach State-of-the-Art Algorithms – Jose G. Moreno; Gaël Dias


* Now We Stronger than Ever: African-American English Syntax in Twitter- Ian Stewart

* Chinese Native Language Identification – Shervin Malmasi and Mark Dras

* Data-driven language transfer hypotheses – Ben Swanson and Eugene Charniak

* Enhancing Authorship Attribution by utilizing syntax tree profiles – Michael Tschuggnall and Günter Specht

* Machine reading tea leaves: Automatically Evaluating Topic Coherence and Topic model quality by Jey Han Lau, David Newman and Timothy Baldwin

* Identifying fake Amazon reviews as learning from crowds – Tommaso Fornaciari and Massimo Poesio

* Using idiolects and sociolects to improve word predictions – Wessel Stoop and Antal van den Bosch

* Expanding the range of automatic emotion detection in microblogging text – Jasy Suet Yan Liew

* Answering List Questions using Web as Corpus – Patricia Gonçalves; Antonio Branco

* Modeling unexpectedness for irony detection in twitter – Francesco Barbieri and Horacio Saggion

* SPARSAR: An Expressive Poetry reader – Rodolfo Delmonte and Anton Maria Prati

* Redundancy detection in ESL writings – Huichao Xue and Rebecca Hwa

* Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules – Advaith Siddharthan and Angrosh Mandya

* Verbose, Laconic or Just Right: A Simple Computational Model of Content Appropriateness under length constraints – Annie Louis and Ani Nenkova

* Automatic Detection and Language Identification of Multilingual Document – Marco Lui, Jey Han Lau and Timothy Baldwin

Now, in the coming days, I should at least try to read the intros and conclusions of some of these papers. :-)

Published on May 2, 2014 at 3:10 pm

“Linguistically Naive != Language Independent” and my soliloquy

This post is about a paper that I read today (which inspired me to write a real blog post after months!)

The paper: Linguistically Naive != Language Independent: Why NLP Needs Linguistic Typology
Author: Emily Bender
Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, pages 26–32. ACL.

In short, this is a position paper that argues that incorporating linguistic knowledge is a must if we want to create truly language-independent NLP systems. Now, on the surface, that looks like a contradictory statement. Well, it isn't… and it is common sense, in.. er.. some sense ;)

So, time for some background: an NLP algorithm that offers a solution to some problem is called language independent if the approach can work for any language apart from the one for which it was initially developed. One common example is Google Translate. It is a practical example of how an approach can work across multiple language pairs (with varying efficiency, of course, but that is a different matter). The point of these language-independent approaches is that, in theory, you can just apply the algorithm to any language as long as you have the relevant data for that language. However, such approaches in contemporary research typically eliminate any linguistic knowledge from their modeling and thereby make it “language” independent.

Now, what the paper argues for is clear from the title – “linguistically naive != language independent”.

I liked the point made in Section 2, that in some cases the surface appearance of language independence is actually hidden language dependence. The specific example of n-grams and how efficiently they work, albeit for languages with certain kinds of properties, and the claim of language independence: that nailed down the point. Over a period of time, I became averse to the idea of using n-grams for each and every problem, as I thought this was not giving any useful insights, either from a linguistic or from a computational perspective (this is my personal opinion). However, although I did think of this language-dependent aspect of n-grams, I never clearly put it this way and just accepted the “language independence” claim. Now, this paper changed that acceptance. :-)
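The n-gram point can be made concrete with a toy sketch (my own illustration, not from the paper): a word-bigram extractor that tokenizes on whitespace. The whitespace assumption is invisible when the development language is English, but on a script written without spaces the "language independent" method silently produces nothing, which is exactly the hidden language dependence being described.

```python
def word_bigrams(text):
    """Extract word bigrams, assuming words are separated by whitespace."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

english = "the cat sat on the mat"
unsegmented = "猫坐在垫子上"  # roughly the same sentence, written without spaces

print(word_bigrams(english))      # five bigrams
print(word_bigrams(unsegmented))  # no bigrams at all: the whole sentence is one "token"
```

Nothing in the code mentions English, yet the `split()` call quietly overfits to languages that mark word boundaries with spaces.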

One good thing about this paper is that it does not stop there. It also discusses approaches that use language modeling but do slightly more than n-grams to accommodate various types of languages (factored language models), and it talks about how a “one size fits all” approach won't work. There is this gem of a statement:

“A truly language independent system works equally well across languages. When a system that is meant to be language independent does not in fact work equally well across languages, it is likely because something about the system design is making implicit assumptions about language structure. These assumptions are typically the result of “overfitting” to the original development language(s).”

Then there is a section on language independence claims and the representation of languages belonging to various families in the papers of ACL 2008. It concludes:
“Nonetheless, to the extent that language independence is an important goal, the field needs to improve both its testing of language independence and its sampling of languages to test against.”

Finally, the paper talks about one form of linguistic knowledge that can be incorporated into NLP systems, linguistic typology, and gives pointers to some useful resources and relevant research in this direction.

And I too conclude the post with the two main points that I hope the research community takes note of:

(1) “This paper has briefly argued that the best way to create language-independent systems is to include linguistic knowledge, specifically knowledge about the ways in which languages vary in their structure. Only by doing so can we ensure that our systems are not overfitted to the development languages.”

(2) “Finally, if the field as a whole values language independence as a property of NLP systems, then we should ensure that the languages we select to use in evaluations are representative of both the language types and language families we are interested in.”

Good paper, and a considerable amount of food for thought! These are important design considerations, IMHO.

The extended epilogue:

At NAACL 2012, there was a tutorial titled “100 Things You Always Wanted to Know about Linguistics But Were Afraid to Ask”, by Emily Bender. At that time, although I could in theory have attended the conference, I could not, as I had to go to India. But this was one tutorial that caught my attention with its name and description, and I really wanted to attend it.

Thanks to a colleague who attended, I managed to see the slides of the tutorial (which I later saw on the professor’s website). Last week, during some random surfing, I realized that an elaborate version was released as a book:

Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax
by Emily Bender
Pub: Synthesis Lectures on Human Language Technologies, Morgan and Claypool Publishers

I happily borrowed the book via inter-library loan, and it traveled for a few days to reach me from somewhere in Lower Saxony all the way here in Baden-Württemberg. Just imagine, it traveled all that way just for my sake! ;) :P

So, I started to go through the book. Even in the days when I lacked any basic knowledge of this field, I always felt that natural language processing should involve some form of linguistic modeling by default. However, most of the successful so-called “language independent” approaches (some of which also became products we use regularly, like Google Translate and Transliterate) never speak about such linguistic modeling (at least, not many that I read).

There is also the Norvig vs. Chomsky debate, which I am reminded of whenever I think of this topic. (Neither of them is wrong in my view, but that is not the point here.)

In this context, I found the paper particularly worth sharing. Anyway, I perhaps should end the post. While reading the introductory parts of Emily Bender’s book, I found a reference to the paper, and this blog post came out of that reading experience.

Published on January 23, 2014 at 5:04 pm

MLSS 2013 – Week 1 recap

I am attending this year's Machine Learning Summer School, and we just finished one week of lectures. I thought now is the moment to look back and note down my thoughts (mainly because, thankfully, we don't have lectures on Sundays!). One more week to go, and I am already very glad that I am here listening to all these amazing people, who are undoubtedly some of the best researchers in this area. There is also a very vibrant and smart student community.

Until Saturday evening, my thoughts on the summer school focused more on the content of the sessions: the mathematics in them, my comfort and discomfort with it, its relevance, understanding its conceptual basis, and so on. I won't claim that I understood everything. I understood some talks better, some not at all. I also understood that things could have been much better for me if we had been told why we needed to seriously follow all those Engineering Mathematics courses during my bachelor's ;).

However, coming to the point: as I listened to the Multilayer Nets lecture by Leon Bottou on Saturday afternoon, there was something I found particularly striking. It looks like two things that I always thought of as possibly interesting aspects of Machine Learning are not really a part of the real machine learning community. (Okay, one summer school is not a whole community, but I did meet some people who have been in this field of research for years now.)

1) What exactly are you giving as input for the machine to learn? Shouldn’t we give the machine proper input for it to learn what we expect it to learn?

2) Why isn't the interpretability of a model an issue worth researching?

Let me elaborate on these.

Coming to the first one: this is called “Feature Engineering”. The answer I heard from one senior researcher for this question was: “We build algorithms that will enable the machine to learn from anything. Features are not our problem. The machine will figure that out.” But won't the machine need the right ecosystem for that? If I grow up in a Telugu-speaking household and am exposed to Telugu input all the time, would I be expected to learn Telugu or Chinese? Likewise, if we want to construct a model that does a specific task, is it not our responsibility to prepare the input for that? Okay, we can build systems that figure out the useful features by themselves. But won't that make the machine learn anything from the several possible problem subspaces, instead of the specific thing we want it to learn? Yes, there are always ways to assess whether it is learning the right thing. But that's not the point. In a way, this connects back to the second question.
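To make my worry concrete, here is a toy sketch (entirely my own hypothetical example, not anything from the lectures): the same trivially simple learner given two different feature representations of the same words. With an informative feature (word length) it separates short from long words perfectly; with an arbitrary feature (a parity of the first character's code) it sits at chance, so the input we prepare does constrain what the machine can learn.

```python
def nearest_centroid_accuracy(features, labels):
    """Fit a 1-D nearest-centroid classifier and score it on the same data."""
    c0 = sum(f for f, l in zip(features, labels) if l == 0) / labels.count(0)
    c1 = sum(f for f, l in zip(features, labels) if l == 1) / labels.count(1)
    preds = [0 if abs(f - c0) <= abs(f - c1) else 1 for f in features]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

words = ["a", "an", "to", "it", "elephant", "umbrella", "mountain", "dinosaur"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # task: short words vs long words

good = [len(w) for w in words]        # informative feature for this task
bad = [ord(w[0]) % 2 for w in words]  # arbitrary feature, unrelated to the task

print(nearest_centroid_accuracy(good, labels))  # 1.0
print(nearest_centroid_accuracy(bad, labels))   # 0.5, i.e. chance
```

The learner is identical in both runs; only the feature representation changes, and with it the outcome.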

I am not knowledgeable enough in this field to come up with a well-argued response to that comment by the senior researcher. It is also a matter of fact that there is enough evidence that the approach does work in some scenarios. But this is a general question about the applicability of the models, issues regarding domain adaptation if any, etc. I found very little literature on theoretical aspects connecting feature engineering to algorithm design, hence these basic doubts.

The second question is also something that I have been thinking about for a long time now. Are people really not bothered about how those who apply Machine Learning in their fields interpret the models, or am I bad at searching for the right things? Why is there no talk about the interpretability of models? I did find a small amount of literature on “human-comprehensible machine learning” and related research, but not much.

I am still in the process of thinking, reading, and understanding more on this topic. I will perhaps write another detailed post soon (with whatever limited awareness I have of this topic). But, in the meanwhile:

* Here is a blogpost by a grad student, that has some valid points on interpretability of models.

* “Machine Learning that matters”, the ICML 2012 position paper by Kiri Wagstaff. This is something I keep getting back to time and again, whenever I get into thinking about these topics. Not that the paper answers my questions, but it keeps me motivated to think about them.

* An older blogpost on the above paper which had some good discussion in the comments section.

With these thoughts, we march towards the second week of awesomeness at MLSS 2013 :-).

Published on September 1, 2013 at 3:31 pm

Notes from ACL

This is the kind of post that would not interest anyone except me, perhaps. I was at ACL (in a very interesting city called Sofia, the capital of Bulgaria) last week, and I am still in the process of making notes on the papers that interested me, abstracts that raised my curiosity, short- and long-term interest topics, etc. I thought it was probably a better idea to arrange at least the titles into some subgroups and save them somewhere, so that it would be easy for me to get back to them later. I did not read all of them completely. In fact, for a few of them, I did not even go beyond the abstract. So, don't ask me questions. Anyone interested in any of these titles can either read them by googling for them or visit the ACL Anthology page for ACL'13 and find the PDFs there.

The first two sections below are my current topics of interest. The third is a general topic of interest. The fourth includes everything else that piqued my interest. The fifth section is on teaching CL/NLP, which is also a long-term interest topic for me. The final section is about whole workshops that I have an interest in.


Various dimensions of the notion of text difficulty, readability
* Automatically predicting sentence translation difficulty – Mishra and Bhattacharya
* Automatic detection of deception in child produced speech using syntactic complexity features – Yancheva and Rudzicz
* Simple, readable sub sentences – Klerke and Sogaard
* Improving text simplification language modeling using Unsimplified text data – Kauchak
* Typesetting for improved readability using lexical and syntactic information – Salama
* What makes writing great?: First experiments on Article quality prediction in the science journalism domain, Louis and Nenkova
* Word surprisal predicts N400 amplitude during reading – Frank
* An analysis of memory based processing costs using incremental deep syntactic dependency parsing – van Schijndel

Language Learning, Assessment etc.
* Discriminative Approach to fill-in-the-blank quiz generation for language learners
* Modeling child divergences from Adult Grammar with Automatic Error Correction
* Automated collocation suggestion for Japanese second language learners
* Reconstructing an Indo-European family tree from non-native English texts
* Word association profiles and their use for automated scoring of essays -Klebanov and Flor.
* Grammatical error correction using Integer Linear programming
* A learner corpus based approach to verb suggestion for ESL
* Modeling thesis clarity in student essays – Persing & Ng
* Computerized analysis of a verbal fluency test – Szumlanski
* Exploring word class n-grams to measure language development in children. Ramirez-de-la-Rosa

NLP for other languages:
* Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison – Esmaili and Salavati
* Identifying English and Hungarian light verb constructions: A contrastive approach – Vincze
* Real-world semi-supervised learning of POS taggers for low-resource languages -Garrette
* Learning to lemmatize Polish noun phrases – Radziszewski
* Sentence level dialect identification in Arabic – Elfardy and Diab

* Exploring Word Order Universals: a probabilistic graphical model approach – Xia Lu.
* An opensource toolkit for quantitative historical linguists
* SORT: An improved source rewriting tool for improved translation
* unsupervised consonant-vowel prediction over hundreds of languages
* Linguistic models for analyzing and detecting biased language.
* Earlier identification of Epilepsy surgery candidates using natural language processing – Matykiewicz
* Parallels between linguistics and biology. Chakraborti and Tendulkar
* Analysing lexical consistency in translation – Guillou
* Associative texture is lost in translation – Klebanov and Flor

Teaching CL, NLP:
* Artificial IntelliDance: Teaching Machine learning through choreography, Agarwal and Trainor
* Treebanking for data-driven research in the classroom, Lee
* Learning computational linguistics through NLP evaluation events: the experience of Russian evaluation initiative. Bonch-Osmolovskaya
* Teaching the basics of NLP and ML in an introductory course to Information Science. Agarwal.

whole workshops and competitions:
* Shared task on quality estimation in Machine translation
* Predicting and improving textual readability for target reader populations (PITR 2013)

Published on August 14, 2013 at 9:09 am

Automated Grading and Counter Arguments-4 (last)

(Continued from Part-3)

Finding 9: machine scoring shows a bias against second-language writers (Chen & Cheng, 2008) and minority writers such as Hispanics and African Americans (Elliot, Deess, Rudniy, & Joshi, 2012)

Report-1: The best part about this report is the stance it takes. It immediately got me interested in it.

“Given the fact that many AWE programs have already been in use and involve multiple stakeholders, a blanket rejection of these products may not be a viable, practical stand.

A more pressing question, accordingly, is probably not whether AWE should be used but how this new technology can be used to achieve more desirable learning outcomes while avoiding potential harms that may result from limitations inherent in the technology.”

– This has been exactly my problem with such statements on the website.

The study primarily supports the idea of integrating human and machine assessments, by taking advantage of good things in both, as mentioned below:

“The AWE implementation was viewed comparatively more favorably when the program was used to facilitate students’ early drafting and revising process, and when the teacher made a policy of asking students to meet a preliminary required standard and subsequently provided human feedback. The integration of automated assessment and human assessment for formative learning offers three advantages….”

At least, I did not find a direct mention of a “bias” against second-language writers in this report! We need to stretch our imagination a bit to reach that conclusion!

Report-2: The second report was already mentioned under Finding 7. As before, I did not find these results directly relevant to this “finding”. However, I see the point in raising the issue. But what I don't understand is that this is just like some American coming and correcting Indian English :P This kind of “bias” can exist in humans as well. What really is the way to handle this, manually or automatically? This does not make the case favourable to human assessment (IMHO).

Finding 10: for all these reasons, machine scores predict future academic success abysmally (Mattern & Packman, 2009; Matzen & Hoyt, 2004; Ramineni & Williamson, 2013 – not freely accessible)

– I actually did not go through these references beyond their intro and conclusion sections, as I felt the “finding” is too much of a blanket statement to be connected to them. Skimming through the two freely accessible reports among these confirmed my suspicion. These reports focus more on critically analyzing automated systems and suggesting ways to improve them and combine them with other things, and not on saying “machine scores predict future academic success abysmally”.

Part-2 of these findings were focused on the statement: “machine scoring does not measure, and therefore does not promote, authentic acts of writing”

While I intend to stop here (partly because it's so time-consuming to go through so many reports and partly because of a growing feeling of irritation with these claims), some of these Part 2 findings made me think, some made me smile, and some made me fire a question back…

Those that made me think:
“students who know that they are writing only for a machine may be tempted to turn their writing into a game, trying to fool the machine into producing a higher score, which is easily done ”
– This is something that recurs in my thoughts each time I think of automated assessment… and I know it is not super difficult to fool the machine in some cases.

“as a result, the machine grading of high-stakes writing assessments seriously degrades instruction in writing (Perelman, 2012a), since teachers have strong incentives to train students in the writing of long verbose prose, the memorization of lists of lengthy and rarely used words, the fabrication rather than the researching of supporting information, in short, to dumb down student writing.”
– This part is actually the main issue. Rather than just making blanket claims and pushing everything aside, such initiatives should understand the inevitability of automated assessment and focus on how to combine it with human evaluation and other means better!
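As a toy illustration of how gameable surface features are (a hypothetical scorer of my own invention, not any real AWE product), consider a "scorer" built on exactly the proxies the quoted critique worries about: essay length and long words. Verbose nonsense then outscores a short, coherent answer.

```python
def naive_score(essay):
    """Score an essay on surface proxies only: word count and average word length."""
    words = essay.split()
    avg_word_len = sum(len(w) for w in words) / len(words)
    return 0.1 * len(words) + avg_word_len  # longer essays with fancier words win

coherent = "Dogs are loyal companions and help humans in many ways."
verbose_nonsense = ("Multitudinous perspicacious quadrupeds perpetually "
                    "ameliorate anthropocentric circumstances notwithstanding "
                    "epistemological obfuscation ") * 3

print(naive_score(coherent) < naive_score(verbose_nonsense))  # True
```

Real systems use richer features than this, of course, but the incentive structure the critique describes (train students to write long, verbose prose) follows whenever the features are surface proxies for quality.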

Those that rather made me smile, although I can't brush these things aside as impossible:

“students are subjected to a high-stakes response to their writing by a device that, in fact, cannot read, as even testing firms admit.”

“in machine-scored testing, often students falsely assume that their writing samples will be read by humans with a human’s insightful understanding”

“teachers are coerced into teaching the writing traits that they know the machine will count .. .. and into not teaching the major traits of successful writing.. .. ”

– No, I don't have any specific comments on these. But, in parts, this is very imaginative… and in parts, it is not entirely impossible.

Those that gave new questions:
“conversely, students who knowingly write for a machine are placed in a bind since they cannot know what qualities of writing the machine will react to positively or negatively, the specific algorithms being closely guarded secrets of the testing firms (Frank, 1992; Rubin & O’Looney, 1990)—a bind made worse when their essay will be rated by both a human and a machine”
-as if we know what humans expect! I thought I wrote a very good essay on Sri Sri in my SSC exam's question "my favourite poet"… and I got fewer marks (compared to my past performances) in Telugu, of all things. Apparently, teachers in school and those external examiners did not think alike! (Or probably the examiner was a Sri Sri hater!)

“machines also cannot measure authentic audience awareness”
– Who can? Can humans do that with fellow humans? I don’t think so. I know there are people who think I am dumb. There are also those who think I am smart. There are also those who think I am mediocre. Who is right?

Conclusion of this series:

Although I did not do a real research-like reading, and this is not some peer-reviewed article series, I spent some time on it, and it has been an enriching experience in terms of the insights it provided into the field of automated assessment and its criticisms.

What I learnt are two things:
* Automated assessment is necessary and cannot be avoided in the future (among several reasons, because of the sheer number of students compared to the number of trained evaluators)
* Overcoming the flaws of automated assessments and efficient ways to combine it with trained human evaluators is more important, realistic and challenging than just branding everything as rubbish.

Although the above two are rather obvious, the group's "findings", and the way such "findings" immediately grab media attention, convinced me of them more strongly than before! :-)

Published in: on May 4, 2013 at 12:40 pm  Leave a Comment  

Automated Grading and Counter Arguments-3

(Continued from part-2. All parts can be seen here)

Finding 5:
machines require artificial essays finished within very short time frames (20-45 minutes) on topics of which student writers have no prior knowledge (Bridgeman, Trapani, & Yigal, 2012; Cindy, 2007; Jones, 2006; Perelman, 2012b; Streeter, Psotka, Laham, & MacCuish, 2002; Wang, & Brown, 2008; Wohlpart, Lindsey, & Rademacher, 2008)

The first, second, third, fourth, and seventh references were not freely accessible.

Fifth Report (just search for the title on Google): The primary conclusion of this report is: "The automated grading software performed as well as the better instructors in both trials, and well enough to be usefully applied to military instruction. The lower reliabilities observed in these essay sets reflect different instructors applying different criteria in grading these assignments. The statistical models that are created to do automated grading are also limited by the variability in the human grades."

Sixth Report:
This report studied the correlation between machine and human raters and concluded, contrary to all previous studies, that there is no significant correlation between them.

While this was an interesting conclusion, I had a few questions which I could not resolve, as those references couldn't be found online for free. Here are the questions I have.

1. I think all these automated scoring systems work well within their own framework. The GRE exam grading, for instance, works for GRE-like essays but not for, say, scoring what I write on my blog. But they are all customizable to any such target-specific responses (the NYT article on EdX, which began this series of blog posts, makes that clear: it needs 100 sample essays and it will learn how to score essays from the 101st on, to put it simply). So, is it fair to test IntelliMetric, or whatever system, on essays that are about something else, and from another domain?

2. As far as I remember, all those other reports used Pearson's correlation. This report used Spearman's correlation. Can we compare the numbers directly? Why can't both be included?
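For what it's worth, the two coefficients measure different things (Pearson captures linear association; Spearman only rank, i.e., monotonic, agreement), so the raw numbers are indeed not directly comparable. A toy illustration with made-up scores:

```python
# Pearson vs. Spearman on invented human/machine scores: one outlier
# machine score breaks linearity but leaves the ranking intact, so the
# two coefficients diverge sharply.
from scipy.stats import pearsonr, spearmanr

human   = [1, 2, 3, 4, 5, 6]
machine = [1, 2, 3, 4, 5, 20]   # ranks agree perfectly; values do not

pearson_r, _ = pearsonr(human, machine)
spearman_r, _ = spearmanr(human, machine)
print(f"Pearson:  {pearson_r:.3f}")   # well below 1
print(f"Spearman: {spearman_r:.3f}")  # exactly 1.0
```

So a study reporting Spearman and another reporting Pearson can legitimately reach different-looking numbers on similar data.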

– So far, except for the fourth one, the reports seemed to be more about how well the automated scores approximate human judgements (going by the abstracts, which are freely available to read!).

Finding 6: in these short trivial essays, mere length becomes a major determinant of score by both human and machine graders (Chodorow & Burstein, 2004; Perelman, 2012b)

First Report:
I actually found this report to be a very interesting read, for several reasons. Some of them are mentioned below:

“Since February 1999, ETS has used e-rater as one of the two initial readers for the GMAT writing assessments, and, in this capacity, it has scored more than 1 million essays. E-rater’s scores either match or are within one point of the human reader scores about 96% of the time”
-I thought this was an amazing performance (especially maintaining such a high percentage after grading a million essays or more!). I also liked the part where they compare machine-human agreement with human-human agreement, which is why I again think that no single entity should be the sole scorer (there should be either two humans, two differently built machines, or a machine-human combination).
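That 96% figure is an "exact or adjacent" agreement rate, which is straightforward to compute. A minimal sketch with invented scores:

```python
# Exact and exact-or-adjacent agreement between two raters, the statistic
# quoted for e-rater above. All scores here are made up for illustration.
def agreement_rates(human, machine):
    exact = sum(h == m for h, m in zip(human, machine))
    adjacent = sum(abs(h - m) == 1 for h, m in zip(human, machine))
    n = len(human)
    return exact / n, (exact + adjacent) / n

human   = [4, 3, 5, 2, 4, 6, 3, 4]
machine = [4, 3, 4, 2, 5, 6, 1, 4]

exact, within_one = agreement_rates(human, machine)
print(f"exact: {exact:.3f}, exact or adjacent: {within_one:.3f}")
```

On these eight toy essays, five scores match exactly and two more differ by one point, so the "exact or adjacent" rate is 7/8.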

While I am not completely comfortable with an automated system being the sole scorer even in low-stakes high-school exams, I have the same feeling about human evaluators too. I think bias can exist both in a human and in a machine (in a machine, because its intelligence is limited: it learns only what it sees). I guess at that level you always have the option to go back and demand a re-evaluation?

Primary conclusion of this report was that: “In practical terms, e-rater01 differs from human readers by only a very small amount in exact agreement, and it is indistinguishable from human readers in adjacent agreement. But despite these similarities, human readers and e-rater are not the same. When length is removed, human readers share more variance than e-rater01 shares with HR.”
-I think this is what they used to come up with this finding. I do not know how to interpret it, although it disappoints me a bit. Still, the ETS report made a very interesting read.
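"When length is removed" refers, as far as I understand, to partialling essay length out of the human-machine correlation. A sketch of the first-order partial correlation formula, with invented numbers:

```python
# Partial correlation between human and machine scores, controlling for
# essay length: r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1-r_xz^2)(1-r_yz^2)).
# All data below is invented for illustration.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))

human   = [3, 4, 2, 5, 4, 3, 6, 2]
machine = [3, 5, 2, 5, 4, 4, 6, 3]
length  = [320, 410, 180, 450, 390, 260, 600, 300]  # word counts

raw = pearson(human, machine)
controlled = partial_corr(human, machine, length)
print(f"raw r(human, machine):     {raw:.3f}")
print(f"partial r, length removed: {controlled:.3f}")
```

Because length correlates strongly with both sets of scores in this toy data, the correlation shrinks considerably once it is removed, which is the pattern the ETS report describes.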

A second conclusion that this "finding" does not focus on, but which I found interesting, was the effect of the test taker's native language on the overall scores: "There is a main effect for language in the data of the current study. The mixed and Spanish groups have higher scores on average than the Arabic and Japanese. These differences remain even when length is removed. For most of the prompts, e-rater01 shows the same pattern of differences across native language groups as HR, even with length differences partialed out. Future work should include additional language groups to see if these results generalize."
– After the recent Native Language Identification shared task, I got really curious about this very topic – the interaction between a test taker's native language and the scores they get. I think this is something I might need to study further, irrespective of the disappointments during these readings!

Again, only the first report (from ETS) is freely accessible. If the group's purpose was to actually convey their message to a "curious" commoner, I think they are repeatedly failing at it by providing such inaccessible references. :-(

Finding 7: machines are not able to approximate human scores for essays that do fit real-world writing conditions; instead, machines fail badly in rating essays written in these situations (Bridgeman, Trapani, & Yigal, 2012; Cindy, 2007; Condon, 2013; Elliot, Deess, Rudniy, & Joshi, 2012; Jones, 2006; Perelman, 2012b; Powers, Burstein, Chodorow, Fowles, & Kukich, 2002; Streeter, Psotka, Laham, & MacCuish, 2002; Wang & Brown, 2008; Wohlpart, Lindsey, & Rademacher, 2008)

The first, second, third, fifth, sixth, seventh, and tenth reports are not freely accessible (although, looking at the abstracts, I did not get the impression that these reports are against automated scoring in any way!).

A version of the fourth report is available. Even this did not give me the impression that it has anything against automated scoring as such, although there is a critical analysis of where automated scoring fails, and it ends with a call for better systems.

(8th and 9th reports are already discussed in Finding 5)

-Again, I understand that this is an issue… but I don't understand how simply avoiding automated scoring will solve it.

Finding 8: high correlations between human scores and machine scores reported by testing firms are achieved, in part, when the testing firms train the humans to read like the machine, for instance, by directing the humans to disregard the truth or accuracy of assertions (Perelman, 2012b), and by requiring both machines and humans to use scoring scales of extreme simplicity

– Like I said before, I could not find a freely accessible copy of (Perelman, 2012b). It would have been great if some freely accessible report had been provided as additional reading here, because this claim/finding is among the most intriguing ones. First, they say the machines are bad. Now they say even the humans are bad! Should we be evaluated at all, or not? :P

Conclusion for this part:
… And so, although it's getting more and more disappointing, I decided to continue with these reports, because even if the group's analysis and claims are flawed in my view, I find the "findings" (or whatever they are) worth pondering. There may not be a real solution. But some of these issues need some thinking from people who are capable of doing it better (e.g., teachers and researchers in education, researchers in automated scoring).

(to be continued..)

Published in: on April 27, 2013 at 1:31 pm  Comments (1)  

Automated grading and counter arguments-2

(Continued from part-1)

So, I started to read their research findings.
(It would have been nice if they had also given online links to those references. Nevertheless, it was not very difficult to Google most of them.)

Finding 1: computer algorithms cannot recognize the most important qualities of good writing, such as truthfulness, tone, complex organization, logical thinking, or ideas new and germane to the topic (Byrne, Tang, Truduc, & Tang, 2010)

(The report referred to here can be read here.)
In this, after a brief discussion of the development of eGrader, an automatic grading system, the authors conclude with the following words, after deciding not to use it for automatic grading:
“The machine reader appears to penalize those students we want to nurture, those who think and write in original or different ways. For us, the subjective element which was as important as the objective aspects of the essays, proved too complex to measure.”

-It is indeed a matter of concern (this above statement)… but I wondered if there really was sufficient evidence (at least in the referred report) to extrapolate an observation from a small sample of 33 students into Finding 1.

Ten students in this report went back to the human evaluators for a regrade and got better grades afterwards. But won't this sort of disagreement also happen if there are human evaluators instead of machines? Can we be sure that two humans will always give the same score to a document? What do we do when two human scores disagree? Maybe we call a third evaluator? Isn't that what they do with machines too, now?

Also, I guess automated scoring, wherever it is used in a high-stakes assessment scenario (like grading in an entrance exam or some competitive proficiency test), is not used as the sole decider. It seems to be coupled with at least one human judge (both GRE and GMAT automatic assessment involve at least one human judge, and a second one is called in if there is a disagreement between the human and machine scores).

So, what I understood: "Finding 1" might still be true, but the document referred to there does not support that statement. It's more of a "null hypothesis" than a "finding".

Finding 2. to measure important writing skills, machines use algorithms that are so reductive as to be absurd: sophistication of vocabulary is reduced to the average length or relative infrequency of words, or development of ideas is reduced to average sentences per paragraph (Perelman, 2012b; Quinlan, Higgins, & Wolff, 2009)

(I could not trace a freely accessible link to the first reference. The second one is here.)

It's amazing that the same system that uses so many different measures to estimate various aspects of the language quality of an essay (see the last four pages of the pdf above) uses such relatively "surface" measures of lexical sophistication and development (no, I don't have any better way to measure them, so don't ask me!). However, I think the "prompt-specific vocabulary usage" description at the end actually handles the sophistication-of-vocabulary part to some extent. And there is always at least one human grader to identify things like "out of the box" thinking or novel word usages that are relevant and competent. These automatic assessment systems don't seem to be the sole decision-makers anyway!

So, I don't understand the issue, again. Further, I am amazed that so many other positive things in this second report were completely ignored, and "Finding 2" was finalized by skipping all of them!

Finding 3 machines over-emphasize grammatical and stylistic errors (Cheville, 2004) yet miss or misidentify such errors at intolerable rates (Herrington & Moran, 2012)

(The first reference does not seem to be freely available.)
-It was clear from the beginning that the authors of the second paper have a strong bias against automated scoring. That is entirely acceptable; everyone can have an opinion. But then, we can't expect that paper to be objective, can we? At least, I did not find it objective. I thought there would be some form of well-defined scientific study there. But all they discussed was one single student essay, generalized to the performance of the automated system as a whole. (I wonder why one couldn't equally do the reverse: find an essay where it works and conclude, just as confidently, that the automated system is the best! :P)

Further, I felt that an analysis of why the automated system flagged those errors was not performed. A major criticism about spelling was that the machine identified words like "texting" and "i.e." as spelling errors. But if "i.e." was supposed to be written as "i.e.," and "texting" is not a word in, say, the English dictionary, I think this should be expected. In fact, I would guess that a very conservative human evaluator might point these things out too.

So, based on this single student-essay analysis (some of which is debatable), it is concluded that the tool should be banished from classrooms… (This is where I start getting disillusioned. Should I really continue these readings? They seem so biased, and the analysis seems more like over-generalization than science anyway!)

The only interesting point in this report was about the machine's bias towards standard American English. I am curious to know more about this aspect. Actually, I did find the starting premise (that machines overemphasize stylistic errors) interesting, but this report did not live up to its own promise in terms of the quality of the analysis provided.

Finding 4 machines cannot score writing tasks long and complex enough to represent levels of writing proficiency or performance acceptable in school, college, or the workplace (Bennett, 2006; Condon, 2013; McCurry, 2010; Perelman, 2012a)

Report 1 – Bennett, 2006: There are two major claims in this. 1) “Larger differences in computer familiarity between students with the same paper writing proficiency would be associated with correspondingly bigger discrepancies in computer writing scores” and 2) “The weighting of text features derived by an automated scoring system may not be the same as the one that would result from the judgments of writing expert”
-Actually, although 1) seems rather obvious (as we need to "type" essays on a computer), the report proposes no real solution for it. As for 2), of course the weightings differ between humans and machines: machines don't learn like humans, and humans don't learn like machines! But when there is not much discrepancy between their scores, and so long as the users are satisfied, I guess this is something we can live with. Anyway, the paper did not suggest a better alternative.

Condon, 2013 and McCurry, 2010 are not freely accessible.

Report 4 – Perelman, 2012a: The conclusion of this report is that the way automated essay scoring systems evaluate language constructs is not the way actual writing teachers do it. Although this is an important point to be addressed, the other person in me is always ready to say: what we learn and what the machine learns need not be the same. Our routes to the same conclusion might not cross at all!

Concluding this part:
In a sense, the whole thing is really amazing. On one hand, they talk about the shortage of human evaluators to grade student scripts. On the other hand, they want a ban on automated assessment. Even after reading so many reports, I wonder what exactly the point of the petition is! I don't understand their solution, if there is one, yet.

The other claims and their associated reports might help (I hope!)

(To be continued)

Published in: on April 13, 2013 at 11:11 pm  Leave a Comment  

Automated grading and Counter arguments-1

Take 1: I read this NYT article about EdX's announcement that it will release its automatic grading software "free on the web, to any institution that wants to use it". (The article can be read here.)

I particularly liked this part of the statement:

“The EdX assessment tool requires human teachers, or graders, to first grade 100 essays or essay questions. The system then uses a variety of machine-learning techniques to train itself to be able to grade any number of essays or answers automatically and almost instantaneously.”
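Not EdX's actual software, of course, but the quoted workflow (train on 100 human-graded essays, then score from the 101st on) can be sketched with a toy model. Everything below is invented for illustration: two crude surface features (word count, mean word length) and an ordinary least-squares fit stand in for whatever "variety of machine-learning techniques" the real system uses.

```python
import numpy as np

# Hypothetical sketch of the "grade 100, then auto-score the rest" loop.
# Real systems use far richer features; these two are deliberately crude.
def features(essay):
    words = essay.split()
    return [1.0,                                      # bias term
            float(len(words)),                        # essay length in words
            sum(len(w) for w in words) / len(words)]  # mean word length

def fit(essays, scores):
    # least-squares fit of feature weights to the human-assigned scores
    X = np.array([features(e) for e in essays])
    y = np.array(scores, dtype=float)
    return np.linalg.lstsq(X, y, rcond=None)[0]

def predict(weights, essay):
    return float(np.dot(weights, features(essay)))

# stands in for the "100 sample essays" graded by humans
training = [
    ("short answer here", 2.0),
    ("a somewhat longer and more developed answer to the question", 4.0),
    ("an extensively elaborated considerably longer response discussing the question thoroughly", 5.0),
]
w = fit([e for e, _ in training], [s for _, s in training])
print(round(predict(w, "a brand new unseen essay of moderate length and detail"), 1))  # prints 3.8
```

Notice that a model this crude mostly rewards length, which, as it happens, is one of the criticisms discussed in the earlier posts of this series.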

Take 2: There is this group of writing professionals petitioning against machine scoring. According to the news report, "The group, which calls itself Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment, has collected nearly 2,000 signatures, including some from luminaries like Noam Chomsky."

Given my recent interest in "evaluation of evaluation", this particular statement from one of the people in this group caught my attention:

“My first and greatest objection to the research is that they did not have any valid statistical test comparing the software directly to human graders,” said Mr. Perelman, a retired director of writing and a current researcher at M.I.T.

Take 3: I end up navigating through the group's pages and reading their reports and conclusions, purely because of the above statement.

Take 4: This post comes out… and seems set to become a couple of posts soon.
So, the group's major claim is:
“We call for schools, colleges, and educational assessment programs to stop using computer scoring of student essays written during high-stakes tests.”

As someone not working directly on automated scoring, but with an academic interest in it owing to its proximity to what I do, I was naturally curious after seeing such a strong statement.

At this point, I have to state what I think about it. I think automated scoring is a nice complementary system to have alongside human evaluators. This is also why I like the GRE/GMAT AWA-style scoring model. For example, here is what the ETS website says about GRE essay scoring:

“For the Analytical Writing section, each essay receives a score from at least one trained reader, using a six-point holistic scale. In holistic scoring, readers are trained to assign scores on the basis of the overall quality of an essay in response to the assigned task. The essay score is then reviewed by e-rater, a computerized program developed by ETS, which is used to monitor the human reader. If the e-rater evaluation and the human score agree, the human score is used as the final score. If they disagree by a certain amount, a second human score is obtained, and the final score is the average of the two human scores.”
(Link with more detailed explanation here)
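The quoted procedure reduces to a small adjudication rule, which I find worth spelling out. In this sketch, the disagreement threshold (1 point) is my own placeholder, since ETS only says the scores may "disagree by a certain amount":

```python
# My reading of the GRE AWA human + e-rater scheme as a decision rule.
# The 1-point threshold is a guess; ETS does not publish the exact value.
def final_awa_score(human1, erater, second_human=None, threshold=1.0):
    """Return the final essay score under the human + e-rater scheme."""
    if abs(human1 - erater) <= threshold:
        return human1                      # e-rater merely confirms the human score
    if second_human is None:
        raise ValueError("disagreement: a second human score is required")
    return (human1 + second_human) / 2.0   # average of the two human scores

print(final_awa_score(4.0, 4.5))                     # agreement: human score stands
print(final_awa_score(3.0, 5.0, second_human=4.0))   # disagreement: average of humans
```

The appealing part, to me, is that the machine never decides the score on its own here; it only triggers extra human scrutiny.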

Also, in the context of MOOCs and the sheer number of students that enrol in them, it is perhaps a worthwhile idea to explore ways of evaluating them better. Surely, when you offer courses for free or for minimal charges and you have thousands of students, you cannot afford to manually grade every test script. I do like the idea of peer-reviewed essay grading too, though.

Coming back to the topic, the group's homepage continues:

Independent and industry studies show that by its nature computerized essay rating is

* trivial, rating essays only on surface features such as word size, topic vocabulary, and essay length
* reductive, handling extended prose written only at a grade-school level
* inaccurate, missing much error in student writing and finding much error where it does not exist
* undiagnostic, correlating hardly at all with subsequent writing performance
* unfair, discriminating against minority groups and second-language writers
* secretive, with testing companies blocking independent research into their products

-It is here that I began feeling "not true… not true… something is missing". Reasons? While some of the points would need a more detailed reading, I was surprised by the others:

trivial: I did spend some time in the past few months reading published (and peer-reviewed) research on these kinds of systems (e-rater, for example), and at least I feel that they are not really "trivial". We can always argue –
a) “this is not how a human mind does it”
b) “there is much more than what you do now”.
(I am reminded of the Norvig-Chomsky debate as I write these above two lines!)
But, IMHO, we still cannot call the current state of the art "trivial". If it is so trivial, why do so many researchers still spend a major part of their working hours on this problem?

unfair: Even if I start believing that it's true, I don't understand how we can be so sure that a human evaluator won't do this too.

secretive: On this part, I partly agree. But these days there are so many competitions on automated assessment (e.g., the 2012 automated essay scoring competition by the Hewlett Foundation, the Joint Student Response Analysis and 8th Recognizing Textual Entailment challenge at SemEval 2013, Question Answering for Machine Reading Evaluation at CLEF 2013), and people from industry also participate in them, as far as I noticed. So, although one might still not be able to see the actual workings of the companies' products or their actual student texts (hello, they are companies, and like many other companies they have proprietary stuff!), these competitions provide scope for open research, fair play, and exploring various dimensions of automated essay scoring. After browsing through some of these things, I really can't call them trivial. Secretive, they may be; stupid, they certainly are not!

So… I ended up reading their "research findings" as well. As I started reading some of the references, I understood, once again, the power of selective reporting! By selectively choosing what to report, we can always turn everything in our favor… and this realization is perhaps what is making me write these posts. :-)


PS 1: "What qualifications do you have?", someone might ask. I have the necessary background to understand research documents on this topic. I did some hobby experiments with this sort of stuff on a freely accessible exam dataset, and I have an idea of what works and why it works when it works. I never worked on any real automated scoring system. I have no more interest than this in the topic, at least as of now.

Published in: on April 6, 2013 at 8:38 pm  Comments (8)  

Machine Learning that Matters – Some thoughts.

It's almost a year since Praneeth sent me this paper and I read it… and began blogging about it. I began re-reading it today as part of my "evaluating the evaluation" readings, and thought I still have something to say (largely to myself) on some of the points made in this paper.

Machine Learning that Matters
by Kiri L. Wagstaff
Published in proceedings of ICML 2012.

This is how it begins:

“Much of current machine learning (ML) research has lost its connection to problems of import to the larger world of science and society”

-I guess the tone and intention of this paper are pretty clear from this first sentence.

I don't have any issue with the tone as such, but I thought there are so many real-world applications of machine learning these days! That doesn't mean every machine learning research problem leads to solving a real-world problem, though; that holds for any research. So the above statement, in my view, can apply to any research in general.

I was fascinated by these statistics on the hyper-focus on benchmark datasets.

A survey of the 152 non-cross-conference papers published at ICML 2011 reveals:
148/152 (93%) include experiments of some sort
57/148 (39%) use synthetic data
55/148 (37%) use UCI data
34/148 (23%) use ONLY UCI and/or synthetic data
1/148 (1%) interpret results in domain context

-Since I am not into machine learning research per se, but only use ML for computational linguistics problems, I found this very interesting… and a very valid point.

Then, the discussion moves on to evaluation metrics:

“These metrics are abstract in that they explicitly ignore or remove problem-specific details, usually so that numbers can be compared across domains. Does this seemingly obvious strategy provide us with useful information?”

-In the discussion that followed, there were some interesting points on what various evaluation metrics fail to capture. I have been reading on the topic of evaluation metrics for supervised machine learning in the recent past, and as with those readings, I am left with the same question here: what is the best evaluation, then? Of course, "real world". But how do you quantify that? How can there be an evaluation metric that's truly comparable across peer research groups?

I got my answer in the later part of the paper:

Yet (as noted earlier) the common approach of using the same metric for all domains relies on an unstated, and usually unfounded, assumption that it is possible to equate an x% improvement in one domain with that in another. Instead, if the same method can yield profit improvements of $10,000 per year for an auto-tire business as well as the avoidance of 300 unnecessary surgical interventions per year, then it will have demonstrated a powerful, wide-ranging utility.

Next part of the discussion is on identifying where machine learning matters:

“It is very hard to identify a problem for which machine learning may offer a solution, determine what data should be collected, select or extract relevant features, choose an appropriate learning method, select an evaluation method, interpret the results, involve domain experts, publicize the results to the relevant scientific community, persuade users to adopt the technique, and (only then) to truly have made a difference”

-Now, I like that. :-) :-)

I also liked this point on the involvement of the world outside ML.

“We could also solicit short “Comment” papers, to accompany the publication of a new ML advance, that are authored by researchers with relevant domain expertise but who were uninvolved with the ML research. They could provide an independent assessment of the performance, utility, and impact of the work. As an additional benefit, this informs new communities about how, and how well, ML methods work.”

“Finally, we should consider potential impact when selecting which research problems to tackle, not merely how interesting or challenging they are from the ML perspective. How many people, species, countries, or square meters would be impacted by a solution to the problem? What level of performance would constitute a meaningful improvement over the status quo?”

-Well, I personally share the sentiments expressed here. I like, and want, to work on problems whose solutions can have a real-life impact. However, I consider that my personal choice. But I don't understand what is wrong with doing something because it's challenging! What's wrong with researching for fact-finding? Certain research problems will have practical implications. Some might not have an immediate impact; some might not have a direct impact; some might never have a practical impact. But should that be the only deciding factor? (Well, of course, when researchers are funded from public taxes, perhaps it's expected to be thus. But should it always be?)

I found the six old and six new machine learning impact challenges really interesting.
Here are the new ones from the paper:

1. A law passed or legal decision made that relies on the result of an ML analysis.
2. $100M saved through improved decision making provided by an ML system.
3. A conflict between nations averted through high-quality translation provided by an ML system.
4. A 50% reduction in cybersecurity break-ins through ML defenses.
5. A human life saved through a diagnosis or intervention recommended by an ML system.
6. Improvement of 10% in one country’s Human Development Index (HDI) (Anand & Sen,1994) attributable to an ML system.

And finally, I found the last discussion, on obstacles to ML impact, also very true. I don't know why there is so little work on making machine learning output comprehensible to its users (e.g., doctors using a classifier to identify certain traits in a patient might not want to see raw SVM output and take a decision without understanding it!). At least, I did not find much work on human-comprehensible machine learning.
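For linear models at least, one simple step toward comprehensibility is to show the user each feature's signed contribution to the decision rather than the bare score. A sketch with invented features and weights (nothing here comes from a real clinical model):

```python
# Explaining a linear classifier's decision by decomposing the score into
# per-feature contributions. Weights and patient values are invented.
weights = {"age": 0.8, "blood_pressure": 1.5, "cholesterol": -0.3}
bias = -2.0
patient = {"age": 1.2, "blood_pressure": 0.9, "cholesterol": 1.1}  # standardized values

contributions = {f: weights[f] * patient[f] for f in weights}
score = bias + sum(contributions.values())

# largest-magnitude contributions first: "what pushed this decision?"
for feature, c in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature:>15}: {c:+.2f}")
print(f"{'total score':>15}: {score:+.2f}")
```

A doctor seeing "blood_pressure: +1.35" next to the final score has something to reason about; a bare distance from an SVM hyperplane offers no such handle.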

As I read it again and again, this paper seems to me like a theory-vs-practice debate (generally speaking), and is possibly worth reading for anyone outside the machine learning community too (as it was useful for me!).

End disclaimer: All those thoughts expressed are my individual feelings and are not related to my employer.:-)

Published in: on March 26, 2013 at 12:35 pm  Comments (22)  
