language independence in NLP – some thoughts

In the last session of our reading group, we discussed the following article:

title: “On Achieving and Evaluating Language Independence in NLP”
author: Emily Bender
Linguistic Issues in Language Technology, 2011.
url here
The article is an extended version of a 2009 writeup, about which I wrote here.

To summarize in a few words, the article first discusses what language independence in natural language processing system development means – in theory and in practice. Then, taking linguistic typology as a source of knowledge, it suggests some do’s and don’ts for NLP researchers working on the development of language-independent systems. I liked the idea that true language independence is possible only through the incorporation of linguistic knowledge into the system design. It took me only a few seconds to be convinced of it when I read that 2009 paper, and my opinion has not changed since. My experience working with non-English language datasets in the meantime has only strengthened that opinion.

Reading the article this time with a group of people with a linguistic rather than an engineering background gave me some new perspective. The most important thing I noticed is this: it is fairly common in CS-based NLP communities to claim language independence by assuming that an approach that works on one or two languages, often closely related ones, will work on any other language. I never knew what linguists think about that. The linguists in our group first wondered how anyone can claim language independence in general, and how difficult it is to make such a claim. We even briefly went into a philosophical discussion. As someone who started with NLP in a CS department, I should confess I never even thought of it like this until now. People in so many NLP papers claim language independence in an offhand manner… and I suddenly started seeing why it could be a myth. That was the “aha” moment of the day.

Anyway, coming back to the paper: after section 4, which has the do’s and don’ts, I found section 5 incomplete. It is an attempt to explain how computational linguistics is useful for typology and vice versa – but I did not get a complete picture. There have been a couple of recent papers which test the applicability of their approaches on multiple languages from different language families (two examples I can think of from 2015 are Soricut and Och, 2015 and Müller and Schütze, 2015).

Nevertheless, it is a very well written article and a must-read for anyone who has wondered whether all the claims of language independence are really true, and whether there are implicit considerations that favor some languages over others in the development of natural language processing systems.

Thanks to Maria, Simon, Xiaobin and Marti for a very interesting discussion!

Published on November 21, 2015 at 6:50 pm

Automatic question generation for measuring comprehension – some thoughts

In our weekly/fortnightly reading group here, we spent most of the past two months discussing “automatic question generation”. We discussed primarily NLP papers, but included a couple of educational research papers as well. NLP papers usually focus on the engineering aspects of the system and are usually heavy on the computational side. Educational research papers primarily focus on performing user studies with some approach to question creation and then correlating user performance on these questions with text comprehension, so they are usually light on the computational side. That is roughly the difference between the two kinds of articles we chose to read and discuss.

Now, as we progressed with these discussions, and as more people (with diverse backgrounds) joined the group for a couple of sessions, I realized that I am learning to see things from different perspectives. I am now writing this post to summarize what I thought about these articles, what I learnt through these discussions, and what I think about the whole idea of automatic question generation at this point in time. I will give pointers to the relevant articles. Most of them are freely accessible. Leave a comment if you want something you see here and can’t access it. The questions here were of two kinds – factual questions from the text (who did what to whom kind of things) and fill-in-the-blank questions where one of the key words goes missing.

Let me start with a summary of the stuff we discussed in the past few weeks:
a) We first started with the generation of factual questions from any text, i.e., the purpose of the system here is to generate questions like “When was Gandhi born? Where was Gandhi born?” etc., from a biography page on Mahatma Gandhi. Here, we primarily discussed the approach followed by Michael Heilman. More details about the related articles and the released code can be seen here. The primary focus of this approach has been to generate grammatically correct questions.

b) We then moved to more recent work from Microsoft Research, published in 2015, where the task of “generating the questions” is transformed by using crowdsourcing to create question templates. So, the primary problem here is to replicate the human judgements of relevant question templates for a given text, by drawing inferences about the category of the content in a particular section of text through machine learning. (I am trying to summarize in one sentence; anyone wanting to know more should please read the article.) The resource created and the features used to infer the category/section will eventually be released here.

c) At this time, after a slight digression into the cognitive and psycholinguistic aspects of gap-filling probabilities, we got into an article which manually designed a fill-in-the-blank test that allegedly measures reading comprehension. They concluded that such tests are quick to create, take less time to administer, and still do what you want out of such a test (i.e., understand how much the readers understood).

d) Naturally, the next question for us was: “How can we generate the best gaps automatically?”. Among the couple of articles we explored, we again picked an older article from Microsoft Research for discussion. This is about deciding which gaps in a sentence are the best for testing the “key concepts” in a text. Again, the approach relies on crowdsourcing to get these judgements from human raters first, and then develops a machine learning approach to replicate them. The data thus created, and some details about the machine learning implementation, can be found here.
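As an aside, and purely as my own toy illustration (not the approach of any of the papers above): the core of gap selection can be thought of as ranking the candidate words of a sentence by how likely they are to stand for a “key concept”, and blanking the top-ranked one. A minimal sketch, where a made-up salience heuristic stands in for the crowd-sourced judgements and the trained model:

```python
# Toy sketch of gap selection for fill-in-the-blank question generation.
# The scoring heuristic is a made-up stand-in for a model trained on
# crowd-sourced "good gap" judgements; it is not from any of the papers.

STOPWORDS = {"the", "a", "an", "of", "in", "on", "was", "is", "to", "and", "by"}

def candidate_gaps(tokens):
    """Treat every non-stopword token as a candidate gap."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok.lower() not in STOPWORDS]

def gap_score(token):
    """Made-up salience score: longer and capitalized words win."""
    score = len(token)
    if token[0].isupper():
        score += 3  # crude proxy for named entities / key terms
    return score

def best_gap_question(sentence):
    tokens = sentence.rstrip(".").split()
    idx, answer = max(candidate_gaps(tokens), key=lambda c: gap_score(c[1]))
    blanked = " ".join("_____" if i == idx else t for i, t in enumerate(tokens))
    return blanked + ".", answer

question, answer = best_gap_question("Gandhi was born in Porbandar in 1869.")
print(question)  # Gandhi was born in _____ in 1869.
print(answer)    # Porbandar
```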

Now, my thoughts on the topic in general:
a) To be able to generate real “comprehension”-testing questions from any possible text, we should make sure that we do not end up merely testing the reader’s ability to remember the text. I did not get a clear picture of how fill-in-the-blank questions avoid this pitfall. Generating who/what questions instead of fill-in-the-blanks perhaps covers this to some extent. Yet, if these questions only require you to know that one sentence, how are they really measuring comprehension of the whole piece of text, when comprehension can include drawing inferences from multiple parts of the text?

b) One dissatisfying aspect of all these readings has been that people who do user studies don’t talk about the scalability of their method beyond a laboratory setup, and people who engineer technological solutions don’t discuss whether these approaches really work with real users in testing their comprehension. Several NLP papers I read on the topic in the past weeks (apart from those mentioned above) describe question generation approaches and evaluate the correctness or relevance of the generated “questions” (be they gap-filling or questions with a question mark) on some dataset, but I haven’t seen anyone do an evaluation with the possible consumers of such an application. The only exception in my readings has been Michael Heilman’s PhD thesis, where the question generation approach was evaluated as a possible assisting tool for teachers to prepare questions.

On one hand, I think this is a very interesting topic to work on, with all the possible commercial and not-so-commercial real-life impact it can have in these days of massive online education and non-conventional ways of learning. Clearly, there is a lot of work going on on various ways to generate questions automatically, which would be a very useful capability in such massive learning scenarios. We know what approaches “kind of” work and what don’t in generating the questions as such. However, I wonder what exactly we are trying to achieve by not doing the final step of user evaluation with these computational approaches. If we do not know whether all the fancy approaches are really doing what they are supposed to do (testing the comprehension of readers), what is the point? To use a tennis term, the missing “follow through” is the reason much of this work remains unusable for its actual consumers – teachers, learners and other such people in a learning environment. I am not a dreamer, so I know the difficulties in working across groups and I can guess the reasons for the missing “follow through” (especially as someone currently in academia!).

The only way I see the “follow through” being possible is in an ed-tech company, since they have to do the user evaluation to get going 🙂 Perhaps I should wait and see if new ed-tech startups working on learning analytics and measuring learning outcomes can come up with effective solutions. On that optimistic note, I should perhaps end my post for now.

Acknowledgements: I have benefited a lot from the comments on these papers by Magdalena Wolska, Maria Chinkina, Martí Quixal, Xiaobin Chen and Simón Ruiz, who attended some or all of these meetings in the past few months. Long live discussions! 😉

Published on October 22, 2015 at 4:06 pm

Comments on the Editorial of “Machine Learning For Science and Society” issue

For whatever reason, I am more fascinated by the applied aspects of any research, and Machine Learning (ML) is not an exception. While I use machine learning approaches in my work and studied the basics during my master’s (… and on and off during my PhD now), I never found much information on what happens to all the hundreds of new algorithms proposed every year. How many of them actually get used by non-ML researchers working on some other problem? How many of them get used by others who want to solve some real-world problems?

I attended the Machine Learning Summer School in 2013, where, for two weeks, I was fortunate enough to listen to some of the best researchers in the field speak about ML in general and their work in particular. However, I got the feeling that the community is not so keen on a reality check about the applicability of these algorithms. So, basically, the questions remained.

“Machine learning that matters” (Kiri Wagstaff, 2012) is an article I keep thinking about whenever this sort of discussion comes up with fellow grad students. (My thoughts on it here.) In the past few days, there have been a lot of short online/offline discussions about how an effort to do more evaluation on real-world scenarios/datasets is perceived by reviewers at various academic conferences (disclaimer: these discussions are not exclusively about ML, but some of the people in these discussions happen to be grad students working in ML).
We, with our own shortcomings and limitations, drew some conclusions (which are perhaps not of interest to anyone), and I was reminded of another inspiring article that I have thought about several times in the past few months.

The Article: Machine learning for science and society (Editorial)
Authors: Cynthia Rudin and Kiri L. Wagstaff
Details: Machine Learning (2014) 95:1–9
Url here

This article is an editorial for a special issue of the Machine Learning journal called “Machine Learning For Science and Society”. The issue is a collection of research papers that tackle real-life problems, ranging from water pipe condition assessment to online advertising, through ML-based approaches. While I have not yet gone through all the papers in this issue, I think the editorial is worth a read for anyone with even a remote curiosity about the phrase “Machine Learning”.

It discusses the issues that arise when you decide to study the real-life impact of ML: What exactly counts as evaluation from the applied perspective? How much does this evaluation differ based on the application domain? How do domain experts see ML – do they look for a great model, or for a good model that is interpretable? How does the ML community see such research? What is ML good for? What is the need for this special focused issue at all? And so on.

I will not go on and on like this, but I would like to quote a few things from the paper, hoping it’s not a copyright violation.

The abstract:

“The special issue on “Machine Learning for Science and Society” showcases machine learning work with influence on our current and future society. These papers address several key problems such as how we perform repairs on critical infrastructure, how we predict severe weather and aviation turbulence, how we conduct tax audits, whether we can detect privacy breaches in access to healthcare data, and how we link individuals across census data sets for new insights into population changes. In this introduction, we discuss the need for such a special issue within the context of our field and its relationship to the broader world. In the era of “big data,” there is a need for machine learning to address important large-scale applied problems, yet it is difficult to find top venues in machine learning where such work is encouraged. We discuss the ramifications of this contradictory situation and encourage further discussion on the best strategy that we as a field may adopt. We also summarize key lessons learned from individual papers in the special issue so that the community as a whole can benefit.”

Then, the four points starting from: “If applied research is not considered publishable in top ML venues, our field faces the following disadvantages:”

1. “We lose the flow of applied problems necessary for stimulating relevant theoretical work ….”
2. “We further exacerbate the gap between theoretical work and practice. …”
3. “We may prevent truly new applications of ML to be published in top venues at all (ML or not). …”
4. “We strongly discourage applied research by machine learning professionals. … “

(Read the relevant section in the paper for details.)

The paragraph that followed, where examples of a few applications of ML were mentioned:

“The editors of this special issue have worked on both theoretical and applied topics, where the applied topics between us include criminology (Wang et al. 2013), crop yield prediction (Wagstaff et al. 2008), the energy grid (Rudin et al. 2010, 2012), healthcare (Letham et al. 2013b; McCormick et al. 2012), information retrieval (Letham et al. 2013a), interpretable models (Letham et al. 2013b; McCormick et al. 2012; Ustun et al. 2013), robotic space exploration (Castano et al. 2007; Wagstaff and Bornstein 2009; Wagstaff et al. 2013b), and scientific discovery (Wagstaff et al. 2013a).”

Last, but not least, the comments on interdisciplinary research had such resounding truth in them that I put the quote up in my room, and a few others did the same in the interdisciplinary grad school I am a part of. 🙂

“..for a true interdisciplinary collaboration, both sides need to understand each other’s specialized terminology and together develop the definition of success for the project. We ourselves must be willing to acquire at least apprentice-level expertise in the domain at hand to develop the data and knowledge discovery process necessary for achieving success. ”

This has been one of those articles which I thought about again and again… and kept recommending to people working in areas as diverse as psychology, sociology and computer science, and even to people who are not into academic research at all! 🙂 (I wonder what these people think of me for sending them this “seemingly unrelated” article to read, though.)

*****
P.S.: It so happens that an ML article inspired me to write this post. But, on a personal front, the questions posed in the first paragraph remain the same even for my own field of research – Computational Linguistics – and perhaps for any other field too.

P.S. 2: This does not mean I have some fantastic solution to solve the dilemmas of all senior researchers and grad students who are into inter-disciplinary and/or applied research and at the same time don’t want to perish since they can’t publish in the conferences/journals of their main field.

Published on July 8, 2014 at 3:15 pm

Notes from EACL2014

(This is a note-taking post. It may not be of particular interest to anyone.)

***

I was at EACL 2014 this week, in Gothenburg, Sweden. I have yet to give a detailed reading to most of the papers that interested me, but I thought it’s a good idea to list them down.

I attended the PITR workshop and noticed that there were more interested people, both among the authors and in the audience, compared to last year. Despite the inconclusive panel discussion, I found the whole event interesting and stimulating, primarily because of the diversity of topics presented. There seems to be an increasing interest in performing eye-tracking experiments for this task. Some papers that particularly interested me:

One Step Closer to Automatic Evaluation of Text Simplification Systems by Sanja Štajner, Ruslan Mitkov and Horacio Saggion

An eye-tracking evaluation of some parser complexity metrics – Matthew J. Green

Syntactic Sentence Simplification for French – Laetitia Brouwers, Delphine Bernhard, Anne-Laure Ligozat and Thomas Francois

An Open Corpus of Everyday Documents for Simplification Tasks – David Pellow and Maxine Eskenazi

An evaluation of syntactic simplification rules for people with autism – Richard Evans, Constantin Orasan and Iustin Dornescu

(If anyone has read this far and is interested in any of these papers, they are all open access and can be found online by searching for the title.)

 

Moving on to the main conference papers, I am listing here everything that piqued my interest, from papers I know only by title for the moment to those for which I heard the authors talk about the work.

Parsing, Machine Translation etc.,

* Is Machine Translation Getting Better over Time? – Yvette Graham; Timothy Baldwin; Alistair Moffat; Justin Zobel

* Improving Dependency Parsers using Combinatory Categorial Grammar-Bharat Ram Ambati; Tejaswini Deoskar; Mark Steedman

* Generalizing a Strongly Lexicalized Parser using Unlabeled Data- Tejaswini Deoskar; Christos Christodoulopoulos; Alexandra Birch; Mark Steedman

* Special Techniques for Constituent Parsing of Morphologically Rich Languages – Zsolt Szántó; Richárd Farkas

* The New Thot Toolkit for Fully-Automatic and Interactive Statistical Machine Translation- Daniel Ortiz-Martínez; Francisco Casacuberta

* Joint Morphological and Syntactic Analysis for Richly Inflected Languages – Bernd Bohnet, Joakim Nivre, Igor Bogulavsky, Richard Farkas, Filip Ginter and Jan Hajic

* Fast and Accurate Unlexicalized parsing via Structural Annotations – Maximilian Schlund, Michael Luttenberger and Javier Esparza

Information Retrieval, Extraction stuff:

* Temporal Text Ranking and Automatic Dating of Text – Vlad Niculae; Marcos Zampieri; Liviu Dinu; Alina Maria Ciobanu

* Easy Web Search Results Clustering: When Baselines Can Reach State-of-the-Art Algorithms – Jose G. Moreno; Gaël Dias

Others:

* Now We Stronger than Ever: African-American English Syntax in Twitter- Ian Stewart

* Chinese Native Language Identification – Shervin Malmasi and Mark Dras

* Data-driven language transfer hypotheses – Ben Swanson and Eugene Charniak

* Enhancing Authorship Attribution by utilizing syntax tree profiles – Michael Tschuggnall and Günter Specht

* Machine reading tea leaves: Automatically Evaluating Topic Coherence and Topic model quality by Jey Han Lau, David Newman and Timothy Baldwin

* Identifying fake Amazon reviews as learning from crowds – Tommaso Fornaciari and Massimo Poesio

* Using idiolects and sociolects to improve word predictions – Wessel Stoop and Antal van den Bosch

* Expanding the range of automatic emotion detection in microblogging text – Jasy Suet Yan Liew

* Answering List Questions using Web as Corpus – Patricia Gonçalves; Antonio Branco

* Modeling unexpectedness for irony detection in twitter – Francesco Barbieri and Horacio Saggion

* SPARSAR: An Expressive Poetry reader – Rodolfo Delmonte and Anton Maria Prati

* Redundancy detection in ESL writings – Huichao Xue and Rebecca Hwa

* Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules – Advaith Siddharthan and Angrosh Mandya

* Verbose, Laconic or Just Right: A Simple Computational Model of Content Appropriateness under length constraints – Annie Louis and Ani Nenkova

* Automatic Detection and Language Identification of Multilingual Document – Marco Lui, Jey Han Lau and Timothy Baldwin

Now, in the coming days, I should at least try to read the intros and conclusions of some of these papers. 🙂

Published on May 2, 2014 at 3:10 pm

“Linguistically Naive != Language Independent” and my soliloquy

This post is about a paper that I read today (which inspired me to write a real blog post after months!)

The paper: Linguistically Naive != Language Independent: Why NLP Needs Linguistic Typology
Author: Emily Bender
Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, pages 26–32. ACL.

In short, this is a position paper that argues that incorporating linguistic knowledge is a must if we want to create truly language-independent NLP systems. Now, on the surface, that looks like a contradictory statement. Well, it isn’t… and it is common sense, in… er… some sense 😉

So, time for some background: an NLP algorithm that offers a solution to some problem is called language independent if the approach can work for other languages apart from the language for which it was initially developed. One common example is Google Translate. It is a practical example of how an approach can work across multiple language pairs (with varying efficiency of course, but that is a different matter). The point of these language-independent approaches is that, in theory, you can just apply the algorithm to any language as long as you have the relevant data about that language. However, such approaches in contemporary research typically eliminate any linguistic knowledge from their modeling and thereby make it “language” independent.

Now, what the paper argues for is clear from the title – “linguistically naive != language independent”.

I liked the point made in Section 2, that in some cases the surface appearance of language independence is actually a hidden language dependence. The specific example of n-grams – how efficiently they work, albeit for languages with certain kinds of properties, while being claimed to be language independent – nailed down the point. Over time, I became averse to the idea of using n-grams for each and every problem, as I thought this gives no useful insights, either from a linguistic or from a computational perspective (this is my personal opinion). However, although I did think about this language-dependent aspect of n-grams, I never clearly put it this way and just accepted the “language independence” claim. Now, this paper changed that acceptance. 🙂
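To make that point concrete for myself (a toy illustration of my own, not an example from the paper): even something as seemingly knowledge-free as word bigram counting silently assumes that words are separated by whitespace and that a surface word form is a meaningful unit – assumptions that hold reasonably well for English but not, say, for unsegmented scripts or highly agglutinative languages.

```python
from collections import Counter

def word_bigrams(text):
    # Implicit assumptions: tokens are separated by whitespace and each
    # surface word form is a meaningful unit. Reasonable for English,
    # much less so for unsegmented or morphologically rich languages.
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

print(word_bigrams("the cat sat on the mat"))

# For a language written without spaces, split() returns the whole sentence
# as a single "word" and the bigram model quietly degenerates to nothing --
# the language dependence is hidden inside the tokenization step.
print(word_bigrams("我喜欢自然语言处理"))  # Counter()
```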

One good thing about this paper is that it does not stop there. It also discusses approaches that use language modeling but do slightly more than n-grams to accommodate various types of languages (factored language models), and it talks about how a “one size fits all” approach won’t work. There is this gem of a statement:

“A truly language independent system works equally well across languages. When a system that is meant to be language independent does not in fact work equally well across languages, it is likely because something about the system design is making implicit assumptions about language structure. These assumptions are typically the result of “overfitting” to the original development language(s).”

Now, there is also a section on language independence claims and the representation of languages belonging to various families in the papers of ACL 2008. It concludes by saying:
“Nonetheless, to the extent that language independence is an important goal, the field needs to improve both its testing of language independence and its sampling of languages to test against.”

Finally, the paper talks about one form of linguistic knowledge that can be incorporated into NLP systems – linguistic typology – and gives pointers to some useful resources and relevant research in this direction.

And I too conclude the post with the two main points that I hope the research community takes note of:

(1) “This paper has briefly argued that the best way to create language-independent systems is to include linguistic knowledge, specifically knowledge about the ways in which languages vary in their structure. Only by doing so can we ensure that our systems are not overfitted to the development languages.”

(2) “Finally, if the field as a whole values language independence as a property of NLP systems, then we should ensure that the languages we select to use in evaluations are representative of both the language types and language families we are interested in.”

Good paper and a considerable amount of food for thought! These are important design considerations, IMHO.

The extended epilogue:

At NAACL-2012, there was this tutorial titled “100 Things You Always Wanted to Know about Linguistics But Were Afraid to Ask“, by Emily Bender. At that time, although I in theory could have attended the conference, I could not, as I had to go to India. But, this was one tutorial that caught my attention with its name and description and I really wanted to attend it.

Thanks to a colleague who attended, I managed to see the slides of the tutorial (which I later saw on the professor’s website). Last week, during some random surfing, I realized that an elaborate version was released as a book:

Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax
by Emily Bender
Pub: Synthesis Lectures on Human Language Technologies, Morgan and Claypool Publishers

I happily borrowed the book via inter-library loan, and it traveled for a few days to reach me, from somewhere in Lower Saxony to here in Baden-Württemberg. Just imagine, it travelled all the way just for my sake! 😉 😛

So, I started to go through the book. Even in the days when I lacked any basic knowledge of this field, I always felt that natural language processing should involve some form of linguistic modeling by default. However, most of the successful so-called “language independent” approaches (some of which also became products we use regularly, like Google Translate and Transliterate) never speak about such linguistic modeling (at least, not many that I read do).

There is also this Norvig vs Chomsky debate, which I keep getting reminded of when I think of this topic. (Neither of them is wrong in my view, but that is not the point here.)

In this context, I found the paper particularly worth sharing. Anyway, I perhaps should end the post. While reading the introductory parts of Emily Bender’s book, I found a reference to the paper, and this blog post came out of that reading experience.

Published on January 23, 2014 at 5:04 pm

MLSS 2013 – Week 1 recap

I am attending this year’s Machine Learning Summer School and we just finished one week of lectures. I thought now is the moment to look back and note down my thoughts (mainly because we thankfully don’t have lectures on Sundays!). One more week to go, and I am already very glad that I am here listening to all these amazing people who are undoubtedly some of the best researchers in this area. There is also a very vibrant and smart student community.

Until Saturday evening, my thoughts on the summer school focused more on the content of the sessions: the mathematics in them, my comfort and discomfort with it, its relevance, understanding its conceptual basis, etc. I won’t claim that I understood everything. I understood some talks better, some talks not at all. I also understood that things could have been much better for me if we had been told why we actually needed to seriously follow all those Engineering Mathematics courses during my bachelor’s ;).

However, coming to the point: as I listened to the Multilayer Nets lecture by Leon Bottou on Saturday afternoon, there was something that I found particularly striking. It looks like two things that I always thought of as interesting aspects of Machine Learning are not really on the radar of the machine learning community. (Okay, one summer school is not a whole community, but I did meet some people who have been in this field of research for years now.)

1) What exactly are you giving as input for the machine to learn? Shouldn’t we give the machine proper input for it to learn what we expect it to learn?

2) Why isn’t the interpretability of a model an issue worth researching about?

Let me elaborate on these.

Coming to the first one, this is called “Feature Engineering”. The answer that I heard from one senior researcher for this question was: “We build algorithms that will enable the machine to learn from anything. Features are not our problem. The machine will figure that out.” But won’t the machine need the right ecosystem for that? If I grow up in a Telugu-speaking household and get exposed to Telugu input all the time, will I be expected to learn Telugu or Chinese? Likewise, if we want to construct a model that does a specific task, is it not our responsibility to prepare the input for it? Okay, we can build systems that figure out the features that work by themselves. But won’t that make the machine learn anything from the several possible problem subspaces, instead of the specific issue we want it to learn? Yes, there are always ways to assess whether it is learning the right thing. But that’s not the point. In a way, this connects again to the second question.

I am not knowledgeable enough in this field to come up with a well-argued response to that comment by the senior researcher. The fact is also that there is enough evidence that the approach does work in some scenarios. But this is a general question about the applicability of the models, issues regarding domain adaptation if any, etc. I found so little literature on theoretical aspects connecting feature engineering to algorithm design, and hence these basic doubts.

The second question is also something that I have been thinking about for a long time now. Are people really not bothered about how those who apply Machine Learning in their fields interpret their models or am I bad at searching for the right things? Why is there no talk about the interpretability of models? I did find a small amount of literature on “Human comprehensible machine learning” and related research, but not much.

I am still in the process of thinking, reading and understanding more on this topic. I will perhaps write another detailed post soon (with whatever limited awareness I have of this topic). But, in the meanwhile:

* Here is a blog post by a grad student that has some valid points on the interpretability of models.

* “Machine Learning that matters”, ICML 2012 position paper by Kiri Wagstaff. This is something that I keep getting back to time and again, whenever I get into thinking about these topics. Not that the paper answers my questions… it keeps me motivated to think about them.

* An older blogpost on the above paper which had some good discussion in the comments section.

With these thoughts, we march towards the second week of awesomeness at MLSS 2013 :-).

Published on September 1, 2013 at 3:31 pm

Notes from ACL

This is the kind of post that would not interest anyone except me, perhaps. I was at ACL (in a very interesting city called Sofia, the capital of Bulgaria) last week and I am still in the process of making notes on the papers that interested me, abstracts that raised my curiosity, short- and long-term interest topics, etc. I thought it’s probably a better idea to arrange at least the titles into some subgroups and save them somewhere, so that it would be easy for me to get back to them later. I did not read all of them completely. In fact, for a few of them, I did not even go beyond the abstract. So, don’t ask me questions. Anyone who is interested in any of these titles can either read them by googling for them or visit the ACL anthology page for ACL’13 and find the pdfs there.

The first two sections below are my current topics of interest. The third one is a general topic of interest. The fourth one includes everything else that piqued my interest. The fifth section is on teaching CL/NLP, which is also a long-term interest of mine. The final section is about whole workshops that I have an interest in.

*****

Various dimensions of the notion of text difficulty, readability
* Automatically predicting sentence translation difficulty – Mishra and Bhattacharya
* Automatic detection of deception in child produced speech using syntactic complexity features – Yancheva and Rudzicz
* Simple, readable sub sentences – Klerke and Sogaard
* Improving text simplification language modeling using Unsimplified text data – Kauchak
* Typesetting for improved readability using lexical and syntactic information – Salama et.al.
* What makes writing great?: First experiments on Article quality prediction in the science journalism domain, Louis and Nenkova
* Word surprisal predicts N400 amplitude during reading – Frank et.al.
* An analysis of memory based processing costs using incremental deep syntactic dependency parsing – Schjindel et.al.

Language Learning, Assessment etc.
* Discriminative Approach to fill-in-the-blank quiz generation for language learners
* Modeling child divergences from Adult Grammar with Automatic Error Correction
* Automated collocation suggestion for Japanese second language learners
* Reconstructing an Indo-European family tree from non-native English texts
* Word association profiles and their use for automated scoring of essays -Klebanov and Flor.
* Grammatical error correction using Integer Linear programming
* A learner corpus based approach to verb suggestion for ESL
* Modeling thesis clarity in student essays – Persing & Ng
* Computerized analysis of a verbal fluency test – Szumlanski et.al.
* Exploring word class n-grams to measure language development in children. Ramirez-de-la-Rosa et.al.

NLP for other languages:
* Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison – Esmaili and Salavati
* Identifying English and Hungarian light verb constructions: A contrastive approach – Vincze et.al
* Real-world semi-supervised learning of POS taggers for low-resource languages -Garrette et.al.
* Learning to lemmatize Polish noun phrases – Radziszewski
* Sentence level dialect identification in Arabic – Elfardy and Diab

Others:
* Exploring Word Order Universals: a probabilistic graphical model approach – Xia Lu.
* An opensource toolkit for quantitative historical linguists
* SORT: An improved source rewriting tool for improved translation
* unsupervised consonant-vowel prediction over hundreds of languages
* Linguistic models for analyzing and detecting biased language.
* Earlier identification of Epilepsy surgery candidates using natural language processing – Matykiewicz et.al.
* Parallels between linguistics and biology. Chakraborti and Tendulkar
* Analysing lexical consistency in translation – Guillou
* Associative texture is lost in translation – Klebanov and Flor

Teaching CL, NLP:
* Artificial IntelliDance: Teaching Machine learning through choreography, Agarwal and Trainor
* Treebanking for data-driven research in the classroom, Lee et.al.
* Learning computational linguistics through NLP evaluation events: the experience of Russian evaluation initiative. Bonch-Osmolovskaya et.al.
* Teaching the basics of NLP and ML in an introductory course to Information Science. Agarwal.

whole workshops and competitions:
* Shared task on quality estimation in Machine translation
* Predicting and improving textual readability for target reader populations (PITR 2013)

Published on August 14, 2013 at 9:09 am

Automated Grading and Counter Arguments-4 (last)

(Continued from Part-3)
***

Finding 9: machine scoring shows a bias against second-language writers (Chen & Cheng, 2008) and minority writers such as Hispanics and African Americans (Elliot, Deess, Rudniy, & Joshi, 2012)

Report-1: The best part about this report is the stance it takes. It immediately got me interested in it.

“Given the fact that many AWE programs have already been in use and involve multiple stakeholders, a blanket rejection of these products may not be a viable, practical stand.

..
A more pressing question, accordingly, is probably not whether AWE should be used but how this new technology can be used to achieve more desirable learning outcomes while avoiding potential harms that may result from limitations inherent in the technology.”

-This has been exactly my problem with the statements on the humanreaders.org website.

The study primarily supports the idea of integrating human and machine assessments, by taking advantage of good things in both, as mentioned below:

“The AWE implementation was viewed comparatively more favorably when the program was used to facilitate students’ early drafting and revising process, and when the teacher made a policy of asking students to meet a preliminary required standard and subsequently provided human feedback. The integration of automated assessment and human assessment for formative learning offers three advantages….”

At least I did not find a direct mention of a “bias” against second-language writers in this report! We need to stretch our imagination a bit to reach that conclusion!

Report-2: The second report was already mentioned under Finding 7. As before, I did not find a direct relevance of its results to this “finding”. However, I see the point in raising this issue. But what I don’t understand is that this is just like some American coming and correcting Indian English 😛 So, this kind of “bias” can exist in humans as well. What, really, is the way to handle this, manually or automatically? This does not make the case favourable to human assessment (IMHO).

Finding 10: for all these reasons, machine scores predict future academic success abysmally (Mattern & Packman, 2009; Matzen & Hoyt, 2004; Ramineni & Williamson, 2013 – not freely accessible)

- I actually did not go through these references beyond their intro and conclusion sections, as I felt the “finding” is too much of a blanket statement to be connected to them. Skimming through the two freely accessible reports among these confirmed my suspicion. These reports focus more on a critical analysis of automated systems, suggesting ways to improve them and to combine them with other things… and not on saying that “machine scores predict future academic success abysmally”.

Part 2 of these findings was focused on the statement: “machine scoring does not measure, and therefore does not promote, authentic acts of writing”.

While I intend to stop here (partly because it’s so time-consuming to go through so many reports and partly because of a growing feeling of irritation with these claims), some of these part-2 findings made me think, some made me smile, and some made me fire a question back…

Those that made me think:
“students who know that they are writing only for a machine may be tempted to turn their writing into a game, trying to fool the machine into producing a higher score, which is easily done ”
-This is something that recurs in my thoughts each time I think of automated assessment.. and I know it is not super-difficult to fool the machine in some cases.

“as a result, the machine grading of high-stakes writing assessments seriously degrades instruction in writing (Perelman, 2012a), since teachers have strong incentives to train students in the writing of long verbose prose, the memorization of lists of lengthy and rarely used words, the fabrication rather than the researching of supporting information, in short, to dumb down student writing.”
– This part is actually the main issue. Rather than focusing on just making blanket claims and pushing everything aside, humanreaders.org or any such initiatives should understand the inevitability of automated assessment and focus on how to combine it with human evaluation and other means better!

Those that rather made me smile, although I can’t brush these things aside as impossible:

“students are subjected to a high-stakes response to their writing by a device that, in fact, cannot read, as even testing firms admit.”


“in machine-scored testing, often students falsely assume that their writing samples will be read by humans with a human’s insightful understanding”


“teachers are coerced into teaching the writing traits that they know the machine will count .. .. and into not teaching the major traits of successful writing.. .. ”

– No I don’t have any specific comments on these. But, in parts, this is very imaginative…and in parts, it is not entirely impossible.

Those that raised new questions:
“conversely, students who knowingly write for a machine are placed in a bind since they cannot know what qualities of writing the machine will react to positively or negatively, the specific algorithms being closely guarded secrets of the testing firms (Frank, 1992; Rubin & O’Looney, 1990)—a bind made worse when their essay will be rated by both a human and a machine”
-As if we know what humans expect! I thought I wrote a very good essay on Sri Sri for my SSC exam’s question “my favourite poet”… and I got lower marks (compared to my past performances) in Telugu, of all things. Apparently, the teachers in school and those external examiners did not think alike! (Or probably the examiner was a Sri Sri hater!)

“machines also cannot measure authentic audience awareness”
– Who can? Can humans do that with fellow humans? I don’t think so. I know there are people who think I am dumb. There are also those who think I am smart. There are also those who think I am mediocre. Who is right?

Conclusion of this series:

Although I did not do a real research-like reading and this is not some peer-reviewed article series, I spent some time doing this and it has been an enriching experience in terms of the insights it provided on the field of automated assessment and its criticisms.

What I learnt boils down to two things:
* Automated assessment is necessary and cannot be avoided in the future (among several reasons, because of the sheer number of students compared to the number of trained evaluators).
* Overcoming the flaws of automated assessment, and finding efficient ways to combine it with trained human evaluators, is more important, realistic and challenging than just branding everything as rubbish.

Although the above two are rather obvious, the humanreaders.org “findings” and the way such “findings” immediately grab media attention convinced me of the above two things more than before! 🙂

Published on May 4, 2013 at 12:40 pm

Automated Grading and Counter Arguments-3

(Continued from part-2. All parts can be seen here)
***


Finding 5:
machines require artificial essays finished within very short time frames (20-45 minutes) on topics of which student writers have no prior knowledge (Bridgeman, Trapani, & Yigal, 2012; Cindy, 2007; Jones, 2006; Perelman, 2012b; Streeter, Psotka, Laham, & MacCuish, 2002; Wang, & Brown, 2008; Wohlpart, Lindsey, & Rademacher, 2008)

The first, second, third, fourth and seventh references were not freely accessible.

Fifth report (just search for the title in Google): The primary conclusion of this report is: “The automated grading software performed as well as the better instructors in both trials, and well enough to be usefully applied to military instruction. The lower reliabilities observed in these essay sets reflect different instructors applying different criteria in grading these assignments. The statistical models that are created to do automated grading are also limited by the variability in the human grades.”

6th Report
This report studied the correlation between the machines and human raters and concluded, contrary to all previous studies, that there is no significant correlation between them.

While this was an interesting conclusion, I had a few questions which I could not clarify better as those references couldn’t be found online (for free). Here are the questions I have.

1. I think all these automated scoring systems work well within their framework. For instance, GRE exam grading works for GRE-like essays, but not for, say, scoring what I write on my blog. But they are all really customizable to such target-specific responses (the NYT article on EdX, which began this series of blog posts from me, makes that clear: it needs 100 sample essays and it will learn how to score essays from the 101st on, to put it simply). So, is it fair to test IntelliMetric, or whatever system, on essays that are about something else… and from another domain?

2. As far as I remember, all those other reports used Pearson’s correlation. This report used Spearman’s correlation. Can we directly compare the numbers we get? Why can’t both be included?
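(For my own clarity: Pearson measures linear association between the raw scores, while Spearman is essentially Pearson computed on the ranks, so the two numbers answer slightly different questions and are not directly interchangeable. A tiny sketch with made-up human/machine scores, just to show both on the same data:)

```python
# Made-up human and machine essay scores, only to illustrate that Pearson
# and Spearman can differ on the same data (Spearman is Pearson on ranks).
from scipy.stats import pearsonr, spearmanr

human   = [2, 3, 3, 4, 5, 5, 6]
machine = [2, 2, 4, 4, 4, 6, 6]

r_pearson, _ = pearsonr(human, machine)
r_spearman, _ = spearmanr(human, machine)
print(f"Pearson r    = {r_pearson:.3f}")
print(f"Spearman rho = {r_spearman:.3f}")
```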

-So far, except for the 4th one, the others seemed to be more about how well the automated scores approximate human judgements (going by the abstracts, which are freely available to read!!).

Finding 6: in these short trivial essays, mere length becomes a major determinant of score by both human and machine graders (Chodorow & Burstein, 2004; Perelman, 2012b)

First Report:
I actually found this report to be a very interesting read, for several reasons. Some of them are mentioned below:

“Since February 1999, ETS has used e-rater as one of the two initial readers for the GMAT writing assessments, and, in this capacity, it has scored more than 1 million essays. E-rater’s scores either match or are within one point of the human reader scores about 96% of the time”
-I thought this was amazing performance (especially such a high percentage even after grading a million essays or more!). I also liked the part where they compare the machine-human agreement with human-human agreement, which is why I again think that no single entity should be the sole scorer (there should be either two humans, two differently built machines, or a machine-human combination).
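(To spell out for myself what “match or within one point” means: exact agreement is the fraction of essays where the two raters give identical scores, and adjacent agreement also allows a one-point difference. A tiny sketch with invented scores:)

```python
# Invented human and machine scores on a 1-6 scale, just to spell out the
# exact vs. adjacent (within one point) agreement rates quoted above.
human   = [4, 3, 5, 2, 6, 4, 3, 5]
machine = [4, 4, 5, 2, 5, 3, 3, 6]

pairs = list(zip(human, machine))
exact    = sum(h == m for h, m in pairs) / len(pairs)
adjacent = sum(abs(h - m) <= 1 for h, m in pairs) / len(pairs)

print(f"exact agreement:    {exact:.0%}")    # identical scores
print(f"adjacent agreement: {adjacent:.0%}")  # within one point
```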

While I am not completely comfortable with an automated system being the sole scorer in the low-stakes high school exams etc, I have the same feeling about human evaluators too. I think bias can exist both in a human and in a machine (in a machine, because its intelligence is limited. It learns only what it sees). I guess at that level, you always have an option to go back and demand a re-evaluation?

The primary conclusion of this report was: “In practical terms, e-rater01 differs from human readers by only a very small amount in exact agreement, and it is indistinguishable from human readers in adjacent agreement. But despite these similarities, human readers and e-rater are not the same. When length is removed, human readers share more variance than e-rater01 shares with HR.”
-I think this is what humanreaders.org used to come up with this finding. I do not know how to interpret this particular result, although it disappoints me a bit. Yet, the ETS report made a very interesting read.
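(My reading of “when length is removed” is a partial correlation: correlate human and machine scores after factoring out what each shares with essay length. A rough sketch with invented numbers; only the partial correlation formula itself is standard:)

```python
# Rough sketch of "agreement with length removed": the partial correlation
# between human (H) and machine (M) scores, controlling for essay length (L).
# The scores and lengths are invented; the formula is the standard one:
#   r_HM.L = (r_HM - r_HL * r_ML) / sqrt((1 - r_HL^2) * (1 - r_ML^2))
from math import sqrt
from scipy.stats import pearsonr

human   = [2, 3, 3, 4, 4, 5, 6]
machine = [2, 3, 4, 3, 5, 5, 6]
length  = [120, 150, 180, 200, 260, 300, 340]  # words per essay

r_hm = pearsonr(human, machine)[0]
r_hl = pearsonr(human, length)[0]
r_ml = pearsonr(machine, length)[0]

r_hm_given_l = (r_hm - r_hl * r_ml) / sqrt((1 - r_hl**2) * (1 - r_ml**2))
print(f"raw human-machine correlation:             {r_hm:.3f}")
print(f"human-machine correlation, length removed: {r_hm_given_l:.3f}")
```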

A second conclusion that this “finding” does not focus on, but which I found interesting, was the effect of the test taker’s native language on the overall scores: “There is a main effect for language in the data of the current study. The mixed and Spanish groups have higher scores on average than the Arabic and Japanese. These differences remain even when length is removed. For most of the prompts, e-rater01 shows the same pattern of differences across native language groups as HR, even with length differences partialed out. Future work should include additional language groups to see if these results generalize.”
– After the recent Native Language Identification shared task, I got really curious on this very topic – the interaction between the native language and the scores that the test takers get. I think this is something that I might need to study further, irrespective of the disappointments during these readings!

Again, only the first report (from ETS) is freely accessible. If the purpose of humanreaders.org was to actually convey their message to a “curious” commoner, I think they are repeatedly failing in this by providing such inaccessible references. :-(….

Finding 7: machines are not able to approximate human scores for essays that do fit real-world writing conditions; instead, machines fail badly in rating essays written in these situations (Bridgeman, Trapani, & Yigal, 2012; Cindy, 2007; Condon, 2013; Elliot, Deess, Rudniy, & Joshi, 2012; Jones, 2006; Perelman, 2012b; Powers, Burstein, Chodorow, Fowles, & Kukich, 2002; Streeter, Psotka, Laham, & MacCuish, 2002; Wang & Brown, 2008; Wohlpart, Lindsey, & Rademacher, 2008)

The first, second, third, fifth, sixth, seventh and tenth reports are not freely accessible (although looking at the abstracts, I did not get the impression that these reports are against automated scoring in any way!).

A version of the 4th report is available. Even this did not give me the impression that it has anything against automated scoring as such… although there is a critical analysis of where automated scoring fails, and it ends with a call for better systems.

(The 8th and 9th reports were already discussed under Finding 5.)

-Again, I understand that this is an issue … but I don’t understand how just avoiding automated scoring will solve it.

Finding 8: high correlations between human scores and machine scores reported by testing firms are achieved, in part, when the testing firms train the humans to read like the machine, for instance, by directing the humans to disregard the truth or accuracy of assertions (Perelman, 2012b), and by requiring both machines and humans to use scoring scales of extreme simplicity

- Like I said before, I could not find a freely accessible page for (Perelman, 2012b). It would have been great if some freely accessible report had been provided as additional reading here… because this claim/finding is among the most intriguing ones. First, they say that the machines are bad. Now, they say even the humans are bad! Should we be evaluated at all or not? 😛

Conclusion for this part:
… And so, although it’s getting more and more disappointing… I decided to continue with these reports because, even if the analysis and the claims (of humanreaders.org) are flawed in my view, I find the “findings” (or whatever they are) worth pondering. There may not be a real solution. But some of these issues need some thinking on the part of people who are capable of doing it better (e.g., teachers and researchers in education, researchers in automated scoring).

(to be continued..)

Published on April 27, 2013 at 1:31 pm

Automated grading and counter arguments-2

(Continued from part-1)
***

So, I started to read their research findings.
(It would have been nice if they also gave online links to those references. Nevertheless, it was not very difficult to google most of them).

Finding 1: computer algorithms cannot recognize the most important qualities of good writing, such as truthfulness, tone, complex organization, logical thinking, or ideas new and germane to the topic (Byrne, Tang, Truduc, & Tang, 2010)

(The report referred to here can be read here.)
In this, after a brief discussion of the development of eGrader, an automatic grading system, they conclude with the following words, after deciding not to use it for automated grading:
“The machine reader appears to penalize those students we want to nurture, those who think and write in original or different ways. For us, the subjective element which was as important as the objective aspects of the essays, proved too complex to measure.”

-It is indeed a matter of concern (the above statement)… but I wondered whether there really was sufficient evidence (at least in the referenced report) to extrapolate this observation, from a small sample of 33 students, into Finding 1.

10 students in this report came back to the human evaluators for a regrading and got better grades after that. But won’t this sort of disagreement also happen if there are human evaluators instead of machines? Can we be sure that two humans always give the same score to a document without disagreement? What will we do when there is a disagreement between two human scores? Maybe we call a third evaluator? Isn’t that what they do with machines too now?

Also, I guess automated scoring, wherever it is used in a high-stakes assessment scenario (like grading in an entrance exam or in some competitive proficiency test), is not used as the sole decider. It seems to be coupled with at least one human judge (both GRE and GMAT automated assessment have at least one human judge, and a second one will be called in if there is a disagreement between the human and machine scores).
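(A minimal sketch of how I understand that kind of resolution protocol in general; the one-point threshold and the exact rules below are my own assumptions, not the actual GRE/GMAT procedure:)

```python
def final_score(machine_score, human_score, second_human=None, threshold=1):
    """Toy adjudication rule: if machine and human agree within `threshold`,
    average them; otherwise a second human is called in and the two human
    scores decide. Threshold and averaging are assumptions for illustration,
    not the actual GRE/GMAT procedure."""
    if abs(machine_score - human_score) <= threshold:
        return (machine_score + human_score) / 2
    if second_human is None:
        raise ValueError("Disagreement: a second human rating is needed.")
    return (human_score + second_human) / 2

print(final_score(4, 5))     # 4.5 -- within one point, no adjudication needed
print(final_score(2, 5, 5))  # 5.0 -- second human resolves the disagreement
```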

So, what I understood: “finding 1” might still be true… but the document referred to there does not really support that statement. It’s more like a “null hypothesis” than a “finding”.

Finding 2. to measure important writing skills, machines use algorithms that are so reductive as to be absurd: sophistication of vocabulary is reduced to the average length or relative infrequency of words, or development of ideas is reduced to average sentences per paragraph (Perelman, 2012b; Quinlan, Higgins, & Wolff, 2009)

(I could not trace a freely accessible link to the first reference. The second one is here.)

It’s amazing that the same system that uses so many different measures to estimate various aspects of the language quality of an essay (see the last 4 pages of the pdf above) uses relatively “surface” measures for lexical sophistication and development (no, I don’t have any better ways to measure them – don’t ask me such questions!!). However, I think the “prompt-specific vocabulary usage” description at the end actually handles the sophistication-of-vocabulary part to some extent. And there is always at least one human grader to identify things like “out of the box” thinking or novel word usages that are relevant and also competent. These automatic assessment systems don’t seem to be the sole decision makers anyway!

So, I don’t understand the issue, again. Further, I am amazed that so many other positive things from this second report were completely ignored, and a “finding 2” was finalized by skipping all of those!
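(To make concrete what such “surface” proxies look like, here is a minimal sketch of the kind of measures Finding 2 complains about; the formulas are generic guesses of my own, not the ones e-rater or any other particular system actually uses:)

```python
# Minimal versions of the "surface" proxies mentioned in Finding 2:
# average word length as a proxy for lexical sophistication, and average
# sentences per paragraph as a proxy for development of ideas.
# These are generic guesses, not the formulas of any actual scoring engine.
import re

def avg_word_length(text):
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) for w in words) / len(words)

def avg_sentences_per_paragraph(text):
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    counts = [len(re.findall(r"[.!?]+", p)) or 1 for p in paragraphs]
    return sum(counts) / len(paragraphs)

essay = "This is a short essay. It has two sentences.\n\nA second paragraph follows here."
print(f"avg word length:             {avg_word_length(essay):.2f}")
print(f"avg sentences per paragraph: {avg_sentences_per_paragraph(essay):.2f}")
```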

Finding 3 machines over-emphasize grammatical and stylistic errors (Cheville, 2004) yet miss or misidentify such errors at intolerable rates (Herrington & Moran, 2012)

(The first reference does not seem to be freely available.)
-It was clear from the beginning that the authors of the second paper have a strong bias against automated scoring. That is entirely acceptable… everyone can have an opinion. But then, we can’t expect the paper to be objective, can we? At least I did not find it objective. I thought there would be some form of well-defined scientific study there. But all they discussed was one single student essay, and they generalized that to the performance of the automated system as a whole (I wonder why it is not possible to do the same the other way round, by finding an essay where it works and concluding equally confidently that the automated system is the best! :P). Further, I felt that an analysis of why the automated system flagged those errors was not performed. A major criticism about spelling was that the machine identified words like “texting” and “i.e.” as spelling errors. But if “i.e.” was supposed to be written as “i.e.,” and “texting” is not a word in, say, the English dictionary, I think this should be expected. In fact, I would guess that a very conservative human evaluator might point these things out too. So, based on this single student essay analysis (some of which is debatable), it is concluded that the tool should be banished from classrooms… (this is where I start getting disillusioned… should I really continue these readings?? They seem so biased, and the analysis seems more like over-generalization than science anyway!)

The only interesting point in this report was about the machine’s bias towards standard American English. I am curious to know more on this aspect. Actually, I did find the starting premise (that the machines overemphasize stylistic errors) interesting… but this report did not live up to its own promise in terms of the quality of analysis provided.

Finding 4 machines cannot score writing tasks long and complex enough to represent levels of writing proficiency or performance acceptable in school, college, or the workplace (Bennett, 2006; Condon, 2013; McCurry, 2010; Perelman, 2012a)

Report 1 – Bennett, 2006: There are two major claims in this. 1) “Larger differences in computer familiarity between students with the same paper writing proficiency would be associated with correspondingly bigger discrepancies in computer writing scores” and 2) “The weighting of text features derived by an automated scoring system may not be the same as the one that would result from the judgments of writing expert”
-Actually, although 1) seems rather obvious (as we need to “type” essays on a computer), the report proposes no real solution. 2) Of course the weighting would be different between humans and machines. Machines don’t learn like humans and humans don’t learn like machines! But when there is not much discrepancy between their scores… and as long as the users are satisfied, I guess this is something that we can live with. Anyway, the paper did not suggest any better alternative.

Condon, 2013 and McCurry, 2010 are not freely accessible.

Report 4 – Perelman, 2012a: The conclusion of this report is that the way automated essay scoring systems evaluate language constructs is not the way actual writing teachers do it. Although this is an important point to be addressed, the other person in me is always ready to say: what we learn and what the machine learns need not always be the same. Our routes to reaching the same conclusion might really not cross at all!

Concluding this part:
In a sense, the whole thing is really amazing. On the one hand, they talk about the shortage of human evaluators to grade student scripts. On the other hand, they want a ban on automated assessment. I wonder what exactly the point of humanreaders.org is, even after reading so many reports! I don’t understand their solution, if there is any, yet.

The other claims and their associated reports might help (I hope!)

(To be continued)

Published on April 13, 2013 at 11:11 pm