What counts as a waste of tax payers’ money in research?

There was a time (2-3 years back) when my friend and colleague M and I used to discuss how doing full-time research in a public university is a waste of tax payers’ money. We were our respective cynical selves at that time, of course. But our general opinion (mine is still the same – I don’t know about my friend’s) was that research in a public university should also include teaching, mentoring and administrative duties, and should not be 100% pure research.

I did not see myself as a future academic at that time, but here I am.

Yesterday, I heard this bizarre comment about “some research” being a waste of tax payers’ money and I have been wondering ever since – how do we decide? and who decides? and what exactly is a waste?

A few months ago (April-May 2017), a famous robotics professor got an award for teaching and gave a talk on teaching in large classrooms. I attended the talk – it was great, and I learned a lot. He made a calculation in the talk which fascinated me. When you are teaching a technical course for the first time, going by his estimate of the number of hours of preparation you put in, if you convert your salary to a per-hour basis, you end up earning about the same as a McDonald’s service counter employee. He went on to show how that gets better as time progresses and you repeat courses. (I do not mean that working at McDonald’s is a bad job. I don’t know about that job. I am saying – for all the aura around a PhD and professor-hood – it can also be one of the poorly paying jobs for a PhD.)
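A minimal sketch of that kind of calculation, in Python. The talk's actual numbers are not given here, so the pay per course, the contact hours and the preparation ratios below are made-up placeholders, and `effective_hourly_rate` is just an illustrative name. The only point is to show how heavy first-time preparation pulls the per-hour figure down, and how it recovers when the course is repeated.

```python
def effective_hourly_rate(pay_for_course, contact_hours, prep_hours_per_contact_hour):
    """Pay divided by all hours actually worked (teaching plus preparation)."""
    total_hours = contact_hours * (1 + prep_hours_per_contact_hour)
    return pay_for_course / total_hours


# Hypothetical numbers, NOT taken from the talk.
PAY_FOR_COURSE = 6000   # share of salary attributed to one course
CONTACT_HOURS = 40      # hours spent in the classroom

# First time teaching: assume 9 hours of preparation per classroom hour.
first_time = effective_hourly_rate(PAY_FOR_COURSE, CONTACT_HOURS, 9)

# Repeating the course: assume preparation drops to 2 hours per classroom hour.
repeat = effective_hourly_rate(PAY_FOR_COURSE, CONTACT_HOURS, 2)

print(first_time, repeat)
```

With these placeholder values the per-hour figure more than triples between the first offering and a repeat, which is the shape of the argument, whatever the real numbers were.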

With that context, one can argue that profs are under-paid (at the junior level at least). One can perhaps support this statement by looking for corporate trainers and their salaries. Also, generally, there are two kinds of academic research that happen:

a) funded (funded by the university or some external agencies)
b) un-funded (where the professor, or some enthusiastic students try to work on small projects by themselves, in their free time)

Funded research by default means –
a) someone thought our idea is worth funding and may have the potential to benefit humanity
b) the students get paid for working on that
c) their tuition gets paid in some cases
Unfunded research, of course, will not give any of these.

So, in that context, how does one individual decide that one piece of research benefits humanity and another is a waste of tax payers’ money? What are the parameters? Is that judgement reproducible? If it is a waste of money, why did it get funded in the first place? The question is even more absurd in the case of unfunded research, of course, because none of the involved parties get paid anyway – they are doing it in their spare time.

I mean – one can argue
a) Organizing conferences in fancy hotels is a waste of tax payers’ money.
b) Going to 5-6 national and international conferences a year is a waste of tax payers’ money.
c) Staying in fancy hotels instead of normal ones during these official trips is a waste of money.

But, passing value judgements on a topic of research? How do we decide? What is the rationale? And that, instead of asking whether the topic itself was explored thoroughly and properly in the presented work? I have never heard such a thing in my life as a conference goer (incidentally, my first one was IJCAI 2007). It is a part of “naa-ism” I guess – saying: “Anything I do is great. Everybody else, of course, does pure trash.”

NOTE: This is not a well thought out critique of the question “Who decides what research project is a waste of tax payers’ money?”. It is more like a dump of my current thoughts. Based on time, I may perhaps write a more well-argued post.

Published on September 23, 2017 at 3:31 pm

Todas and their songs – Notes from an article

I got curious about the Toda people after noticing in a dictionary that there are 50 or more words related to buffaloes in their language. So, I started reading through the Wikipedia article on them, which had a reference to the following article:

Oral Poets of South India: The Todas
M. B. Emeneau
The Journal of American Folklore
Vol. 71, No. 281, Traditional India: Structure and Change (Jul. – Sep., 1958), pp. 312-324
Published by: American Folklore Society
DOI: 10.2307/538564
Stable URL: http://www.jstor.org/stable/538564
(It is free to read if you create a login)

In this post, I am just listing a few notes I made for myself while reading. Some of them are direct quotes from the article and some of them are my summaries.

* The language and culture are apparently quite different from others around.

“The culture of the Todas is just as divergent from its Indian roots as is their language, because of their long isolation (since the beginning of the Christian era, as I think I have now proved) from the general streams of Hindu culture. This isolation was produced both by their geographical situation on a lofty, 8000-foot-high plateau and by the general framework of the Hindu caste system within which they and their few neighbors live. This social framework favors diversity within unity, and on the Nilgiri plateau, an area of forty by twenty miles, has allowed four communities to live symbiotically, but with four remarkably different cultures and four mutually unintelligible languages”

* It is amazing that a community of 600 people has a complex caste system with sub-castes and clans (“right down to the individual family”)

* When I saw a Toda dictionary earlier this week, I wondered at the number of words referring to Buffaloes in their vocabulary, and thought it should have some religious significance. Here is what Emeneau says:

“The care of the buffaloes has been made the basis of religion. Every item of dairy practice is ritualized, from the twice daily milking and churning of butter to the great seasonal shifting of pastures, the burning over of the dry pastures, and the giving of salt to the herds.”

* Songs seem to be a very important part of their culture

“It was not long after my work started on the Toda language that I found that the utterances of greatest interest to the Todas themselves were their songs, and that here was a new example of oral poetry.”

* Linguistic structure of these songs is described in detail by the author. I don’t think I fully understood, but there are things that fascinated me at the first glance:

“Sentences consist of from one sung unit to as many as five or six or even seven, with a possibility of quite complicated syntax. But, one very striking feature of the structure, no such sentence may be uttered without being paired with another sentence exactly parallel to it in syntactic structure and in number of units.”

* They seem to have a song for every event in their culture, and specific words and phrases for such events.

“We do not know much about the history of the song technique, but it became clear after a large number of songs had been recorded, that in the course of the presumably long development of the technique, every theme in Toda culture and every detail of the working out of every theme have been provided with one or several set patterns of words and turns of phrase for use in song.”

* I found these remarks on the role of songs in their culture quite amusing: “Given the technique and the interest in the songs, a corollary but perhaps unexpected consequence is that every Toda can and does compose songs” and “Every Toda can be his own poet laureate.”

* What was interesting was this comment on the music of these songs:

“I was told, however, that ideally every new song that is sung should have a new tune. One composer went so far as to tell me that only the tunes matter; anyone can compose the words.”

* Finally, the concluding remarks had a very curious observation:
“There is in their world view no urge to universalize the themes of their culture and the verbal expression of them. At the same time there is no urge towards self-expression; it is, in fact, an urge that would be out of place and might even be divisive in the closed culture of a small community. Their poetry then is strictly a miniature and provincial, even parochial, art with many limitations. …”
– I found it pretty cool – not bothering about universalizing their themes. Not bothering about too much of self-expression. But just depicting their own world, community and culture.

Overall, pretty interesting stuff.

Published on September 3, 2017 at 7:38 pm

Buffalo Vocabulary

Last month, a friend asked me to get some information for him from a book that was not accessible where he lived. The book is:

Toda Vocabulary: A Preliminary List
Tsuyoshi Nara and Peri Bhaskara Rao
(more info on WorldCat)

Not having heard of the language, and not knowing anything about it, I was just doing the job of an information provider, passing on to my friend the information he asked for (even if I could not understand it). However, this morning, as I started skimming through, I was intrigued by the number of words referring to buffaloes in the Toda vocabulary. They must have some kind of cultural significance in the Toda people’s lives! (The Wikipedia article on the Toda people gives some context.)

I am listing down some of the English meanings (49 words!!!) – not listing the original Toda words for them as it was in phonetic alphabet, print book, and it is difficult to type 🙂

Note: The book is organized in three ways: words sorted by Toda word, by Toda word endings, and by English meanings. I browsed through the English meanings (in which buffalo, a buffalo, male buffalo and wild buffalo all appear as entries under b, a, m and w, and not together). So, what I am writing is not exhaustive. This is just what I found by skimming through the pages quickly.

1. a buffalo allows calf to suck
2. a buffalo gives a side glance before charging
3. a buffalo that goes dry
4. a buffalo ready for milking
5. a buffalo let out to graze early in the morning before being milked
6. barren buffalo
7. buffalo calf (between one and two years of age)
8. buffalo calf (between two and three years of age till it becomes pregnant)
9. buffalo of Kurpoly temple of the kas clan
10. buffalo pen (2 words)
11. buffalo that has given birth to a calf
12. buffalo with beautiful horns
13. buffalo with divine power
14. buffaloes contributed by Poyol
15. buffalo given as a gift to one’s daughter or as a share to one’s sons
16. buffaloes that accompany another buffalo that is being driven
17. buffaloes that have gone astray for the night without returning to their pen
18. buffaloes given as gift by the father of a bride to the father of a bridegroom
19. buy and bring a buffalo
20. when a buffalo scratches itself
21. when a buffalo suspends the flow of milk while being milked
22. buffalo that brings luck
23. a person who bought the buffalo
24. a stick traditionally used to drive buffaloes
25. adolescent male buffalo
26. afterbirth of buffaloes
27. callus formed on the thumb due to milking buffaloes
28. ceremony of first milking of temple buffaloes
29. ceremony of giving salt to buffaloes in the season when kor grass grows
30. drive buffaloes on migration
31. drive calf away from udder
32. dry buffalo
33. female buffalo calf
34. female buffalo heifer between 2–3 years of age
35. milch buffalo
36. buffalo that is not pregnant
37. offering buffalo calf in Ti: temple
38. own buffaloes
39. one generation of buffaloes
40. pregnant buffalo
41. relationship between buffaloes and men
42. sacred buffalo/temple buffalo
43. smear buffalo dung on (ritual cleaning)
44. stone or post at which buffalo is killed at funeral
45. two buffaloes that give milk to the same calf
46. vulva of a buffalo
47. wild male buffalo
48. wild female buffalo
49. wild buffalo

In the process, also noticed a few other interesting, specialized words:
words for stone:
* an arrangement of three stones (in front of a temple) in the shape of the Greek letter Pi
* a stone kept at some temples on which some milk is sprinkled by the priest before he takes the milk into the temple (milk-pour-stone)
* a stone placed near a buffalo pen
* stones put on top of a temple
* a stone which marks the place where women receive buttermilk
* stone or post at which buffalo is killed at funeral
* stone used in weight lifting competitions
(all have same word ending as far as I remember)
* large rock standing by itself

Quite a fascinating experience! Maybe more on this later if I get to read something about the Toda people and their culture.

Published on September 3, 2017 at 6:27 pm

language independence in NLP – some thoughts

In the last session of our reading group, we discussed the following article:

title: “On achieving and Evaluating language independence in NLP”
author: Emily Bender
Linguistic Issues in Language Technology, 2011.
url here
The article is an extended version of a 2009 writeup, about which I wrote here.

To summarize in a few words, the article first discusses what language independence in natural language processing system development means – in theory and in practice. Then, taking linguistic typology as a source of knowledge, it suggests some do’s and don’ts for NLP researchers working on the development of language independent systems. I liked the idea that true language independence is possible only through the incorporation of linguistic knowledge into the system design. It took me only a few seconds to be convinced about it when I read that 2009 paper, and my opinion has not changed since. My experience of working with non-English language datasets in the meanwhile only strengthened that opinion.

Reading the article with a bunch of people with a linguistic and not an engineering background this time gave me some new perspective. The most important thing I noticed is this: I think it is fairly common in CS-based NLP communities to claim language independence by assuming that an approach that works on one or two languages, often closely related ones, will work on any other language. I never knew what linguists think about that. The linguists in our group first wondered how anyone can claim language independence in general, and how difficult it is to make such a claim. We even briefly went into a philosophical discussion. As someone who started with NLP in a CS department, I should confess I never even thought of it like this until now. People in so many NLP papers claim language independence in an off-hand manner, and I suddenly started seeing why it could be a myth. That was the “aha” moment for that day.

Anyway, coming back to the paper, after Section 4, which has the Do’s and Don’ts, I found Section 5 incomplete. It is an attempt to explain how Computational Linguistics is useful in typology and vice-versa, but I did not get a complete picture. There were a couple of recent papers which test the applicability of their approaches on multiple languages from different language families (two examples I can think of from 2015 are: Soricut and Och, 2015 and Müller and Schütze, 2015).

Nevertheless, it is a very well written article and is a must read for anyone who has wondered whether all the claims of language independence are really true, and whether there are implicit considerations that favor some languages over others in the development of natural language processing systems.

Thanks to Maria, Simon, Xiaobin and Marti for a very interesting discussion!

Published on November 21, 2015 at 6:50 pm

Automatic question generation for measuring comprehension – some thoughts

In our weekly/fortnightly reading group here, we spent most of the past 2 months discussing “automatic question generation”. We discussed primarily NLP papers but included a couple of educational research papers as well. NLP papers usually focus on the engineering aspects of the system and are usually heavy on the computation required. Educational research papers primarily focus on performing user studies with some approach to question creation and then correlating the users’ performance on these questions with text comprehension. So, these are usually low on the computational part. That is roughly the difference between the two kinds of articles we chose to read and discuss.

Now, as we progressed with these discussions, and as more people (with diverse backgrounds) joined the group for a couple of sessions, I realized that I am learning to see things from different perspectives. I am now writing this post to summarize what I thought about these articles, what I learnt through these discussions, and what I think about the whole idea of automatic question generation at this point in time. I will give pointers to the relevant articles. Most of them are freely accessible. Leave a comment if you want something you see here and can’t get access. Questions here were of two kinds – factual questions from the text (“Who did what to whom” kind of things) and fill-in-the-blank questions where one of the key words goes missing.

Let me start with a summary of the stuff we discussed in the past few weeks:
a) We first started with the generation of factual questions from any text i.e., the purpose of the system here is to generate questions like – “When was Gandhi born? Where was Gandhi born?” etc., from a biography page on Mahatma Gandhi. Here, we primarily discussed the approach followed by Michael Heilman. More details about the related articles and the released code can be seen here. Here, the primary focus of the approach has been to generate grammatically correct questions.

b) We then moved to more recent work from Microsoft Research, published in 2015, where the task of “generating the questions” is transformed by using crowd sourcing to create question templates. So, the primary problem here is to replicate the human judgements of relevant question templates for a given text, by drawing inferences about the category of the content in a particular section of text through machine learning. (I am trying to summarize in one sentence, but someone wanting to know more please read the article). The resource created and the features used to infer category/section will eventually be released here.

c) At this time, after a slight digression into the cognitive and psycholinguistic aspects of gap filling probabilities, we got into an article which manually designed a fill-in-the-blank kind of test which allegedly measures reading comprehension. They concluded that such tests are quick to create, take less time to administer, and still do what you want out of such a test (i.e., tell you how much the readers understood).

d) Naturally, the next question for us was: “How can we generate the best gaps automatically?”. Amidst a couple of articles we explored, we again picked an older article from Microsoft Research for discussion. This is about deciding what gaps in a sentence are the best to test the “key concepts” in texts. Again, the approach relies on crowd sourcing to get these judgements from human raters first, and then develops a machine learning approach to replicate this. The data thus created, and some details about the machine learning approach implementation can be found here.
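To make the gap-filling setting in (c) and (d) concrete, here is a minimal Python sketch. It is not the approach of any of the papers above: those rely on crowd-sourced human judgements and a learned model to pick the best gaps, whereas this just treats every non-stopword as a candidate. The tiny stopword list and the function names `candidate_gaps` and `make_gap_question` are my own assumptions for illustration.

```python
import re

# A tiny, hand-picked stopword list (an assumption for this sketch only).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "is", "was"}


def candidate_gaps(sentence):
    """Naive heuristic: every alphabetic token that is not a stopword."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [t for t in tokens if t.lower() not in STOPWORDS]


def make_gap_question(sentence, answer):
    """Blank out the first whole-word occurrence of `answer` in `sentence`."""
    pattern = r"\b" + re.escape(answer) + r"\b"
    question, replaced = re.subn(pattern, "_____", sentence, count=1)
    if replaced == 0:
        raise ValueError(f"{answer!r} does not occur in the sentence")
    return question, answer


sentence = "Gandhi was born in Porbandar in 1869."
candidates = candidate_gaps(sentence)
questions = [make_gap_question(sentence, word)[0] for word in candidates]

for q in questions:
    print(q)
```

The interesting research problem is exactly the part this sketch skips: ranking the candidates so that the chosen gap tests a key concept rather than an arbitrary word.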

Now, my thoughts on the topic in general:
a) To be able to generate real “comprehension” testing questions from any possible text, we should make sure that we do not mistakenly end up testing the ability of a reader to remember the text. I did not get a clear picture of how fill-in-the-blank questions avoid this pitfall. Generating who? what? kind of questions instead of fill-in-the-blanks perhaps addresses this to some extent. Yet, if these questions only require you to know that one sentence, how are they really measuring comprehension of the whole piece of text, when comprehension can include drawing inferences from multiple parts of the text?

b) One dis-satisfying aspect of all these readings has been that: people who do user-studies don’t talk about the scalability of their method beyond a laboratory setup and people who engineer technological solutions don’t discuss if these approaches are really working with real users in testing their comprehension. I was surprised that several NLP papers I read on the topic in the past weeks (apart from those mentioned above) talk about question generation approaches, evaluate on some dataset about the correctness or relevance of the “questions” generated (be it gap-filling or questions with a question mark). But, I haven’t seen anyone do an evaluation on the possible consumers of such an application. The only exception in my readings has been – Michael Heilman’s PhD thesis, where they evaluated their question generation approach as a possible assisting tool for teachers to prepare questions.

On one hand, I think this is a very interesting topic to work on, with all the possible commercial and not-so-commercial real-life impact it can have in these days of massive online education and non-conventional ways of learning. Clearly, there is a lot of work going on into various ways of generating questions automatically, which is a very useful method to have in such massive learning scenarios. We know what approaches “kind of” work and what don’t, in generating the questions as such. However, I wonder what exactly we are trying to achieve by not doing the final step of user evaluation with these computational approaches. If we do not know whether all the fancy approaches are really doing what they are supposed to do (testing the comprehension of the readers), what is the point? To use a tennis term, the missing “follow through” is why much of this work remains unusable for its actual consumers – teachers, learners and other such people in a learning environment. I am not a dreamer, so I know the difficulties in working across groups and I can guess the reasons for the missing “follow through” (especially as someone currently in academia!).

The only way I see the “follow through” being possible is in an ed-tech company, since they have to do the user evaluation to get going 🙂 Perhaps I should wait and see if new ed-tech startups working on learning analytics and measuring learning outcomes can come up with effective solutions. On that optimistic note, I should perhaps end my post for now.

Acknowledgements: I have benefited a lot from the comments on these papers by Magdalena Wolska, Maria Chinkina, Martí Quixal, Xiaobin Chen and Simón Ruiz, who attended some or all of these meetings in the past few months. Long live discussions! 😉

Published on October 22, 2015 at 4:06 pm

Comments on the Editorial of “Machine Learning For Science and Society” issue

For whatever reason, I am more fascinated by the applied aspects of any research and Machine Learning (ML) is not an exception. While I use machine learning approaches in my work and studied basics during my masters (.. and on and off during my PhD now), I never found much information on what happens to all the hundreds of new algorithms proposed every year. How many of them actually get used by non-ML researchers working on some other problem? How many of them get used by others who want to solve some real-world problems?

I attended the Machine learning summer school in 2013, where, for two weeks, I was fortunate enough to listen to some of the best researchers in the field speak about ML in general and their work in particular. However, I got a feeling that the community is not so keen on a reality check about the applicability of these algorithms. So, basically, the questions remained.

“Machine learning that matters” (Kiri Wagstaff, 2012) is an article I keep thinking about whenever this sort of discussion comes up with fellow grad-students. (My thoughts on it here). In the past few days, there have been a lot of short online/offline discussions about how an effort to do more evaluation on real-world scenarios/datasets is perceived by reviewers in various academic conferences (disclaimer: these discussions are not exclusively about ML but some of the people in these discussions happen to be grad-students working in ML).
We, with our own shortcomings and limitations, drew some conclusions (which are perhaps not of interest to anyone) and I was reminded of another inspiring article that I thought about several times in the past few months.

The Article: Machine learning for science and society (Editorial)
Authors: Cynthia Rudin and Kiri L. Wagstaff
Details: Machine Learning (2014) 95:1–9
Url here

This article is an editorial for a special issue of Machine Learning Journal called “Machine Learning For Science and Society“. The issue is a collection of research papers that tackle some real life problems ranging from water pipe condition assessment to online-advertising through ML based approaches. While I did not go through all the papers in this edition yet, I think the editorial is worth a read to any person having a remote curiosity about the phrase “Machine Learning”.

It discusses the issues that arise when you decide to study the real-life impact of ML- What exactly counts as evaluation from the applied perspective? How much of this evaluation differs based on the application domain? How do domain experts see ML – do they look for a great model or a good model that is interpretable? How does the ML community see such research? What is ML good for? What is the need for this special focused issue at all? etc.,

I will not go on and on like this, but I would like to quote a few things from the paper, hoping it’s not a copyright violation.

The abstract:

“The special issue on “Machine Learning for Science and Society” showcases machine learning work with influence on our current and future society. These papers address several key problems such as how we perform repairs on critical infrastructure, how we predict severe weather and aviation turbulence, how we conduct tax audits, whether we can detect privacy breaches in access to healthcare data, and how we link individuals across census data sets for new insights into population changes. In this introduction, we discuss the need for such a special issue within the context of our field and its relationship to the broader world. In the era of “big data,” there is a need for machine learning to address important large-scale applied problems, yet it is difficult to find top venues in machine learning where such work is encouraged. We discuss the ramifications of this contradictory situation and encourage further discussion on the best strategy that we as a field may adopt. We also summarize key lessons learned from individual papers in the special issue so that the community as a whole can benefit.”

Then, the four points starting from: “If applied research is not considered publishable in top ML venues, our field faces the following disadvantages:”

1. “We lose the flow of applied problems necessary for stimulating relevant theoretical work ….”
2. “We further exacerbate the gap between theoretical work and practice. …”
3. “We may prevent truly new applications of ML to be published in top venues at all (ML or not). …”
4. “We strongly discourage applied research by machine learning professionals. … “

(Read the relevant section in the paper for details.)

The paragraph that followed, where examples of a few applications of ML were mentioned:

“The editors of this special issue have worked on both theoretical and applied topics, where the applied topics between us include criminology (Wang et al. 2013), crop yield prediction (Wagstaff et al. 2008), the energy grid (Rudin et al. 2010, 2012), healthcare (Letham et al. 2013b; McCormick et al. 2012), information retrieval (Letham et al. 2013a), interpretable models (Letham et al. 2013b; McCormick et al. 2012; Ustun et al. 2013), robotic space exploration (Castano et al. 2007; Wagstaff and Bornstein 2009; Wagstaff et al. 2013b), and scientific discovery (Wagstaff et al. 2013a).”

Last, but not the least, the comments on inter-disciplinary research just had such an amount of resounding truth in them that I put the quote up in my room and a few others did the same in the inter-disciplinary grad school I am a part of. 🙂

“..for a true interdisciplinary collaboration, both sides need to understand each other’s specialized terminology and together develop the definition of success for the project. We ourselves must be willing to acquire at least apprentice-level expertise in the domain at hand to develop the data and knowledge discovery process necessary for achieving success. ”

This has been one of those articles which I thought about again and again… kept recommending to people working in areas as diverse as psychology, sociology, computer science etc., to people who are not into academic research at all! 🙂 (I wonder what these people think of me for sending the “seemingly unrelated” article to read though.)

P.S.: It so happens that an ML article inspired me to write this post. But, on a personal front, the questions posed in the first paragraph remain the same even for my own field of research – Computational Linguistics and perhaps to any other field too.

P.S. 2: This does not mean I have some fantastic solution to solve the dilemmas of all senior researchers and grad students who are into inter-disciplinary and/or applied research and at the same time don’t want to perish since they can’t publish in the conferences/journals of their main field.

Published on July 8, 2014 at 3:15 pm

Notes from EACL2014

(This is a note taking post. It may not be of particular interest to anyone)


I was at EACL 2014 this week, in Gothenburg, Sweden. I am yet to give a detailed reading to most of the papers that interested me, but I thought it’s a good idea to list things down.

I attended the PITR workshop and noticed that there were more interested people, among both the authors and the audience, compared to last year. Despite the inconclusive panel discussion, I found the whole event interesting and stimulating, primarily because of the diversity of topics presented. There seems to be an increasing interest in performing eye-tracking experiments for this task. Some papers that particularly interested me:

One Step Closer to Automatic Evaluation of Text Simplification Systems by Sanja Štajner, Ruslan Mitkov and Horacio Saggion

An eye-tracking evaluation of some parser complexity metrics – Matthew J. Green

Syntactic Sentence Simplification for French – Laetitia Brouwers, Delphine Bernhard, Anne-Laure Ligozat and Thomas Francois

An Open Corpus of Everyday Documents for Simplification Tasks – David Pellow and Maxine Eskenazi

An evaluation of syntactic simplification rules for people with autism – Richard Evans, Constantin Orasan and Iustin Dornescu

(If anyone came till here and is interested in any of these papers, they are all open-access and can be found online by searching with the name)


Moving on to the main conference papers,  I am listing here everything that piqued my interest, right from papers I know only by titles for the moment to those for which I heard the authors talk about the work.

Parsing, Machine Translation etc.,

* Is Machine Translation Getting Better over Time? – Yvette Graham; Timothy Baldwin; Alistair Moffat; Justin Zobel

* Improving Dependency Parsers using Combinatory Categorial Grammar – Bharat Ram Ambati; Tejaswini Deoskar; Mark Steedman

* Generalizing a Strongly Lexicalized Parser using Unlabeled Data – Tejaswini Deoskar; Christos Christodoulopoulos; Alexandra Birch; Mark Steedman

* Special Techniques for Constituent Parsing of Morphologically Rich Languages – Zsolt Szántó; Richárd Farkas

* The New Thot Toolkit for Fully-Automatic and Interactive Statistical Machine Translation – Daniel Ortiz-Martínez; Francisco Casacuberta

* Joint Morphological and Syntactic Analysis for Richly Inflected Languages – Bernd Bohnet, Joakim Nivre, Igor Boguslavsky, Richard Farkas, Filip Ginter and Jan Hajic

* Fast and Accurate Unlexicalized parsing via Structural Annotations – Maximilian Schlund, Michael Luttenberger and Javier Esparza

Information Retrieval, Extraction stuff:

* Temporal Text Ranking and Automatic Dating of Text – Vlad Niculae; Marcos Zampieri; Liviu Dinu; Alina Maria Ciobanu

* Easy Web Search Results Clustering: When Baselines Can Reach State-of-the-Art Algorithms – Jose G. Moreno; Gaël Dias


* Now We Stronger than Ever: African-American English Syntax in Twitter – Ian Stewart

* Chinese Native Language Identification – Shervin Malmasi and Mark Dras

* Data-driven language transfer hypotheses – Ben Swanson and Eugene Charniak

* Enhancing Authorship Attribution by utilizing syntax tree profiles – Michael Tschuggnall and Günter Specht

* Machine reading tea leaves: Automatically Evaluating Topic Coherence and Topic model quality by Jey Han Lau, David Newman and Timothy Baldwin

* Identifying fake Amazon reviews as learning from crowds – Tommaso Fornaciari and Massimo Poesio

* Using idiolects and sociolects to improve word predictions – Wessel Stoop and Antal van den Bosch

* Expanding the range of automatic emotion detection in microblogging text – Jasy Suet Yan Liew

* Answering List Questions using Web as Corpus – Patricia Gonçalves; Antonio Branco

* Modeling unexpectedness for irony detection in twitter – Francesco Barbieri and Horacio Saggion

* SPARSAR: An Expressive Poetry reader – Rodolfo Delmonte and Anton Maria Prati

* Redundancy detection in ESL writings – Huichao Xue and Rebecca Hwa

* Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules – Advaith Siddharthan and Angrosh Mandya

* Verbose, Laconic or Just Right: A Simple Computational Model of Content Appropriateness under length constraints – Annie Louis and Ani Nenkova

* Automatic Detection and Language Identification of Multilingual Documents – Marco Lui, Jey Han Lau and Timothy Baldwin

Now, in the coming days, I should at least try to read the intros and conclusions of some of these papers. 🙂

Published on May 2, 2014 at 3:10 pm

On Openmindedness

On an impulse, I started looking at the issues of a journal called Educational Researcher. I just started looking (just looking) at the titles of all articles since 1972. One of the titles I found was “On the Nature of Educational Research”, and these were the concluding remarks of that article:

“Openmindedness is not empty mindedness, however, and it is not tolerance of all views good or bad. It is having a sincere concern for truth and a willingness to consider, test, argue and revise on the basis of evidence our own and others’ claims in a reasonable and fair manner (Hare, 1979). This doesn’t mean that we will always reach agreement, or even that we will always be able to understand and appreciate the arguments of others, or that we cannot be committed to a position of our own. Openmindedness only requires a sincere attempt to consider the merits of other views and their claims. It does not release us from exercising judgement.”

From: “On the Nature of Educational Research” by Jonas F. Soltis. Educational Researcher. 1984. 13 (5)
If anyone has access, it can be read here.

The Hare, 1979 referred to in this quote is this.

I wonder if the quote is valid only in the context of education!

Published on April 15, 2014 at 1:03 pm

Significant peace

Now, the amount of mental peace I felt after reading this (even if it was just for a few moments) makes it inevitable that I should drop a line or two about it in my blog 🙂 Even if it’s momentary, I don’t consider the peace random or arbitrary. I consider it significant ;-).

Questions about the use of statistical significance for large datasets have been bugging me for some time now, although I never really did anything about them. The questions only kept coming back more and more frequently. Each time a reviewer asked about significance tests, especially, I wondered – “Won’t everything become significantly different if you have a large N?”. As the perennial fledgling researcher, though, my first instinct is to doubt my own understanding of the process.

I came across this piece, “Language is never, ever, ever, random” by Adam Kilgarriff, which brought me some mental peace in what is (in my imagination) one of the very confusing phases of my life at the moment 🙂

Here are the details of the paper:
Language is never, ever, ever, random
by Adam Kilgarriff
Corpus Linguistics and Linguistic Theory 1-2 (2005), 263-276

The abstract:
“Language users never choose words randomly, and language is essentially non-random. Statistical hypothesis testing uses a null hypothesis, which posits randomness. Hence, when we look at linguistic phenomena in corpora, the null hypothesis will never be true. Moreover, where there is enough data, we shall (almost) always be able to establish that it is not true. In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random, does not support the inference that it is not arbitrary. We present experimental evidence of how arbitrary associations between word frequencies and corpora are systematically non-random. We review literature in which hypothesis testing has been used, and show how it has often led to unhelpful or misleading results.”

And the take-home message (according to me):
Hypothesis testing has been used to reach conclusions, where the difficulty in reaching a conclusion is caused by sparsity of data. But language data, in this age of information glut, is available in vast quantities. A better strategy will generally be to use more data. Then the difference between the motivated and the arbitrary will be evident without the use of compromised hypothesis testing. As Lord Rutherford put it: “If your experiment needs statistics, you ought to have done a better experiment.”
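
To make the large-N worry concrete, here is a minimal sketch. It is an illustration only and is not taken from the paper: it uses a standard pooled two-proportion z-test, the function name is hypothetical, and the two word rates (1.00% vs. 1.02%) and the sample sizes are made-up numbers. The tiny difference between the rates is held fixed while only N grows.

```python
import math


def two_proportion_z_test(p1, p2, n):
    """Two-sided pooled z-test for two samples of equal size n
    with observed proportions p1 and p2 (hypothetical helper)."""
    pooled = (p1 + p2) / 2  # pooled proportion; valid because both samples have size n
    se = math.sqrt(2 * pooled * (1 - pooled) / n)  # standard error of the difference
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Assumed numbers: a word occurs at a rate of 1.00% in one corpus
# and 1.02% in another. Only the corpus size changes.
for n in (1_000, 100_000, 10_000_000):
    z, p = two_proportion_z_test(0.0100, 0.0102, n)
    print(f"N = {n:>10,}  z = {z:7.3f}  p = {p:.2g}")
```

With these assumed numbers, the p-value shrinks as N grows even though the difference between the two rates never changes, which is the effect the abstract describes.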

Published on March 4, 2014 at 11:50 am

The Stronger – August Strindberg

“Persona” was the first Ingmar Bergman movie I watched, in mid-2008 or so. In the years since, I have watched a couple of his other movies, read some of his writings, and reached Strindberg through him. However, “Persona” has remained the most intriguing movie, although it’s not my favorite Bergman movie. Although I don’t think I understand the movie, it was the one that raised my curiosity about Bergman as a writer and set me on the path of watching his other movies. While listening to the lectures on Bergman in the Scandinavian Film and Television course on Coursera, I learnt that Strindberg’s one-act play, “The Stronger”, was an inspiration for “Persona”.

[The word “inspiration” is very different from “copy”. Both the play and the movie are independent entities and are equally worth checking out. I personally would consider Persona to be a much more complex psychological drama, and it’s much longer.]

Now, “The Stronger” did not particularly fascinate me. But it is hard not to think about the characters and their possible interpretations after reading the play. It’s short, very short, but has its impact on the reader nevertheless. I will not say anything more, but will quote something that I read again and again in the play (no, not because I don’t understand English, but because the characters came alive in front of my eyes when I read the monologue).

“Everything, everything came from you to me, even your passions. Your soul crept into mine, like a worm into an apple, ate and ate, bored and bored, until nothing was left but the rind and a little black dust within. I wanted to get away from you, but I couldn’t; you lay like a snake and charmed me with your black eyes; I felt that when I lifted my wings they only dragged me down; I lay in the water with bound feet, and the stronger I strove to keep up the deeper I worked myself down, down, until I sank to the bottom, where you lay like a giant crab to clutch me in your claws–and there I am lying now.

I hate you, hate you, hate you! And you only sit there silent–silent and indifferent; indifferent whether it’s new moon or waning moon, Christmas or New Year’s, whether others are happy or unhappy; without power to hate or to love; as quiet as a stork by a rat hole–you couldn’t scent your prey and capture it, but you could lie in wait for it!”

Here is an interesting analysis of the play.

A few months back, I bought the screenplay of “Persona” and found a pdf of critical essays on Persona. Perhaps it’s time to start reading them soon! 🙂

Published on February 23, 2014 at 1:16 pm