This post is about a paper that I read today (which inspired me to write a real blog post after months!)
The paper: Linguistically Naive != Language Independent: Why NLP Needs Linguistic Typology
Author: Emily Bender
Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, pages 26–32. ACL.
In short, this is a position paper that argues that incorporating linguistic knowledge is a must if we want to create truly language independent NLP systems. Now, on the surface, that looks like a contradictory statement. Well, it isn’t... and it is common sense, in... er... some sense ;)
So, time for some background: an NLP algorithm that offers a solution to some problem is called language independent if the approach can work for languages other than the one for which it was initially developed. One common example is Google Translate. It is a practical example of how an approach can work across multiple language pairs (with varying degrees of success, of course, but that is a different matter). The point of these language independent approaches is that, in theory, you can just apply the algorithm to any language as long as you have the relevant data for that language. However, such approaches in contemporary research typically eliminate any linguistic knowledge from their modeling and thereby make it “language” independent.
Now, what the paper argues for is clear from the title – “linguistically naive != language independent”.
I liked the point made in Section 2: in some cases, the surface appearance of language independence is actually a hidden language dependence. The specific example of n-grams, which work efficiently only for languages with certain kinds of properties and yet come with a claim of language independence, nailed down the point. Over a period of time, I became averse to the idea of using n-grams for each and every problem, as I thought this does not give any useful insights from either a linguistic or a computational perspective (this is my personal opinion). However, although I did think of this language dependent aspect of n-grams, I never clearly put it this way, and I just accepted that “language independence” claim. Now, this paper changed that acceptance. :-)
One good thing about this paper is that it does not stop there. It also explains approaches that use language modeling but do slightly more than n-grams to accommodate various types of languages (factored language models), and it also talks about how a “one size fits all” approach won’t work. There is this gem of a statement:
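To make the n-gram point a bit more concrete for myself, here is a minimal sketch in Python. It is my own illustration, not something from the paper. The "language" is entirely made up: the stems, the number and case markers, and the function names `isolating`, `agglutinating` and `word_types` are all hypothetical. The idea is to express the same grammatical information either as separate words or fused into one word form, and then count how many distinct word types a word-level n-gram model would have to estimate.

```python
from itertools import product

# Hypothetical toy vocabulary: invented for illustration, not real language data.
STEMS = ["kat", "dom", "lib", "ran"]
NUMBER = ["sg", "pl"]
CASE = ["nom", "acc", "dat"]


def isolating(stem, number, case):
    """Express each feature as its own word (analytic style)."""
    return [stem, number, case]


def agglutinating(stem, number, case):
    """Fuse all features into a single word form (synthetic style)."""
    return [stem + number + case]


def word_types(realize):
    """Return (token count, distinct word types) over every feature combination."""
    tokens = []
    for combo in product(STEMS, NUMBER, CASE):
        tokens.extend(realize(*combo))
    return len(tokens), len(set(tokens))


iso_tokens, iso_types = word_types(isolating)
agg_tokens, agg_types = word_types(agglutinating)

print(f"isolating:     {iso_types} types over {iso_tokens} tokens")
print(f"agglutinating: {agg_types} types over {agg_tokens} tokens")
```

In this toy setup, the isolating variant ends up with 9 word types over 72 tokens, so every word is seen many times. The agglutinating variant has 24 word types over 24 tokens, so every word form is seen exactly once. A word-level n-gram model has plenty of counts to work with in the first case and almost none in the second, even though the information being expressed is identical. Real languages are of course far messier than this, but it is one way to see how "just count the words" quietly assumes a particular kind of morphology.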
“A truly language independent system works equally well across languages. When a system that is meant to be language independent does not in fact work equally well across languages, it is likely because something about the system design is making implicit assumptions about language structure. These assumptions are typically the result of “overfitting” to the original development language(s).”
Now, there is this section on language independence claims and the representation of languages belonging to various families in the papers of ACL 2008. It concludes by saying:
“Nonetheless, to the extent that language independence is an important goal, the field needs to improve both its testing of language independence and its sampling of languages to test against.”
Finally, the paper talks about one form of linguistic knowledge that can be incorporated into NLP systems: linguistic typology. It also gives pointers to some useful resources and relevant research in this direction.
I too will conclude the post with the two main points that I hope people in the research community noticed:
(1) “This paper has briefly argued that the best way to create language-independent systems is to include linguistic knowledge, specifically knowledge about the ways in which languages vary in their structure. Only by doing so can we ensure that our systems are not overfitted to the development languages.”
(2) “Finally, if the field as a whole values language independence as a property of NLP systems, then we should ensure that the languages we select to use in evaluations are representative of both the language types and language families we are interested in.”
A good paper, and a considerable amount of food for thought! These are important design considerations, IMHO.
The extended epilogue:
At NAACL-2012, there was this tutorial titled “100 Things You Always Wanted to Know about Linguistics But Were Afraid to Ask”, by Emily Bender. Although I could in theory have attended the conference, I could not, as I had to go to India at that time. But this was one tutorial that caught my attention with its name and description, and I really wanted to attend it.
Thanks to a colleague who attended, I managed to see the slides of the tutorial (which I later saw on the professor’s website). Last week, during some random surfing, I realized that a more elaborate version had been released as a book:
Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax
by Emily Bender
Pub: Synthesis Lectures on Human Language Technologies, Morgan and Claypool Publishers
I happily borrowed the book through inter-library loan, and it travelled for a few days and reached me from somewhere in Lower Saxony to here in Baden-Württemberg. Just imagine, it travelled all the way just for my sake! ;) :P
So, I started to go through the book. Even in the days when I lacked any basic knowledge of this field, I always felt that natural language processing should involve some form of linguistic modeling by default. However, most of the successful so-called “language independent” approaches (some of which also became the products we use regularly, like Google Translate and Transliterate) never speak about such linguistic modeling (at least, not many that I read).
There is also this Norvig vs Chomsky debate, which I keep getting reminded of when I think of this topic. (Neither of them is wrong in my view, but that is not the point here.)
In this context, I found the paper particularly worth sharing. Anyway, I perhaps should end the post. While reading the introductory parts of Emily Bender’s book, I found a reference to the paper, and this blog post came out of that reading experience.