Two decades of statistical language modelling

Well, I think I began reading this in the last week of May. With lots of breaks, I completed it towards mid-June. With a second round of breaks, I am finally writing about it now :)

The title of the paper is: “Two Decades of Statistical Language Modeling: Where Do We Go from Here?”
Author: Ronald Rosenfeld

Though I have been moving around with this phrase “statistical modelling” for almost three years now, it never appealed to me until some 3 months back. I don’t know why... perhaps I don’t want machines to “grow” ;) I dunno. Coming to the paper, I felt it serves its purpose. It basically summarizes the different statistical language modelling approaches and also points to future directions of research. Let me summarize the paper, the way I see it:

1. It started with an introduction to statistical language modelling – what it means and so on. In simple words, “statistical language modelling” refers to “capturing the regularities in language for the purpose of improving the performance of various natural language applications”.

2. Then, it gave examples of how statistical modelling is used in various natural language processing applications.

3. It commented on how “language models don’t actually realize that they are modelling a language” and on how to “put language back in modelling”.

4. It gave a brief overview of various Statistical language modelling techniques, principles, measures to evaluate, common weaknesses etc.

5. It then pointed towards future models of language processing and discussed the challenges ahead for statistical language modelling :)

More than anything else, this paper is a wealth of references. At around the same time I began reading this, I was struggling with something, for which one of these references provided the answer. :)
Well, to conclude, I’d suggest that people MUST read it if they are interested in statistical language modelling. Who am I to suggest, anyways? Those who are interested might have read this already ;)

And finally, some baatcheet: I was sitting in a hall 2 years back, listening to the same Roni Rosenfeld talking on “from natural language to language of life” at the Indian Institute of Science, wondering what the hell all this was. I didn’t understand a bit!! I doubt if my peanut brain can understand that discussion even now, though ;)

The paper can be accessed here.

Published on July 4, 2009 at 11:39 am

Automatic generation of Tamil lyrics for melodies

Paper details: “Automatic Generation of Tamil Lyrics for Melodies.”
Authors: Ananth Ramakrishnan A., Sankar Kuppan and Sobha Lalitha Devi

I was browsing through the schedule page of a workshop, “Computational Approaches to Linguistic Creativity” (2009), when I came across this title – “Automatic generation of Tamil lyrics for melodies” – and was quite fascinated by it. I am perennially suspicious about work on such sci-fi topics and how well it actually performs. Nevertheless, I was eager to read it. Back then, the workshop had not yet happened and the paper was not available for download. Now it is, and I finally got a chance to have a look at it (Interested? download-here)

So, as the name indicates, the paper describes a system which automatically generates lyrics for a given melody. They did this for Tamil. There was a general overview of the process and a mention of related work. There was this reference to a work about a “poetry generation system”!! I was shocked to a considerable extent. Most human poetry itself is unreadable and sometimes crappy. And we think about poetry generation! Man’s imagination indeed roams freely in thin air :) There was even a reference to some work on “lyric generation strategies” and I thought – “Oh! This is not sci-fi then, if so many are working in this direction!” :)

Coming to the paper, the process of lyric generation involves 2 steps:
1. Syllabic pattern generation
2. Identifying a phrase matching this pattern, as well as satisfying other word/sentence/rhyming requirements.

For the first part, they used a notation called KNM: K stands for Kuril (short vowel), N for Nedil (long vowel) and M for Mei (consonant). Taking their own example, the word thAmarai will be broken as “thA-ma-rai” and labelled as NKN. To generate such patterns for a given melody, they used Machine Learning (specifically, Conditional Random Fields, aka CRFs). The system was trained with sample film songs and their lyrics as input (and… I have doubts about the size of the data they trained with..). The trained model is then used to label a given melody with a syllabic pattern. This pattern is given to a sentence generation module, which generates a sentence that satisfies the following conditions:
1. Words should match the syllabic pattern
2. Sequence of words should have a meaning.
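
To get a feel for the KNM labelling, here is a rough sketch of my own (not the paper's code, which learns patterns from melodies with a CRF). It only labels syllables that are already split, and it assumes a romanization in which capital vowels and the diphthongs "ai"/"au" are long; the function names are hypothetical.

```python
# Hypothetical sketch: label pre-split, romanized Tamil syllables with K/N/M.
# Assumed convention (mine, not necessarily the paper's): capital vowels and
# the diphthongs "ai"/"au" mark a long vowel; a syllable with no vowel at all
# is treated as a bare consonant.
LONG_MARKERS = ("A", "I", "U", "E", "O", "ai", "au")
SHORT_VOWELS = ("a", "i", "u", "e", "o")


def label_syllable(syllable: str) -> str:
    """Return 'N' (Nedil), 'K' (Kuril) or 'M' (Mei) for one syllable."""
    if any(marker in syllable for marker in LONG_MARKERS):
        return "N"
    if any(vowel in syllable for vowel in SHORT_VOWELS):
        return "K"
    return "M"


def knm_pattern(syllables) -> str:
    """Concatenate the labels of a list of syllables into one pattern."""
    return "".join(label_syllable(s) for s in syllables)


print(knm_pattern(["thA", "ma", "rai"]))  # NKN
```

A candidate word would then satisfy condition 1 when its pattern equals the pattern predicted for the melody.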

This is baseline-level work, and they mention the ways they plan to improve their system in future. They also plan to experiment more with different strategies as well as data sets from different domains. Finally, they mention my “sci-fi” idea of poetry generation again :)
This is an extra brief summary of it. For more details, go on and read the paper online.

Published on June 11, 2009 at 10:27 am

On Smoothing Techniques

I know, I am going beyond the scope of the way I wanted my blog to be when I began it. But, three years is a long time and thoughts run wild. :) I know, perhaps I am going beyond the scope of my readers’ expectations too. But, this is my only mouthpiece :)

I heard the name “Kneser-Ney Smoothing” the other day and got curious to know what it is. (To understand what Smoothing means, go to the wiki page here). When I did my introductory course in Natural Language Processing (in late 2006), I came to know about smoothing for statistical language modelling. Back then, as far as I remember, we read about only three things:
1. Add-one smoothing
2. Good-Turing estimation
3. Witten-Bell smoothing
- I of course don’t remember anything but the names, though :p

Okay, in my over-curiosity to know more about Kneser-Ney smoothing, I found this presentation on smoothing techniques in NLP by Bill MacCartney. Again, since I don’t need anything from this specifically at the moment, I just browsed through the PPT to understand the gist of it. Here’s what I learnt about the different smoothing mechanisms:

1. Add-one smoothing is the simplest possible and the least effective one – as the textbooks say.
2. There is something called Additive smoothing which, from my understanding, is similar to Add-one smoothing, except that here it’s not add-1 but add-some-number between 0 and 1.
3. Good-Turing estimation – which aims at re-allocating the probability mass of those n-grams that occur (r+1) times in the training data to the n-grams that occur r times. [If you are more curious and need an example, see the PPT slides. If you are extra curious, go ahead and visit the wiki page]
4. Jelinek-Mercer smoothing – a kind of interpolation method. Quoting from the PPT: “nth-order smoothed model is defined recursively as a linear interpolation between the nth-order ML model and the (n − 1)th-order smoothed model.” I used a similar idea some time back, without realizing that there’s a name for it! :)
5. Katz Smoothing – This can best be explained by using the words from the PPT again:
“Count mass subtracted from nonzero counts is redistributed among the zero-count bigrams according to next lower-order distribution”
6. Witten-Bell Smoothing – a kind of Jelinek-Mercer smoothing.
7. Absolute Discounting – From what I understood, this is a kind of hybrid between interpolation-style smoothing like Witten-Bell and discounting-style smoothing like Katz.
8. Kneser-Ney Smoothing – the one for which this search began! This was described as an extension of absolute discounting. Anyways, after reading all these, I actually forgot that my search began with this name!
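
Just to fix ideas, here is a minimal sketch of items 1, 2 and 4 for a bigram model. It is my own toy illustration, not code from the slides: the helper names, the toy sentence and the values of `delta` and `lam` are all assumptions, and the Jelinek-Mercer version stops the recursion at the plain unigram estimate.

```python
from collections import Counter


def train_counts(tokens):
    """Collect unigram, bigram and bigram-context counts from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # how often each word starts a bigram
    return unigrams, bigrams, contexts


def additive_prob(prev, word, bigrams, contexts, vocab_size, delta=1.0):
    """Additive smoothing; delta=1.0 is add-one, 0 < delta < 1 is add-delta."""
    return (bigrams[(prev, word)] + delta) / (contexts[prev] + delta * vocab_size)


def jelinek_mercer_prob(prev, word, unigrams, bigrams, contexts, lam=0.7):
    """Linear interpolation of the bigram ML estimate with the unigram one."""
    p_uni = unigrams[word] / sum(unigrams.values())
    if contexts[prev] == 0:
        return p_uni  # unseen context: fall back to the lower-order model
    p_bi = bigrams[(prev, word)] / contexts[prev]
    return lam * p_bi + (1 - lam) * p_uni


tokens = "the cat sat on the mat the cat ate".split()
unigrams, bigrams, contexts = train_counts(tokens)
vocab = sorted(unigrams)

# An unseen bigram gets a non-zero probability under both schemes.
print(additive_prob("the", "sat", bigrams, contexts, len(vocab), delta=0.5))
print(jelinek_mercer_prob("the", "sat", unigrams, bigrams, contexts))
```

In practice the interpolation weight would be tuned on held-out data rather than fixed by hand as it is here.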

There were some comments on interpolation kind of smoothing vs discounting kind of smoothing – compare and contrast stuff.
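
To see the discounting side of that contrast, here is a small sketch of absolute discounting in its interpolated form for bigrams: a fixed discount is subtracted from every non-zero count and the freed mass is handed to the unigram distribution. Again, this is my own toy version under assumed names and an assumed discount of 0.75, not something taken from the slides.

```python
from collections import Counter


def absolute_discount_prob(prev, word, tokens, discount=0.75):
    """Interpolated absolute discounting for a bigram model (toy version).

    Counts are recomputed on every call to keep the sketch short; a real
    implementation would compute them once.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])

    p_uni = unigrams[word] / len(tokens)
    if contexts[prev] == 0:
        return p_uni  # unseen context: use the lower-order model directly

    # Number of distinct word types seen after `prev`.
    followers = sum(1 for (first, _second) in bigrams if first == prev)

    discounted = max(bigrams[(prev, word)] - discount, 0) / contexts[prev]
    reserved_mass = discount * followers / contexts[prev]
    return discounted + reserved_mass * p_uni


tokens = "the cat sat on the mat the cat ate".split()
print(absolute_discount_prob("the", "cat", tokens))  # seen bigram
print(absolute_discount_prob("the", "sat", tokens))  # unseen bigram
```

As far as I understand, Kneser-Ney keeps this shape but replaces the plain unigram estimate with one based on how many different contexts a word appears in.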

On the whole, educative. It would have been interesting if I were actually a student of this course now, and were asked to implement all these programmatically and write a report with performance analysis :)

Published on June 4, 2009 at 11:02 am

Data Measures that characterize classification problems

Somehow, in random surfing, I came across this Master’s dissertation dated Feb 2008, from the University of Pretoria, by Christiaan Maarten van der Walt. It is titled “Data Measures that characterize classification problems”. Well, I am a total dumbo as far as the mathematics of classification is concerned. Once or twice, I tried to decipher the myth, but each time, I have been intelligent enough to realize that it’s pointless to write with a broken pencil. (Now, if you don’t have enough math foundation, do you really expect to understand the internals of this stuff?)

Reading through this thesis, I felt it’s a very clearly written one, which was educative (at least for me) and informative too. I am not going to write an elaborate description of this thesis. Let me mention in brief what each chapter contains. To read it or not is left to the enthusiasts. I just read it in parts, randomly, and found that a lot of investigation went into it. Here you go:

Essentially, this thesis aims at understanding the relationship between the nature of data and the choice of a particular classifier. It goes about this in the following steps:

1. Identifying the properties of data which will affect a classifier’s performance
2. Proposing measures to quantify these properties
3. Validating the effectiveness of these measures
4. Using these measures to build a meta classifier and to explain its predictions about classification
5. Explaining the results of the experiments, their interpretation, the contributions to research, the shortcomings of this work and future directions.

There is a great deal of background work material provided for the enthusiasts, which I personally liked – for the depth of bibliography :)

I haven’t read the thesis top to bottom, but I can be sure that it gives a fair idea of how to choose a classifier for a specific kind of data set and the issues involved in the process of choosing. However, I felt they should have at least mentioned the impact of the domain of the data set on classifier choice. Or the effect of the statistical nature of the data set and its domain, put together, on classifier choice.

Anyways, the thesis can be accessed here. I liked it for its clarity. I’d go back to this when I really do some work which needs this background, quite confident about the content.

Published on June 2, 2009 at 11:11 am

May all your wishes come true!!

Wondering if I am praying for you? :) Not at all! I came across this paper:
May All Your Wishes Come True: A Study of Wishes and How to Recognize Them
- Funny name for a computer science paper, eh? I was browsing through the Conference schedule of NAACL-HLT 2009 (North American Chapter of the Association for Computational Linguistics – Human Language Technologies) and found this paper over there. The conference is scheduled to be held from May 31st to June 5th 2009.

So, what’s it about?: As the name indicates, this paper aims at developing a wish detector, which will enable extraction of information about people’s “wishes”. This can be understood as a branch of sentiment analysis, in the sense that, if you consider a product review, this kind of tool will let the manufacturers know about user expectations and desires, which can be kept in mind while developing the next versions.

WISH corpus?: Well, it seems that every year there’s something called the “ball drop” at Times Square, New York City, on New Year’s Eve. “In December 2007, the Times Square Alliance, coproducer of the Times Square New Year’s Eve Celebration, launched a Web site called the Virtual Wishing Wall that allowed people around the world to submit their New Year’s wishes. These wishes were then printed on confetti and dropped from the sky at midnight on December 31, 2007 in sync with the ball drop” – says this paper. The authors gained access to this WISH corpus and used it for their work. Ok, to give you an example of a wish: “I want to be the master of the world” is a wish. “Let everybody be happy and prosperous” is a wish… :)

Analysis of the WISH corpus: They analyzed this corpus according to the Topic and Scope of the wishes, using a pre-formed categorization of Topics (Eg: Love, Happiness, Health, Peace, Money etc.. 11 categories in all) and Scopes (Self, World, Family etc… 6 in all). Further, they analyzed these wishes according to geographical location (US and non-US) and concluded that wishes differ in topic and scope with geographical location (their results were statistically significant).

Building Wish Detectors: And here comes the actual part. For the purpose of a baseline, two simple wish detectors are built first:
1. Manually looking for sentences containing “I wish…” or “I hope…” etc. can give you a good number of wishes from a domain. So, some wish templates were obtained by analyzing the text patterns that can indicate a wish, to make a simple rule-based wish classifier. If a sentence matches some rule or pattern to some extent, it is a wish. Else, it is not. The authors say that this method might have good precision but low recall.
2. “Another simple method for detecting wishes is to train a standard word-based text classifier using the labeled training set in the target domain.” – This might have better recall, but lower precision.
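
The first baseline is easy to picture in code. Below is a tiny sketch of a template-based detector; the three patterns are hypothetical ones I made up from the examples in this post, not the templates used in the paper.

```python
import re

# Hypothetical hand-written wish templates (illustrative only).
WISH_TEMPLATES = [
    re.compile(r"\bi\s+(wish|hope)\b", re.IGNORECASE),
    re.compile(r"\bi\s+want\s+to\b", re.IGNORECASE),
    re.compile(r"^\s*(let|may)\b", re.IGNORECASE),
]


def is_wish(sentence: str) -> bool:
    """Return True if the sentence matches at least one wish template."""
    return any(template.search(sentence) for template in WISH_TEMPLATES)


for text in [
    "I want to be the master of the world",
    "Let everybody be happy and prosperous",
    "The battery lasts two days",
]:
    print(text, "->", is_wish(text))
```

A handful of strict patterns like these rarely fire on non-wishes, but they miss every wish phrased differently, which is the precision-versus-recall trade-off mentioned above.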

Now, enter the dragon: The authors propose a method of automatically discovering the wish templates, and a learning mechanism to learn the wish template features. They tested this classifier on various domains. The training was done using the WISH corpus and the testing was done on product reviews and politics!

Pretty interesting paper, though I did not get a “Wow!” feeling. The concept of domain-independent wish template generation interested me, though.

For enthusiasts, the paper can be read here.

Published on May 27, 2009 at 10:13 am

Breaking Audio Captchas

Details: Jennifer Tam, Jiri Simsa, Sean Hyde, and Luis von Ahn. Breaking Audio CAPTCHAs. In Advances in Neural Information Processing Systems (NIPS).

Firstly, the idea of an audio captcha itself intimidates me, imagining the whole process that an automated device should go through if it wants to decipher it. When I chanced upon this paper, I was so sure that I couldn’t make head or tail of it. Hence, I did not touch it for the past 1 month. It was left open in my browser, for I was reluctant to close it. Finally, I read it to some extent today. Now, I can only identify the head and the tail and still don’t know anything in depth, since all the speech processing stuff went above my head. Yet, I’ll try to summarize what I understood.

Audio Captchas can be understood as those in which the user is asked to listen to an audio clip containing the words to be identified, mixed with enough noise. I understand the target users are mostly the visually impaired, since they can’t access image-based Captchas. This paper analyzed different audio captchas (from Google, Digg and reCAPTCHA) and estimated which one of them is more robust, using different Machine Learning algorithms (AdaBoost, SVM and k-Nearest Neighbour). In this process, they also learnt about the strengths and weaknesses of these algorithms as such.

So, basically, in this paper, the focus was on understanding how much a machine can learn in its attempt to break audio captchas. The experiment was done in two stages, as with image captchas: segmentation and recognition. Machine Learning was used on the captcha segments to perform automatic speech recognition (ASR). The features for this were generated using popular speech feature extraction techniques like MFCC (Mel Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction) and RASTA-PLP (relative spectral transform-PLP). These names appeared scary enough and I did not go beyond them to find out more details :)

O my God, the creation of training data for these machine learning algorithms was a big exercise in itself. I began wondering about all the speech processing experiments involving huge amounts of training data. Patience man! Patience! Finally, all their speech samples are transformed into a set of segments of fixed size. Each segment is labelled as noise, digit or letter. These were used for training the classifiers, one type of captcha at a time.
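
The fixed-size segmentation step, at least, is simple enough to sketch. This is only my illustration of the general idea of chopping a signal into equal windows; the window size, the hop and the function name are assumptions, and the paper's actual procedure may well differ.

```python
def fixed_size_segments(samples, size, hop=None):
    """Split a sequence of audio samples into windows of `size` samples.

    `hop` is the step between window starts; by default windows do not
    overlap. A trailing remainder shorter than `size` is dropped.
    """
    hop = hop or size
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]


signal = list(range(10))  # stand-in for real audio samples
print(fixed_size_segments(signal, 4))         # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(len(fixed_size_segments(signal, 4, hop=2)))  # 4
```

Each window would then get one of the three labels and a feature vector (MFCC, PLP and so on) before going to the classifier.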

So, I’ve blah-ed enough and hence, I’ll jump to conclusions:
1. An audio captcha that consists of a finite vocabulary and background noise should have multiple speakers and noise similar to the speakers.
2. Captchas that have longer solutions and multiple speakers tend to be difficult to solve.
3. Having a large vocabulary will make it difficult to collect enough training data for those who aim to break these captchas.
4. To develop an audio captcha with an improved human pass rate, the authors plan to take advantage of the human mind’s ability to understand distorted audio through context clues.

And finally, it made an interesting read. Some cool work. :)

The paper can be read here.

Published on May 21, 2009 at 3:25 pm

Computers and iPhones and Mobile Phones, oh my!

I was going through this recent paper from Google, published at WWW ’09:

“Computers and iPhones and Mobile Phones, oh my! A logs-based comparison of search users on different devices.”

- Thought I’d share my thoughts post reading.

My opinion on the merits of any particular usability study basically fluctuates a lot. One moment it appears a very practical and useful study, the next moment it appears a stunt, and some other moment it appears interesting but useless. When I began this paper, I began in the first mode, went through the other modes and finally ended up in the first mode towards the end :)

So, what is this paper about?: As the name indicates, it’s a study of the search logs from three different kinds of devices – computers, iPhones and mobile phones. The intention of this paper, as it appears to me, is to understand searcher behavior on different devices and also to suggest some ideas to improve the mobile searching experience. This is the first study of its kind, comparing these 3 platforms in a single place – interesting! The details of the studies listed in related work also made me wonder: what will you learn by knowing the average query length? What will one do with that?

Hmm, they conducted this study over a period of time, with different users on different devices who used Google search, and drew some conclusions from their research:
1. Query length is kinda similar on a computer and an iPhone, but shorter on a general mobile phone
(At least to me, it appeared obvious, because of the difficulty in typing on a mobile phone keypad. However, no issues. They worked hard and proved it.)
2. The distribution of query categories on a mobile phone was less diverse compared to the other two, which were more or less similar.
3. The third one, though it again might sound obvious, interested me. Comparatively, local search queries were about equally common on computers and iPhones, and clearly less common there than on conventional mobiles. The explanation for fewer local queries on the iPhone was stated as: “users search for local content within an application that can provide a richer experience, if it is available”. It points to the iPhone’s Maps application in this context.
4. iPhone searchers had the most diversity in information per user, while conventional mobile searchers had the least.
(I thought computers would have the most diversity and mobiles the least)
5. Computer users had the highest number of queries per session, followed by iPhone users and conventional mobile users (intuitively obvious)
6. Frequent computer-based searchers had a much higher rate of return than frequent iPhone or mobile phone searchers – which says that mobile search is still a secondary mode of search (Well, speaking entirely from an Indian mobile user’s perspective, that’s obvious. Surprised to see that it’s true in general too).
7. Adult content search is less common on mobiles.
- As I read through these conclusions, I was getting irritated. If you understand the inherent difficulty of text entry on mobiles vs computers, most of these conclusions appear obvious. Since the iPhone has a QWERTY keyboard, perhaps it might do better and land in the middle. Apart from the desire to convert the “mights” into “is”, I was feeling like – do we really need a WWW paper, especially from Google, to do this?

Once this is done, there came the suggestions part, to improve user experience, which finally made me feel I did not totally waste my time reading this.

1. For conventional mobiles, it was suggested to exploit the low diversity in queries and provide quick fetching of likely queries, thereby increasing the target rate. Personalizing this process to the user’s interests might give a better experience too.
2. Since the search patterns on high-end mobiles like the iPhone are similar to those on computers, attempts can be made to integrate the search experiences on both these platforms (Eg: “content that was searched for on a computer should be easily accessible through mobile search” etc). Also, info from computer-based search studies can be applied to the iPhone and the like, to provide a better experience on them.

I don’t know. Something’s wrong with me. Either I have become too caustic and pessimistic about usability studies or I am just not trying to see the merit. I feel that these are the only two points worth noticing in that 10-page paper.

The paper can be read here.

Published on May 13, 2009 at 11:44 am

Microsoft Surface

“Microsoft Surface (Codename: Milan), is a multi-touch product from Microsoft which is developed as a software and hardware combination technology that allows a user, or multiple users, to manipulate digital content by the use of natural motions, hand gestures, or physical objects”

- Is how Wikipedia defines “Microsoft Surface”.

“Microsoft Surface turns an ordinary tabletop into a vibrant, interactive surface. It’s the first commercially-available surface computing platform from Microsoft. The product provides effortless access to digital content through natural gestures, touch and physical objects.”
- Is how the “Surface” team describes it.

Whatever might be the definitions, I can only say one thing – it’s just amazing. I got a chance to play with it a couple of weeks back and it was love at first sight. Alas, there is no second sight yet. So, I can only bask in the past glory :)

So, why did I like it? Well, firstly, I was enchanted by the multi-user, multi-touch interface. Next, it was the sheer beauty of the product. I was moving a kid’s image with my hand and the kid was jumping, walking and running according to my hand movement. This was in one of the applications meant for children to create videos, perhaps. But isn’t it equally attractive to adults? Oh, and I was moving my hand around an image on screen and the picture stretched and shrank as per my hand! See some of the demo videos over here and you can understand what “Surface” can do :)

There is such a lot that one can do with this product. I wonder how it will be if this becomes accessible to everyone. Perhaps it will drastically change the way people use and interact with a computer. Of course, it’s not available to individual customers. Its target audience is commercial sites like select AT&T retail locations, the iBar located in the Rio All Suite Hotel, a casino in Las Vegas, select Sheratons in the U.S., the Disney Innovations House in Anaheim, California and Hotel 1000 in Seattle etc.

Published on April 29, 2009 at 12:17 pm

A CAPTCHA based on image orientation

I think I reached this WWW-09 paper (What’s Up CAPTCHA? A CAPTCHA Based On Image Orientation) through one of those Google research blogs. Anyways, how I got it or how I read it is immaterial at the moment. What is making me write this post is the general idea of the paper. It’s so simple and pleasing to hear – of course, I am not belittling anybody’s imagination and technological competence. I am just saying that once it’s published, the idea appears like – “Oh! That’s so obvious! Why didn’t anybody try this before Google?”. As usual, eh? ;)

So, it is about using image-based CAPTCHAs instead of text-based ones. Language independence and freedom from text entry are stated as the conveniences. Of course, to me an image-based captcha is more appealing and interesting – at least to imagine. The motivation? Spam bots are becoming over-intelligent at understanding text captchas. To minimize their effects, systems are complicating text captchas by twisting letters, increasing noise in the captcha etc, to make the job difficult for machines. Result – they became irritating to humans as well (I was one of those who got so irritated seeing some of those new generation captchas). Hence, CAPTCHAs got a new direction in trying image orientation.

The three basic tenets of a captcha are – being easy for humans to solve, hard for machines to solve, and easy to generate and evaluate. Now, with these in mind, the problem is to work on image captchas. In the case of image captchas, choosing the right set of images is again an issue, since some images can be oriented by machines easily (Eg: using face recognition etc). Some cannot be properly oriented even by humans, since there’s no way of judging which one is the right orientation (Eg: abstract art).

The whole process can be summarized as:
1. Firstly, having a system that can detect and understand image orientation.
2. Then, picking only a selective set of images, by removing the computer-detectable images from the list.
3. Removing even those images which are difficult for humans to orient.

Now that your system is ready – this is how it’s used.
1. You give an image for the user to orient.
2. Like text captchas, you give him access to next level only after he has oriented it to the right direction.
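
The checking step can be sketched in a few lines. This is my own toy version, not Google's implementation: the 15-degree tolerance and the function names are assumptions, and the only subtle part is that angles wrap around at 360.

```python
def angular_error(angle_a: float, angle_b: float) -> float:
    """Smallest difference between two angles, in degrees (0 to 180)."""
    diff = abs(angle_a - angle_b) % 360
    return min(diff, 360 - diff)


def passes_captcha(user_rotation: float, correct_rotation: float,
                   tolerance: float = 15.0) -> bool:
    """Accept the answer if it is within `tolerance` degrees of upright."""
    return angular_error(user_rotation, correct_rotation) <= tolerance


print(passes_captcha(355, 5))  # True: only 10 degrees apart across the wrap
print(passes_captcha(90, 0))   # False: image left on its side
```

Allowing some tolerance matters because humans will rarely hit the exact upright angle.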

Google conducted 2 usability studies regarding this – A viability study and a happiness study, whose purposes can be understood by their names.

I felt it’s an interesting social experiment and worth sharing my thoughts about on my blog. Waiting to see it live soon… :)

Details of the paper can be seen here.

Published on April 22, 2009 at 10:59 am

Darwinism and “Anti-Darwin”ism

Yes, the quotes there have some significance. By Darwinism, I meant what Darwin has proposed. By “Anti-Darwin”ism, I meant the act of being Anti-Darwin. That is, the second one is not exactly criticizing Darwinism. It’s criticizing Darwin. No, it’s not a debate. I happened to see two documentaries yesterday.
1. Was Darwin wrong? – a National Geographic video.
2. Disasters Darwinism Brought to humanity
– It was interesting to see the two different poles back to back. This helped me immediately see the stark contrast in the way the two teams approached conveying their thoughts.

Let me tell you about the first video: Going by the title, I thought it was Anti-Evolutionary or Creationist. Well, let me confess, I have a strong sugary taste for Darwinism, though Creationism appeals too, albeit very rarely. Despite that, imagining that I would get to know the creationist viewpoint, I chose to see the video. However, open sesame, it proved to be a pro-evolutionary theory video! The documentary makers first took the three basic principles of Darwin’s theory in “Origin of Species” and proved their validity, at least theoretically. They then went on to counter some of the arguments that Anti-Evolutionists cite against the “Theory of Evolution”. I felt they did a nice job of it, mainly because they focussed on proving the validity of Darwin’s theory rather than lambasting the opposing theorists.

The second one, ah, disappointed me a lot. Though I expected a severe criticism of Darwinism, it was too irrational for me to understand. It blamed Darwin and Darwin’s theory for everything from Colonialism to Communism. Well, perhaps the makers might have been right in pointing out that the concepts of the “White man’s burden”, “superiority of the White civilization” etc gained support from Darwin’s “survival of the fittest” concepts. But I did not understand the point in blaming Darwin and Darwinism for this. It was misinterpreted by those who misused it. I see no difference between this and blaming Communism because a bunch of leaders misused it. (Actually, while watching another video called “Bloody History of communism”, the positive anti-communist feeling that was created in me in the first half gave way to a more sympathetic reaction in the second half, mainly because of this very reason – mere rebuke of Communism and communists without enough rational backup.)

I was wondering why people lose credibility, despite the fact that they researched a lot, just by beginning to criticize others. Criticism isn’t as bad as it appears to be. But irrational criticism is. I was trying to be balanced in my opinion by watching both the pro and the anti takes on the same topic. Looks like I am going to be biased just because one of the takes is too desperate to prove that the other is wrong, instead of showing that it is right……

Published on March 12, 2009 at 3:07 pm