I know, I am going beyond the scope I originally intended for this blog when I began it. But three years is a long time and thoughts run wild. 🙂 I know, perhaps I am going beyond the scope of my readers’ expectations too. But this is my only mouthpiece 🙂
I heard the name “Kneser-Ney Smoothing” the other day and got curious to know what it is. (To understand what smoothing means, go to the wiki page here.) When I did my introductory course in Natural Language Processing (in late 2006), I came to know about smoothing for statistical language modelling. Back then, as far as I remember, we read about only three things:
1. Add one smoothing
2. Turing estimation
3. Witten-Bell smoothing
– I of course don’t remember anything but the names, though :p
Okay, in my over-curiosity to know more about Kneser-Ney smoothing, I found this presentation on smoothing techniques in NLP by Bill MacCartney. Since I don’t need anything specific from it at the moment, I just browsed through the PPT to get the gist of it. Here’s what I learnt about the different smoothing mechanisms:
1. Add-one smoothing – the simplest possible and, as the textbooks say, the least effective one.
2. There is something called Additive smoothing, which, from my understanding, is similar to add-one smoothing, except that instead of adding 1 you add some number between 0 and 1.
3. Good-Turing estimation – which uses the number of n-grams that occur (r+1) times in the training data to re-estimate the probability mass of n-grams that occur r times. [If you are more curious and need an example, see the PPT slides. If you are extra curious, go ahead and visit the wiki page.]
4. Jelinek-Mercer smoothing – a kind of interpolative method. Quoting from the PPT: “nth-order smoothed model is defined recursively as a linear interpolation between the nth-order ML model and the (n − 1)th-order smoothed model.” I used a similar idea some time back, without realizing that there’s a name for it! 🙂
5. Katz Smoothing – This can best be explained by using the words from the PPT again:
“Count mass subtracted from nonzero counts is redistributed among the zero-count bigrams according to next lower-order distribution”
6. Witten-Bell Smoothing – an instance of Jelinek-Mercer smoothing, with the interpolation weight chosen based on the number of distinct word types seen in the context.
7. Absolute Discounting – from what I understood, this is a kind of hybrid between interpolation-style smoothing like Witten-Bell and discounting-style smoothing like Katz.
8. Kneser-Ney Smoothing – the one for which this search began! It was described as an extension of absolute discounting. Anyway, after reading all these, I actually forgot that my search began with this name!
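The first two items in the list are simple enough to sketch in a few lines of Python. This is my own illustrative code, not from the PPT; the function name and the toy corpus are made up:

```python
from collections import Counter

def add_k_probs(counts, vocab, k=1.0):
    """Additive (add-k) smoothed unigram probabilities.

    k=1.0 gives classic add-one (Laplace) smoothing; 0 < k < 1 gives
    the gentler additive variant. Every word in `vocab` gets a
    nonzero probability, even if it was never seen in training."""
    total = sum(counts.values())
    v = len(vocab)
    return {w: (counts[w] + k) / (total + k * v) for w in vocab}

counts = Counter("the cat sat on the mat".split())
vocab = set(counts) | {"dog"}          # "dog" never appears in training
probs = add_k_probs(counts, vocab, k=1.0)
# the unseen word now gets (0 + 1) / (6 + 1*6) = 1/12 instead of zero
```

The whole trick is just pretending every vocabulary word was seen k extra times; the denominator grows by k times the vocabulary size, so the probabilities still sum to 1.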
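And since the whole search began with Kneser-Ney, here is a minimal sketch of how interpolated Kneser-Ney might look for bigrams. Again, my own illustrative Python, not from the presentation; the discount of 0.75 is just a commonly used default, and the point to notice is that the lower-order distribution uses “continuation counts” (in how many distinct contexts a word appears) rather than raw frequencies:

```python
from collections import Counter

def kneser_ney_bigram(tokens, discount=0.75):
    """Build an interpolated Kneser-Ney bigram model from a token list.

    Returns a function prob(word, prev) giving P(word | prev)."""
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    # count each token only when it serves as a bigram history
    history_counts = Counter(tokens[:-1])
    bigram_types = set(bigrams)
    # continuation count: in how many distinct contexts does w appear?
    continuation = Counter(w for (_, w) in bigram_types)
    # how many distinct words follow each history?
    followers = Counter(prev for (prev, _) in bigram_types)
    total_types = len(bigram_types)

    def prob(word, prev):
        p_cont = continuation[word] / total_types
        c_hist = history_counts[prev]
        if c_hist == 0:
            # unseen history: fall back to the continuation distribution
            return p_cont
        # subtract a fixed discount from the observed bigram count...
        discounted = max(bigram_counts[(prev, word)] - discount, 0) / c_hist
        # ...and hand the freed-up mass to the continuation distribution
        lam = discount * followers[prev] / c_hist
        return discounted + lam * p_cont

    return prob

tokens = "the cat sat on the mat the cat ran".split()
p = kneser_ney_bigram(tokens)
# "cat" follows "the" twice, so it dominates; "mat" still gets some mass
```

The continuation counts are what separate this from plain absolute discounting: the textbook example is a word like “Francisco”, which may be frequent overall but follows essentially only one context (“San”), so it deserves very little backoff mass.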
There were also some comments comparing and contrasting interpolation-style smoothing with discounting-style smoothing.
On the whole, educative. It would have been interesting if I were actually a student of this course now and were asked to implement all of these programmatically and write a report with a performance analysis 🙂