Jurafsky’s Introduction to NLP

Well, it’s been a considerable break from blogging for a compulsive blogger like me ;) … not that I don’t have anything interesting to say, it’s just that I lost interest in online activities. What prompted me to write today was Prof Jurafsky’s 15-minute talk introducing Natural Language Processing.

I have been having a feast of sorts since October 2011, starting with Prof Andrew Ng’s Machine Learning course. Now, I am spoilt for choice, with the advent of Udacity and Coursera. Amidst all these choices, I wasn’t really keen on attending the “Introduction to NLP” class, despite the fact that it’s being taught by people like Prof Jurafsky and Prof Manning!

However, a friend’s mail made me visit the course website today, and I wanted to see the introductory lecture. I am just fascinated by the way Prof Jurafsky introduced NLP to the world (…of the site visitors!). From the moment I started the video till now, I must have said “Wow! What an intro!” several times, to several people, along with a “please watch this video” request. If someone introduces a subject in this manner, I will perhaps be tempted to take that course, whatever it is! Under Jurafsky’s magic, I, who had decided not to take this course, listened to all the lectures uploaded on the site so far!!

Irrespective of your background, I would strongly recommend this video to people who are curious about:
1) What does Natural Language Processing mean?
2) Where is it used?
3) (For Telugu readers) What is it that I tried to explain last year, in this ?? ;) (Yeah, shameless inter-linking!)

e-education rox!

(Okay, since it’s officially downloadable, I’ve uploaded this video that floored me to Dropbox, so that anyone who doesn’t have a login on the course website can also view it. Here it is. If this uploading is a crime, let me know…I will un-crime myself!)

Published in: on March 9, 2012 at 9:06 pm  Comments (1)  

Online Readings-4

(Disclaimer: This is dedicated to some Technology stuff I read and found interesting over the past two weeks..)

A couple of weeks back, I saw this blog post called “The evolution of search in six minutes”, with a video attached. To me, it is a very well-made, down-to-earth post on, well, the evolution of search as we use it now…and how it might be in the future. I feel it’s a must-watch for everyone, irrespective of their search-technology awareness.

And yes, although I cannot understand a bit about “Quantum Computing”, this article on “Machine Learning with quantum algorithms”, on the Google Research blog, made an interesting read. Maybe, at some point of time, if I can manage to understand the science behind it, I might try to put it forth in a more common-manish way :P . Thanks to Praneeth for sharing this.

For a couple of days in the past fortnight, I was addicted to the idea of “Forensic Linguistics”. I was searching for some applications of linguistics (rather…Applied Linguistics) and this video, although not “extraordinarily” shot, made me more curious about the subject. I even managed to get a book introducing the concepts of Forensic Linguistics (online)…but it’s so easy to drift when you don’t know so many things ;) :P

Although I was fascinated by “Wolfram Alpha” when it was first announced, I soon drifted…well…I just told you why. But again, this particular blog post on the Wolfram Alpha site, on handling/solving permutation-related queries, fascinated me a lot :)

And then, I work on applying Language Technologies for Language Learning. I found something similar, but not related to “Language”, on the Communications of the ACM pages today. It is titled “Massive scale data mining for Education”. It talks about using massive math-query logs of students to predict clusters of difficult math problems, ways to track the progress of students etc. Yes, like they say, if it works, it will really be student modeling on a massive scale. I loved the way the article ended. “We experiment, try different students on different problems, discover which exercises cause similar difficulties, and which help students break out of those difficulties. We learn paths in the data and models of the students. We learn to teach.” – It’d be interesting to see how it goes! :)

Back to CACM again, this piece on the duality of defining computing as an “instrument for human mind” (of a human mind and by a human mind too?? :P )… and connecting it to Steve Jobs and Dennis Ritchie ….was interesting too.

Lastly, again on CACM, there was this extremely insightful and well-written article on “Natural” Search User Interfaces by Marti Hearst. I just loved reading it and would recommend it to everyone interested in “Search”. It’s a bit long…but surely worth it, IMHO.

Jai Hind, for now!!

Published in: on December 20, 2011 at 1:43 am  Comments (2)  

On Solr

First things first. What is Solr?

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.”
- says the Apache Solr website. I can’t explain it better than this, anyway.

I first used Solr in Jan 2010 and have been using it at various stages of my professional as well as student life ever since. To say the least, it’s amazing! :) I was just browsing through this book, “Solr 1.4 Enterprise Search Server” (by David Smiley and Eric Pugh), today. Oh, well, I should ideally have read it in 2010. But I found it now. There was one more which I found a couple of weeks ago, “Apache Solr 3.1 Cookbook” by Rafal Kuc. I thought it would be good to put together the set of related links that I found useful over a period of time, so that others like me, who want to begin with Solr, might find them useful too. There might be N such posts on the web. But this is partly for my own log too!

1) I remember starting at this presentation on “What is Solr?” by Tom Hill (courtesy: my Google bookmarks). It might be a bit dated now, though.

2) This article by Grant Ingersoll, at IBM developerWorks, was another useful one, back then.

3) The Solr Wiki is the best source for everything, of course. (http://wiki.apache.org/solr)

4) “Seven Deadly Sins of Solr”, on the Lucid Imagination pages, was another one which helped me a lot during the initial days. In general, Lucid Imagination has some really good articles on Solr, for users at all stages of Solr experience.

5) In my view, setting up Solr in Eclipse with Tomcat is a really irritating issue. Well, it’s not difficult. It’s straightforward. But I don’t (perhaps won’t) understand Tomcat’s behavior in Eclipse, personally. This blogpost gives an overview of what to do to make a test Solr server work in your Eclipse.

6) I was wondering about the crawling part of Solr. I always used external crawlers (Heritrix, Wget and even 80legs etc.) to get the required data. But there seems to be a decent number of options which I have not tried out yet. Some of them:
a) Using Nutch with Solr – a Lucid Imagination article (I did play around with Nutch a while ago, but I did not know about Solr back then!)
b) Youseer, which uses Heritrix to crawl and Solr to index.
c) Crawl-Anywhere, another thing that I read about recently, which uses its own crawler with a Solr indexer.

7) Here is a good example of reading Solr results from SolrJ, a Java library for Solr. I particularly cite this because I had a tough time finding an example of using SolrJ to read Solr results :) . Mats Lindh’s post was also a very useful one for using SolrJ.

The best part of this is, I need not read Solr results in Java alone. I can develop my webapp in any language I want and Solr supports it! (Well, I know that people use Java, Ruby and Python. Since Solr supports JSON, I guess it can be used from other languages too, perhaps.)
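To make the “any language” point concrete, here is a minimal Python sketch of asking Solr for JSON over plain HTTP and reading the matching documents. The host, port and the `title` field are assumptions for illustration (your schema and URL will differ), and the sample response is hand-written in the shape Solr’s JSON writer returns rather than taken from a real server.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for a JSON response (wt=json)."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url}/select?{params}"


def extract_docs(raw_json):
    """Pull the list of matching documents out of a Solr JSON response body."""
    body = json.loads(raw_json)
    return body["response"]["docs"]


# Hypothetical local Solr instance. To actually fetch, something like
#   urllib.request.urlopen(url).read().decode("utf-8")
# would return the response body as a string.
url = build_select_url("http://localhost:8983/solr", "title:solr")

# Hand-written sample in the usual response shape, so the sketch runs offline.
sample = (
    '{"responseHeader": {"status": 0}, '
    '"response": {"numFound": 1, "start": 0, '
    '"docs": [{"id": "doc1", "title": "On Solr"}]}}'
)
docs = extract_docs(sample)
print(url)
print(docs[0]["title"])
```

Nothing here is Solr-specific on the client side, which is really the point: any language with an HTTP client and a JSON parser can do the same.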

8) The best place to go for any query about Solr is the Solr users group and its archives :) It’s a very friendly group, where you can always be assured that you will get a response.

9) I tried using the Solr-UIMA integration, but it did not work with my UIMA annotators. So, I left it there and managed to do the UIMA stuff before sending the doc to Solr :) . It might look like “running away from the problem”, but it works.

10) Time and again, I am amazed at the new stuff I discover about Solr. The books I mentioned above are really good sources of information. You can perhaps manage to find pirated PDFs online. I don’t want to give links myself, here. :)

11) Oh, I did mention links on setting up Solr, knowing more about Solr, and accessing Solr’s search results using SolrJ. But I did not mention indexing a set of documents. I started with some example code by Grant Ingersoll. I can’t place that right now, but I found this Solr-user group mail, which can serve as an example. Here is one more from Lucid Imagination, which uses the Tika parser.
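For indexing without SolrJ, one option is Solr’s XML update message: an `<add>` element wrapping one `<doc>` per document, posted to the update handler and followed by a commit. The sketch below only builds that message; the field names `id` and `title` are assumptions and must match whatever your schema defines.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build a Solr XML <add> message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; field names depend on your schema.
message = build_add_message([{"id": "doc1", "title": "On Solr"}])
print(message)
```

The resulting string would be sent as the body of an HTTP POST to the update handler with an XML content type, and a `<commit/>` message afterwards makes the documents searchable.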

Happy Solr-ing.
(Hoping that someday, someone like me, will find this post useful! Perhaps, this is an attempt to give back to the community! :P )

Published in: on August 31, 2011 at 6:54 pm  Comments (2)  

Readings: Amazon Mechanical Turk – Gold mine or Coal mine?

First things first:
Title of the article: Amazon Mechanical Turk: Gold Mine or Coal Mine?
Authors: Karen Fort, Gilles Adda, K.Bretonnel Cohen
In: Computational Linguistics Journal, June 2011, Vol. 37, No. 2, Pages 413-420

In the past 2 months or so, I guess I read this article 2-3 times. I still don’t understand why it was written :( Oh, the incomprehensibility was not the reason why I read it again, though. I found it multiple times, lying on my table..for various reasons (which include piling up of stuff you never read!)

Coming to the point:
1) I never used Amazon Mechanical Turk, neither as a requester nor as a turker. But I don’t feel that there is something wrong with the approach. It’s up to the turkers to do or not do a given task. So, I don’t think there should be a question of ethics here. If there is anything, there should be a question about the quality.

2) If some turkers use it to meet their basic needs, (IMHO) it’s not the wrong-doing of Amazon or the requesters.

So, if the concern in the article were about the quality of the linguistic resources developed, perhaps it might not have sounded so abnormal to me. But the issue was the “working conditions” of the AMT “workers”. Whatever way people are using it (as a hobby, timepass, for some pocket money, to meet living expenses etc.), it’s people who choose to do that…and neither Amazon nor the task givers on AMT promise employment, right?

And hence the confusion… :)
Of course, now, I won’t do a re-reading of the article! :P

Published in: on July 1, 2011 at 1:58 pm  Leave a Comment  

Automatic Adaptation of Dynamic Second language reading texts

Title: Automatic Adaptation of Dynamic second language reading texts
Author: Michael Walmsley

The tool can be accessed here.

Let me tell you, I am not reviewing the paper here. There are two reasons I started posting this.
1. I liked the FlaxReader tool, and began believing since yesterday that it will help in my German lessons.
2. Some of the statements and conclusions in this paper – attracted, amused, surprised (not all these emotions are for one statement!) me. So, thought I’d post them here.

Firstly, this paper begins with the idea that “Extensive Reading” (ER) is an effective Language Acquisition strategy. So the paper discusses providing reading material and automatically adapting it for computer-adapted ER. The “This research aims to make improvements in how people can learn languages” part attracted me the most, though I read this paper for professional reasons, and not as a beginner German learner.

As they say in Telugu – Swami Kaaryam, Swakaaryam…. :P

So, what are those statements that caught my attention?

* “Nation[3] states that students need to read close to 500,000 words per year, which is equivalent to one and a half substantial first year text books or six novels.”
(Nation[3] here is… Nation I.S.P, 2009, Teaching ESL/EFL reading and writing.)
- I did not go to that reference to see how they reached those conclusions (yet!)…but it was interesting to note that. I wonder if my vocab, especially in Telugu, shows any improvement, despite the fact that I read many Telugu books during a year! :P

* “Dictionaries should only be used for frequently occurring unknown words that prevent overall understanding. Literature suggests that comprehension is difficult when more than 5% of the words in a text are unfamiliar to a learner.”
(Again from Nation[3])
-Well, I am wondering how the stats were arrived at. Nevertheless, it’s also an interesting fact to know.
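The 5% figure is easy to turn into a quick check. Here is a minimal sketch that computes the share of running words in a text that are missing from a learner’s known-word list; the tokenizer and the tiny word list are assumptions for illustration, not anything taken from the paper.

```python
import re


def unfamiliar_ratio(text, known_words):
    """Fraction of running words in `text` that are not in `known_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    unknown = [t for t in tokens if t not in known_words]
    return len(unknown) / len(tokens)


# Hypothetical learner vocabulary and sample sentence.
known = {"the", "cat", "sat", "on", "mat"}
ratio = unfamiliar_ratio("The cat sat on the ottoman", known)

# The threshold quoted above: more than 5% unfamiliar words is "difficult".
too_hard = ratio > 0.05
print(round(ratio, 3), too_hard)
```

One unknown word out of six already puts this sentence far above the threshold, which gives a feel for how strict 5% is.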

There was this statement, which is so true…but at the same time, its structure made me ROFL:
“Elaboration is a text simplification technique”

“Vocabulary researchers have proposed that English learners must minimally know the most frequent 3000 to 5000 word families in order to effectively read authentic English text.”[24]
(The [24] is Cobb T, 2007, Computing the vocabulary demands of L2 reading)

“Learners can acquire new L2 words and strengthen existing vocabulary knowledge, by reading L1 texts with some words replaced by L2 equivalents” (L2 here is the language you are learning and L1 is the one you are good at).
-This particular statement summarizes what the FlaxReader tool does. I don’t know how the learnability is evaluated…but I guess this will be a good exercise for learners (like me!)
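The basic idea of replacing some L1 words with L2 equivalents can be sketched in a few lines. This is only a toy illustration with a hand-made English-to-German glossary (an assumption); it says nothing about how FlaxReader actually chooses which words to replace.

```python
import re


def mix_in_l2(text, glossary):
    """Replace L1 words that appear in `glossary` with their L2 equivalents."""
    def swap(match):
        word = match.group(0)
        # Look up case-insensitively; keep the original word if not listed.
        return glossary.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)


# Hypothetical glossary: L1 = English, L2 = German.
glossary = {"house": "Haus", "dog": "Hund"}
mixed = mix_in_l2("The dog sleeps in the house.", glossary)
print(mixed)
```

The interesting part, which the rest of the paper is about, is deciding which words go into that glossary for a given learner and text.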

The rest of the paper discusses which L1 words should be replaced by L2 equivalents in a chosen text, how, what aids can be used, etc.

And then, there are others in the references that attracted me. Maybe I will read some of them sometime soon….

Published in: on May 25, 2011 at 2:10 pm  Leave a Comment  

Readability and Indian Language texts

I got interested in computing the readability of texts recently. I have been suffering from a reader’s block for the past three weeks or so, so I was not able to read much on this subject, though I skimmed through quite a bit of material. Well, I did find some work on creating language models to estimate text difficulty. But most of the stuff I found was based on conventional readability measures, which must have had their golden jubilee too, in the past decade.
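As an example of such a conventional measure, here is a minimal sketch of the Flesch Reading Ease score, which combines average sentence length and average syllables per word. The formula is the standard one; the syllable counter is a crude vowel-group heuristic of my own (an assumption), so scores will differ from tools that use a pronunciation dictionary.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("need at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


score = flesch_reading_ease("The cat sat on the mat. It was happy.")
print(round(score, 2))
```

Note how much of this is tied to English: the regular expressions only see Latin letters and English-style vowels, so the sketch cannot even tokenize a text in Telugu script, let alone score it.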

Then, I came across this decade old paper:
“Living off the land: The web as a source of practice texts for learners of less prevalent languages”

This was on finding out right texts from the web, to provide learning material for second language learners of Nordic Languages. I was fascinated by the official title of the project described in the paper: “Corpus based language technology for computer assisted learning of Nordic languages”.

So, what is this about?
To summarize briefly, here is what they do:
1. The user supplies example text, in the form of a URL.
2. The text at this URL is evaluated for readability and other language factors.
3. Along with these statistics, the user is also presented with ten possible query terms from that document.
4. The user can choose which query terms are sent to the search engine (Evreka).
5. Again, each of the result documents is evaluated for the statistics of (2), and the user is presented with the results, along with a brief summary of each result document.
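Step 3 above, suggesting query terms from the example document, could be approximated very simply by picking the most frequent non-stopwords. This is only a guess at the flavour of the step; the paper’s actual term selection method is not described here, and the stopword list is a tiny hand-made assumption.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller,
# language-specific one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "for"}


def candidate_query_terms(text, limit=10):
    """Return up to `limit` frequent content words as candidate query terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(limit)]


sample_text = ("The crawler fetches pages and the crawler indexes pages "
               "for the crawler search.")
terms = candidate_query_terms(sample_text, limit=2)
print(terms)
```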

Well, I don’t understand the motive behind asking the user to give a URL and then doing all this. However, that’s not why I am writing this post.

I was left wondering (sitting on the banks of the Neckar, with pigeons playing in front of me … in big numbers) about the readability of Indian language texts. Has anyone attempted experimenting with that? Will the traditional readability measures that are used for English work well for Indian languages? Is it necessary to think about Computer Assisted Language Learning for, say, Telugu? Can Telugu be called an LPL (Less Privileged Language)?

Well, “What is readability?” “How can someone measure general readability, isn’t there a personalized angle there?” – are my perpetual questions, though. They are not specific to Indian Languages.

However, at least if you let the imagination run high, there is very interesting scope to use this practically in the language teaching domain (IMHO). Perhaps I’d write better if I could get over my reader’s block :)

Published in: on April 19, 2011 at 7:00 am  Leave a Comment  

On Practical Web Crawling issues

Practical Web Crawling Issues by Carlos Castillo.

I don’t remember how I came to this document. I think I was looking for some website-copier-like crawlers, which I could use to run crawls to get a few 100s or 1000s of webpages from some seed websites, and Google showed me this.

As I browsed through this, it appeared to me as well written and comprehensive despite the fact that the issues are kind of obvious when you think of a crawler. I was never involved in designing a crawler, though I used a couple of crawlers over a period of time. To me, it provided a good overview. Once I began to write about this, I tried to find out who the author is, and here he is :-)

“Carlos Castillo is a research scientist at Yahoo! Research, Barcelona. He received his Ph.D from the University of Chile in 2004, and was a visiting scientist at Universitat Pompeu Fabra (2005) and Universita di Roma La Sapienza (2006). At age 33 (2010), he is an active researcher with over 30 publications in international conferences and journals in the areas of Web search and mining. He has served in the program committee of most major conferences in his area (WWW, WSDM, SIGIR, CIKM, etc.), and co-organized the Adversarial Information Retrieval Workshop and Web Spam Challenge in 2007 and 2008, and the ECML/PKDD Discovery Challenge 2010. His current research interest is the mining of content, links and usage data from the Web.”
-That bio says it all, about the credentials of the document :P

Coming to the point, this is a brief ‘executive’ summary of the document.

Practical web crawling issues can be broadly classified into six categories:
1) Networking issues (variable network quality, server admins’ concerns, inconsistent firewall settings)
2) Massive DNS resolving (crashing of local DNS servers, temporary DNS failures, malformed DNS records, wrong DNS records, use of the WWW prefix)
3) HTTP implementations (Accept headers not honored, range errors, responses lacking headers, “Found” where you mean “error”, wrong dates in headers)
4) HTML coding (wrong markup, physical over logical content representation)
5) Web content characteristics (blogging, mailing lists and forums, duplicate detection)
6) Server application programming (embedded session IDs, repeated path components, slower/erroneous pages)
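A few of these issues (the WWW prefix, embedded session IDs, repeated path components) come down to URL clean-up before a page is queued for download. Here is a minimal sketch of that kind of clean-up. The session parameter names and the repeat threshold are assumptions of mine, not recommendations from the document, and dropping “www.” is a heuristic that can be wrong for sites where the two hosts really differ.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of query parameters that carry session IDs.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def canonicalize(url):
    """Lowercase the host, drop a leading 'www.', and strip session params."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/",
                       urlencode(query), ""))


def has_repeated_path(url, max_repeats=2):
    """Flag URLs like /a/b/a/b/a/b/ where a path segment keeps recurring."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(segments.count(s) > max_repeats for s in set(segments))


clean = canonicalize("http://WWW.Example.com/page?PHPSESSID=abc123&id=7")
looping = has_repeated_path("http://example.com/a/b/a/b/a/b/")
print(clean, looping)
```

A crawler would typically apply `canonicalize` before checking whether a URL has been seen, and skip URLs for which `has_repeated_path` is true.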

The best part about this document is that it is not an enumeration. It also suggests some solutions wherever possible. Most of us need not write a crawler at all, even when we need one. However, if we use a crawler, I think it helps to understand some of these issues. Taking an extreme stance, I would say that every search engine user ought to know more about crawling than what he/she knows now. I would not say that this is the only document, but, this is a good one to know more, for sure.

I think I would read more from this thesis soon, whenever I feel compelled to do so ;)

Published in: on November 12, 2010 at 8:00 am  Comments (2)  

Finding replicated web collections (1999)

I have got into reading papers that are more than a decade old. I do get a feeling of ‘oldness’ as I read these things. But these papers provide good overviews of the problems in web information retrieval and possible application scenarios for the solutions.

(By reading – I mean, a careful reading and understanding of the overall intentions of the paper and not analysing and validating its math etc. I have only a generic interest in the subject)

Finding replicated web collections is a 25-page paper (:O) from Stanford-Google, by Junghoo Cho, Narayanan Shivakumar and Hector Garcia-Molina.

In brief: This paper, as the name suggests, deals with finding replicated documents on the web. The approaches proposed are scalable to tens of millions of pages of data.

The intentions in finding replicated documents are manifold. As the ‘introduction’ section summarizes, some web search engine tasks can be performed more effectively by finding out replications.

1) A crawler’s job can be finished quickly by making it skip replicated content. This also saves a lot of bandwidth.

2) A page that has many copies is perhaps more important than others. So, search results can be ranked by this factor.

3) For archiving, pages with more duplicates can be given priority in case there are storage constraints, since they are ‘important’ and hence are of more archival value.

4) I did not quite get this point, but here it is in their own words: “cache holds web pages that are frequently accessed by some organization. Again knowledge about collection replication can be used to save space. Furthermore, caching collections as a whole may help improve hit ratios (e.g., if a user is accessing a few pages in the LDP, chances are he will also access others in the collection).”

The 3 main issues involved in this are:
1) Defining a notion of ‘relativeness’

2) Finding methods to identify such related documents

3) Exploiting this replication

Major difficulties in finding out the replicated documents are:
1) Sometimes, the replications are updated at different points of time. During the crawl time, they might not look similar because of this.

2) Sometimes, mirror collections are only partial copies.

3) The data formats of mirrors and original may be different

4) The snapshots compared may be of partial crawls.

What is similarity anyways? How should we define it?
A major contribution of this paper is that it defines a similarity measure and uses this similarity information to improve crawl quality and efficiency.

A major part of the rest of the paper goes into defining, calculating and evaluating various similarity measures. Though this was what I wanted to read when I began this paper, other things mentioned here began interesting me and I skipped this part! After this section, they describe how they applied this to crawling and searching, and it was its application in crawling that I got interested in, over the course of reading this paper.

Well, Google’s crawler back then had only 25 million pages, it seems (!! seems like a prehistoric age !!). Taking this crawled data, they analyzed it for the amount of duplicate content, and found that almost 48% of the crawl had duplicate content (‘duplication’ here does not mean 100% duplication). Then, using this data, they precoded information on widely replicated collections and machines with multiple names into Google’s crawler and repeated the crawl. The duplication now was only 13%, despite the fact that the crawl size now was 35 million pages!
[Interesting.... and I can see a sci-fi fantasy where duplicates, which are creating confusion and havoc in the bit world are being checked by the geeks - a war ensues and finally, for now geeks won, and the duplicate bits vow for a revenge.. part-2 continues! :P ]
-Likewise, each crawl makes a collection of information on replicated content, which won’t be crawled in future.

Apart from the above mentioned paragraph, I found the whole of the introductory sections extremely readable.

Reading ‘History’ is good, anyways! :-)

Published in: on November 10, 2010 at 8:31 am  Comments (1)  

Ah, these duplicates!

I never imagined that a 1997 Technical Report of Digital Equipment Corporation – will remind me of my nightmares :-)

Well, here is the report: Syntactic Clustering of the Web.
Authors: Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig.
Group: Digital’s Systems Research Center.

Here is the abstract:
“We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a “Lost and Found” service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.”
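As far as I understand it, the “syntactic similarity” in the report rests on shingles: every run of w consecutive words in a document is one shingle, and two documents resemble each other to the extent that their shingle sets overlap. Here is a minimal sketch of that resemblance computed exactly on full shingle sets. The report itself works with small sketches of the shingle sets to make this feasible at web scale, which this toy version does not attempt, and w=4 here is just an illustrative choice.

```python
def shingles(text, w=4):
    """Set of all contiguous w-word sequences (shingles) in `text`."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Very short documents become a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Shared shingles divided by all shingles (Jaccard coefficient)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


doc_1 = "a rose is a rose is a rose"
doc_2 = "a rose is a flower which is a rose"
print(len(shingles(doc_1)), resemblance(doc_1, doc_2))
```

Because shingles are word sequences, a small edit only disturbs the few shingles that span it, so near-duplicates still score high, unlike with a hash of the whole document.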

I just browsed through the document and began wondering about a few things:

1) Will those same experiments they did in 1997 work without bombing on the humongous amounts of data that the world wide web has now?
- Me thinks, a 1997 AltaVista crawl is nowhere near a 2010 Google crawl.

2) I don’t exactly understand the purpose of indexing even the duplicates and then clustering them, basically. I don’t want to index duplicates at all!
[There might be several compelling reasons to index duplicates too - but, I just can't find them justified!]

3) On the process of removing duplicates: I think Solr Deduplication works by taking an MD5 hash, or some such function, of a document. At that rate, even an additional word or two, or a changed sentence order, makes one document’s hash different from another’s, doesn’t it?

I am just wondering how much computational effort is needed if a keyword-based hashing function is used to estimate the similarity between documents. Let me take a very loose example to explain my doubt:

1) For each document you are about to index, extract the keywords (or whatever feature you prefer) after some pre-processing (stemming etc., for example). Now, make a hash of these keywords, and save it as the document’s unique identifier.

2) For each new document, see if there is already such a hash in the index. If it’s there, it means that this is a duplicate and need not be indexed (or put to be clustered..whatever!)

-Can’t this work instead of doing a hash on the entire document?? (I am not thinking in terms of computational complexity and effort. I am just thinking in terms of what appears to make sense to me).
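The two steps above can be sketched as follows, next to a plain whole-document MD5 for comparison. This is a loose illustration of the idea only: the stopword list is a tiny assumption, stemming is left out, and the `seen` set stands in for looking the hash up in the index.

```python
import hashlib
import re

# Tiny illustrative stopword list (assumption).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def full_hash(text):
    """MD5 of the whole document: any change at all gives a new hash."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def keyword_hash(text):
    """MD5 of the sorted set of keywords: ignores order and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    keywords = sorted({t for t in tokens if t not in STOPWORDS})
    return hashlib.md5(" ".join(keywords).encode("utf-8")).hexdigest()


seen = set()  # stands in for "is this hash already in the index?"


def is_duplicate(text):
    """Step 2: report a duplicate if the keyword hash was seen before."""
    key = keyword_hash(text)
    if key in seen:
        return True
    seen.add(key)
    return False


doc_a = "The crawler skips duplicate pages."
doc_b = "Duplicate pages: the crawler skips."

first = is_duplicate(doc_a)
second = is_duplicate(doc_b)
print(full_hash(doc_a) == full_hash(doc_b), first, second)
```

The catch is visible in the same sketch: any two documents that happen to share a keyword set collapse to one hash, so this trades missed duplicates for false ones, and it still gives a yes/no answer rather than a degree of similarity.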

As I was writing this, I found this snippet, in a document titled: ‘Practical Web Crawling Issues‘, by Carlos Castillo. (more about it in another post!)

“We calculate a hash function of the contents of the pages to avoid storing the same content twice. To account for minor variations in the Web pages, this hash function is calculated after the page have been parsed, so two pages with identical content but different colors or formatting will still be detected as duplicates. Note that this method only avoids storing the duplicate Web pages, it does not prevent downloading the page, and duplicate content can generate a waste of valuable network resources.

Recommendation: in our case, the crawler does not follow links from a Web page if the Web page is found to be a duplicate; applying this heuristic, we downloaded just about 6% of duplicates.”

(While the web has around 30% duplicates, according to a 1999 Stanford paper which can be found in this document’s references)
-Interesting, but vague, like my thoughts, except for the ‘6%’ thing! :P
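The “hash after parsing” idea from the quote can be tried out with the standard library alone: strip the markup, normalize whitespace, and hash what is left. The snippet does not say which hash function or parser Castillo used, so MD5 and Python’s `html.parser` here are my own choices for the sketch.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of a page, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def content_hash(html):
    """Hash of the page text after parsing, with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Same words, different colors and formatting.
page_a = '<html><body><p style="color:red">Hello   world</p></body></html>'
page_b = '<html><body><p><b>Hello</b> world</p></body></html>'
same = content_hash(page_a) == content_hash(page_b)
print(same)
```

This catches pages that differ only in presentation, but a single changed word still produces a different hash, so it sits somewhere between whole-file hashing and a real similarity measure.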

Out of curiosity, I opened the 1999 Stanford paper, and found that it avoided crawling duplicates altogether! (More on this in the next post.)

Well, actually speaking, I am fine with Google’s handling of duplicates most of the time. However, at times like now, I get frustrated, and my peanut brain starts asking such questions. :-)

Published in: on November 7, 2010 at 8:00 am  Comments (5)  

Googleology is bad science!

The title instantly hit my brain and I began reading it, after a generous friend downloaded the restricted-entry PDF and sent it to me.

The theme of this paper is on using the world wide web as a data source for various data-intensive tasks. Now, how is this related to the topic? Well, the best way to enter the WWW is a search engine! Can you see the light?

Some of the examples of this approach mentioned in the article are: using hit counts to identify likely translations of compositional phrases, finding synonyms, building models of noun-noun compound bracketing (what is that supposed to mean??) etc. There was also a team which worked on validating results from these experiments on the WWW by comparing them with human subjects. Of course, I use the world wide web and word counts for something else too – spell check! (I will come to this in the coming lines!)

However, there are a few issues with this approach, as the paper says:

1) Firstly, to get a real estimate, we might have to give several queries to the search engine. Taking the example mentioned in the paper, to estimate the frequencies of the word pair ‘fulfil obligation’, [Keller and Lapata] make 36 different queries, to cover all forms and all possibilities. Imagine a language with more inflections or varied constructions!

2) Search syntax is limited.

3) Search engines impose limitations on querying their APIs.

4) Search hits are for pages, not instances.

Strangely enough, the reasons I expected did not find a mention here:
1) The unreliable and inconsistent search engine counts – strange that this is not mentioned in the above reasons, but later in the paper! Here is a good article about this.
2) What about agglutinative languages?
3) This approach may not be great, after all. My strong objection: this will perpetuate errors. Let us say a particular word occurs only a few times on the web and it has a popular misspelling. What if the people who use the correct spelling write less on the web than those who use the wrong one? As time passes, the hits for the wrong spelling increase.. and all our approaches for synonyms/spellings/whatever will give us wrong results..and we in turn take them as right..and the count of wrong ones increases :-) . Well, this was my experience a couple of times when I tried relying on Google search counts for checking the spellings of a few Telugu words..hence, the concern.

Yes, there was also a discussion on the presence of too many duplicate pages and too much spam. Bah, I hate those duplicate pages – I had to invent all sorts of ugly workarounds in our project to avoid duplicates being shown in the results, at a big cost. Duplicates, I think, are a big issue, even now, even in Google. Anyone who proceeds beyond page 1 of Google search results can know that :-) . I’ve been giving weird queries these days, in which I saw results from the same domain, sometimes in 3-4 results out of the top 10. Of course, pages from the same domain don’t count as duplicates. But after seeing a “view results from same domain” link at a result, you don’t really wish to see results from the same domain on the same results page, do you??

The alternative for language researchers’ queries, according to this paper, is to build a search engine that works around these issues. They actually tried this and prepared web corpora for German and Italian, which are publicly accessible. Their hope is that a collaborative effort of the research community might be able to reach the efficiency level of a commercial search engine.

To me, data cleaning appears to be an interesting problem. But, apparently, the research community doesn’t feel so, according to what the paper cites :-) Since ‘data cleaning’ is perhaps considered unworthy of publication, the ‘CLEANEVAL’ project was set up in this direction, it seems (I did not read about it, yet!). Ultimately, the aim is to develop a web-scale, commercial-quality, low-noise corpus which can be used by linguistic and language technology researchers in their experiments.

Now comes the issue, which a cynical person like me would emphatically answer with a big NO!
‘Can the research community compare with Industry, in terms of data collection and cleaning?’
In my view (with my limited knowledge) – Not even God can surpass Google in terms of the amount of data! :-)
And unless you have a HUGE corpus, you can’t expect great results, in these data-intensive tasks.

The article ends on an inconclusive note, citing two options that the research community has:
1) Giving up, saying that they can never compete with Google/Microsoft/Yahoo/AltaVista etc.
2) Collaborating and trying to match the industry giants

While calling it bad, the article is actually trying to emulate the same approach…Hmm, I think the title of this article should have been ‘Googleology is bad science. Is it?’

The article details:
Googleology is bad science, A. Kilgarriff, 2007. Computational Linguistics 33 (1): 147-151. People with ACM access can download it here.

And the article’s references, in case anyone’s interested:

Baroni, Marco and Adam Kilgarriff. 2006. Large linguistically-processed web corpora for multiple languages. In Proceedings of European ACL, Trento, Italy.

Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. Computer Networks, 29(8–13):1157–1166.

Grefenstette, Gregory. 1999. The WWW as a resource for example-based MT tasks. In ASLIB Translating and the Computer Conference, London.

Keller, Frank and Mirella Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459–484.

Nakov, Preslav and Marti Hearst. 2005. Search engine statistics beyond the n-gram: Application to noun compound bracketing. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 17–24, Ann Arbor, Michigan.

Ravichandran, Deepak, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of ACL, Ann Arbor, Michigan.

Snow, Rion, Daniel Jurafsky, and Andrew Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of ACL, Sydney.

Turney, Peter D. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In European Conference on Machine Learning, pages 491–502.

Published in: on November 4, 2010 at 8:58 pm  Comments (2)  