The title instantly caught my attention, and I began reading as soon as a generous friend downloaded the restricted-access PDF and sent it to me.
The theme of this paper is the use of the World Wide Web as a data source for various data-intensive tasks. Now, how is this related to the topic? Well, the best way to enter the WWW is a search engine! Can you see the light?
Some of the examples of this approach mentioned in the article are: using hit counts to identify likely translations of compositional phrases, finding synonyms, building models of noun-noun compound bracketing (what is that supposed to mean??), etc. There was also a team which worked on validating the results of these web experiments by comparing them with human subjects. Of course, I use the World Wide Web and word counts for something else too – spell check! (I will come to this in the coming lines!)
However, there are a few issues with this approach, as the paper says:
1) First, to get a realistic estimate, we might have to send several queries to the search engine. Taking the example mentioned in the paper, to estimate the frequency of the word pair ‘fulfil obligation’, [Keller and Lapata] make 36 different queries, to cover all forms and all possibilities. Imagine a language with more inflections or more varied constructions!
2) Search syntax is limited.
3) Search engines impose limits on querying their APIs.
4) Search hits are for pages, not instances.
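To get a feel for point 1, here is a minimal sketch of how the number of queries multiplies for a single word pair. The word forms and the in-between words below are my own illustrative assumptions, not the exact set used by Keller and Lapata, and `expand_queries` is a made-up helper name; nothing here talks to a real search engine.

```python
from itertools import product


def expand_queries(verb_forms, noun_forms, fillers):
    """Return one quoted phrase query per (verb, filler, noun) combination."""
    queries = []
    for verb, filler, noun in product(verb_forms, fillers, noun_forms):
        # An empty filler means the verb and noun are adjacent.
        words = [w for w in (verb, filler, noun) if w]
        queries.append('"' + " ".join(words) + '"')
    return queries


# Illustrative forms only (an assumption, not the paper's list).
verb_forms = ["fulfil", "fulfils", "fulfilled", "fulfilling"]
noun_forms = ["obligation", "obligations"]
fillers = ["", "the", "an"]

queries = expand_queries(verb_forms, noun_forms, fillers)
print(len(queries))  # 4 verb forms x 3 fillers x 2 noun forms
for q in queries[:3]:
    print(q)
```

Even this small setup needs 24 queries for one word pair, and some of the combinations are not even grammatical. Each added inflection or construction multiplies the total again.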
Strangely enough, the reasons I expected did not find a mention here:
1) The unreliable and inconsistent search engine counts – strange that this is not mentioned in the above reasons, but later in the paper! Here is a good article about this.
2) What about agglutinative languages?
3) This approach may not be great, after all. My strong objection: it will perpetuate errors. Let us say a particular word appears only a small number of times on the web and has a popular misspelling. What if the people who use the correct spelling write less on the web than those who use the wrong one? As time passes, the hits for the wrong spelling increase, all our approaches to synonyms, spellings or whatever yield wrong results, we in turn take them as right, and the count for the wrong form increases further :-). Well, this was my experience the couple of times I tried relying on Google search counts to check the spellings of a few Telugu words; hence the concern.
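The feedback loop in point 3 can be sketched with a toy simulation. The hit counts and the two placeholder spellings below are made up for illustration, and `pick_spelling` and `simulate` are hypothetical helpers, not anything from the paper.

```python
def pick_spelling(hit_counts):
    """Choose the spelling with the most hits, as a naive checker would."""
    return max(hit_counts, key=hit_counts.get)


def simulate(hit_counts, new_pages_per_round, rounds):
    """Each round, new pages adopt whichever spelling currently has more hits."""
    counts = dict(hit_counts)
    for _ in range(rounds):
        winner = pick_spelling(counts)
        counts[winner] += new_pages_per_round
    return counts


# Made-up numbers: the misspelling starts out more frequent on the web.
hit_counts = {"correct_form": 1200, "popular_misspelling": 4500}

print(pick_spelling(hit_counts))
print(simulate(hit_counts, new_pages_per_round=500, rounds=3))
```

In this toy model the correct form never catches up: every writer who trusts the counts adds one more page to the wrong side.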
Yes, there was also a discussion of the presence of too many duplicate pages and too much spam. Bah, I hate those duplicate pages – I had to invent all sorts of ugly workarounds in our project, at a big cost, to avoid showing duplicates in the results. Duplicates, I think, are a big issue even now, even in Google. Anyone who proceeds beyond page 1 of Google search results can see that :-). I’ve been giving weird queries these days, and I sometimes saw 3-4 of the top 10 results coming from the same domain. Of course, pages from the same domain don’t count as duplicates. But after seeing a “view results from same domain” link next to a result, you don’t really wish to see more results from that domain on the same results page, do you??
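For the curious, here is a minimal sketch of one well-known way to spot near-duplicate pages: word shingles compared by set overlap, in the spirit of the Broder et al. paper listed in the references below. This is not the workaround I used in our project. The function names, the shingle size `k` and the `threshold` are assumptions chosen for illustration, and the all-pairs comparison would be far too slow at web scale.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard overlap of the two shingle sets: 1.0 means identical sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(pages, threshold=0.8, k=3):
    """Keep a page only if it is not too similar to any page already kept."""
    kept = []
    for page in pages:
        if all(resemblance(page, other, k) < threshold for other in kept):
            kept.append(page)
    return kept


pages = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about web corpora",
]
print(len(drop_near_duplicates(pages)))
```

The exact copy is dropped and the unrelated page is kept. A real system would replace the pairwise comparison with compact sketches of each shingle set.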
The alternative for language researchers’ queries, according to this paper, is to build a search engine that works around these issues. The authors actually tried this and prepared web corpora for German and Italian, which are publicly accessible. Their hope is that the collaborative effort of the research community might be able to reach the efficiency level of a commercial search engine.
To me, data cleaning appears to be an interesting problem. But apparently the research community doesn’t feel so, according to what the paper cites 🙂 Recognising that ‘data cleaning’ is perhaps considered unworthy of publication, the ‘CLEANEVAL’ project was set up in this direction, it seems (I have not read about it yet!). Ultimately, the aim is to develop a web-scale, commercial-quality, low-noise corpus which linguistics and language technology researchers can use in their experiments.
Now comes the question, which a cynical person like me would emphatically answer with a big NO!
‘Can the research community compete with industry in terms of data collection and cleaning?’
In my view (with my limited knowledge) – Not even God can surpass Google in terms of the amount of data! 🙂
And unless you have a HUGE corpus, you can’t expect great results, in these data-intensive tasks.
The article ends on an inconclusive note, citing two options that the research community has:
1) Giving up, saying that they can never compete with Google/Microsoft/Yahoo/AltaVista etc.
2) Collaborating and trying to match the industry giants.
While calling it bad, the article is actually trying to emulate the same approach… Hmm, I think the title of this article should have been ‘Googleology is bad science. Is it?’
The article details:
Googleology is bad science, A. Kilgarriff, 2007. Computational Linguistics 33 (1): 147-151. People with ACM access can download it here.
And the article’s references, in case anyone’s interested:
Baroni, Marco and Adam Kilgarriff. 2006. Large linguistically-processed web corpora for multiple languages. In Proceedings of European ACL, Trento, Italy.
Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. Computer Networks, 29(8–13):1157–1166.
Grefenstette, Gregory. 1999. The WWW as a resource for example-based MT tasks. In ASLIB Translating and the Computer Conference, London.
Keller, Frank and Mirella Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459–484.
Nakov, Preslav and Marti Hearst. 2005. Search engine statistics beyond the n-gram: Application to noun compound bracketing. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 17–24, Ann Arbor, Michigan.
Ravichandran, Deepak, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of ACL, Ann Arbor, Michigan.
Snow, Rion, Daniel Jurafsky, and Andrew Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of ACL, Sydney.
Turney, Peter D. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In European Conference on Machine Learning, pages 491–502.