I never imagined that a 1997 technical report from Digital Equipment Corporation would remind me of my nightmares 🙂
Well, here is the report: Syntactic Clustering of the Web.
Authors: Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig.
Group: Digital’s Systems Research Center.
Here is the abstract:
“We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a “Lost and Found” service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.”
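(A side note, mostly for my own understanding: the “syntactic similarity” in the report is computed over sets of overlapping word sequences, or shingles, from each document. Here is a tiny, brute-force Python sketch of that idea – the paper itself uses fixed-size min-wise sketches of the shingle sets so the comparison scales to the whole web, which I am not reproducing here.)

```python
# Toy version of shingle-based resemblance: split each document into
# overlapping w-word shingles and compare the shingle sets.
def shingles(text, w=4):
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    sa, sb = shingles(a, w), shingles(b, w)
    if not (sa or sb):
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy cat near the river bank"
print(resemblance(doc1, doc2))  # ~0.43: clearly similar, though an exact hash would just say "different"
```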
I just browsed through the document and began wondering about a few things:
1) Would the same experiments they ran in 1997 work, without bombing, on the humongous amount of data the World Wide Web has now?
– Methinks a 1997 AltaVista crawl is nowhere near a 2010 Google crawl.
2) I don’t exactly understand the purpose of indexing the duplicates and then clustering them. I don’t want to index duplicates at all!
[There might be compelling reasons to index duplicates too – but I just can’t see the justification!]
3) On the process of removing duplicates: I think Solr deduplication works by taking an MD5 hash (or some such function) of a document. At that rate, even an additional word or two, or a changed sentence order, makes one document’s hash different from another’s, doesn’t it?
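Just to convince myself of that, here is a quick check with plain MD5 (whether Solr uses exactly MD5 under the hood is my assumption; the point is only that a whole-document hash is brittle):

```python
# A one-character edit gives a completely unrelated digest, so near-duplicates never collide.
import hashlib

doc1 = "Solr is an open source search platform built on Apache Lucene."
doc2 = "Solr is an open-source search platform built on Apache Lucene."  # one hyphen added

h1 = hashlib.md5(doc1.encode("utf-8")).hexdigest()
h2 = hashlib.md5(doc2.encode("utf-8")).hexdigest()
print(h1)
print(h2)
print(h1 == h2)  # False
```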
I am just wondering how much computational effort is needed if a keyword-based hashing function is used to estimate the similarity between documents. Let me take a very loose example and explain my doubt:
1) For each document you are about to index, extract the keywords (or whatever feature you prefer) after some pre-processing (stemming etc., for example). Now, make a hash of these keywords and save it as the document’s unique identifier.
2) For each new document, check whether that hash is already in the index. If it is there, the document is a duplicate and need not be indexed (or sent off for clustering… whatever!).
– Can’t this work instead of hashing the entire document? (I am not thinking in terms of computational complexity and effort; I am just thinking in terms of what appears to make sense to me.) A rough sketch of what I mean is below.
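Here is that sketch in Python; extract_keywords is a deliberately naive stand-in for whatever real pre-processing (proper stemming, stopword lists, feature selection) one would actually use:

```python
# Loose sketch of the keyword-hash idea from steps 1 and 2 above.
import hashlib

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def extract_keywords(text):
    # naive stand-in for real keyword extraction / stemming
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sorted({w for w in words if w and w not in STOPWORDS})

def keyword_hash(text):
    # hash the sorted keyword set, not the raw text,
    # so word order and filler words don't change the identifier
    return hashlib.md5(" ".join(extract_keywords(text)).encode("utf-8")).hexdigest()

index = set()  # stands in for "is this hash already in the index?"

def should_index(text):
    h = keyword_hash(text)
    if h in index:
        return False   # looks like a duplicate: skip it
    index.add(h)
    return True        # new content: index it

print(should_index("The crawler downloads a page of the web."))   # True
print(should_index("A page of the web, the crawler downloads!"))  # False – same keyword set
```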
As I was writing this, I found this snippet in a document titled ‘Practical Web Crawling Issues’, by Carlos Castillo (more about it in another post!):
“We calculate a hash function of the contents of the pages to avoid storing the same content twice. To account for minor variations in the Web pages, this hash function is calculated after the page has been parsed, so two pages with identical content but different colors or formatting will still be detected as duplicates. Note that this method only avoids storing the duplicate Web pages, it does not prevent downloading the page, and duplicate content can generate a waste of valuable network resources.
Recommendation: in our case, the crawler does not follow links from a Web page if the Web page is found to be a duplicate; applying this heuristic, we downloaded just about 6% of duplicates.”
(While the web has around 30% duplicates, according to a 1999 Stanford paper which can be found in this document’s references.)
– Interesting, but vague, like my thoughts, except for the ‘6%’ thing! 😛
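Still, the ‘hash after parsing’ trick is easy to picture: strip the markup first, then hash whatever text is left, so two pages that differ only in colors or formatting end up with the same hash. A toy sketch (mine, not Castillo’s code):

```python
# Hash the page *after* stripping markup, so formatting-only differences collapse.
import hashlib
import re

def content_hash(html):
    text = re.sub(r"<[^>]+>", " ", html)   # crude tag stripping; a real crawler parses properly
    text = " ".join(text.split()).lower()  # normalize whitespace and case
    return hashlib.md5(text.encode("utf-8")).hexdigest()

page1 = "<html><body><p>Hello <b>world</b></p></body></html>"
page2 = "<html><body bgcolor='red'><p>Hello world</p></body></html>"
print(content_hash(page1) == content_hash(page2))  # True
```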
Out of curiosity, I opened the 1999 Stanford paper, and found that it avoided crawling duplicates altogether! (More on this in the next post.)
Well, actually, I am fine with Google’s handling of duplicates most of the time. However, at times like now, I get frustrated, and my peanut brain starts asking such questions. 🙂