Ah, these duplicates!

I never imagined that a 1997 technical report from Digital Equipment Corporation would remind me of my nightmares 🙂

Well, here is the report: Syntactic Clustering of the Web.
Authors: Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig.
Group: Digital’s Systems Research Center.

Here is the abstract:
“We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a “Lost and Found” service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.”
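From my skim, the paper’s central idea is to compare documents by their “shingles” (contiguous w-word windows) and define “resemblance” as the overlap between the two shingle sets. Here is a toy Python sketch of that idea as I understand it; the real system does not compare full shingle sets, but min-wise sampled fingerprints of them, which this sketch skips:

```python
def shingles(text, w=4):
    """Set of all w-word contiguous windows ('shingles') of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard overlap of the shingle sets: 1.0 = syntactically identical."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)
```

Two documents that differ in only a word or two still share most of their shingles, so their resemblance stays close to 1 — quite unlike a whole-document hash.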

I just browsed through the document and began wondering about a few things:

1) Would those same experiments they did in 1997 work, without bombing, on the humongous amount of data the World Wide Web holds now?
– Methinks a 1997 AltaVista crawl is nowhere near a 2010 Google crawl.

2) I don’t exactly understand the purpose of indexing even the duplicates and then clustering them. I just don’t want to index duplicates at all!
[There might be several compelling reasons to index duplicates too, but I just can’t find them convincing!]

3) On the process of removing duplicates: I think Solr deduplication works by taking an MD5 hash, or some such function, of a document. At that rate, even an additional word or two, or a changed sentence order, makes one document’s hash different from another’s, doesn’t it?
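To see how brittle whole-document hashing is, here is a tiny demonstration (Python’s hashlib MD5 standing in for whatever signature function Solr actually uses):

```python
import hashlib

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"  # one word changed

h1 = hashlib.md5(doc1.encode()).hexdigest()
h2 = hashlib.md5(doc2.encode()).hexdigest()
# The two digests are completely different, so an exact-hash check
# treats these near-identical documents as unrelated.
```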

I am just wondering how much computational effort is needed if a keyword-based hashing function is used to estimate the similarity between documents. Let me take a very loose example and explain my doubt:

1) For each document you are about to index, extract the keywords (or whatever feature you prefer) after some pre-processing (stemming etc., for example). Then make a hash of these keywords and save it as the document’s unique identifier.

2) For each new document, check whether its hash is already in the index. If it is there, the document is a duplicate and need not be indexed (or can be put aside for clustering… whatever!)

– Can’t this work instead of hashing the entire document? (I am not thinking in terms of computational complexity and effort; I am just thinking in terms of what appears to make sense to me.)
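Steps 1) and 2) above could be sketched like this in Python, with a made-up stopword list standing in for the pre-processing (a toy sketch, not how Solr or any real crawler does it):

```python
import hashlib
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}  # toy list

def keyword_signature(text):
    """Step 1: hash the sorted set of content words, so formatting
    and word-order changes do not change the signature."""
    words = re.findall(r"[a-z]+", text.lower())
    keywords = sorted(set(words) - STOPWORDS)
    return hashlib.md5(" ".join(keywords).encode()).hexdigest()

seen = set()  # signatures of everything indexed so far

def should_index(text):
    """Step 2: skip any document whose signature was already seen."""
    sig = keyword_signature(text)
    if sig in seen:
        return False
    seen.add(sig)
    return True
```

With this scheme, two pages that share the same content words hash to the same signature, even if punctuation or word order differ.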

As I was writing this, I found this snippet in a document titled ‘Practical Web Crawling Issues‘, by Carlos Castillo (more about it in another post!):

“We calculate a hash function of the contents of the pages to avoid storing the same content twice. To account for minor variations in the Web pages, this hash function is calculated after the page has been parsed, so two pages with identical content but different colors or formatting will still be detected as duplicates. Note that this method only avoids storing the duplicate Web pages, it does not prevent downloading the page, and duplicate content can generate a waste of valuable network resources.

Recommendation: in our case, the crawler does not follow links from a Web page if the Web page is found to be a duplicate; applying this heuristic, we downloaded just about 6% of duplicates.”

(While the Web has around 30% duplicates, according to a 1999 Stanford paper which can be found in this document’s references.)
– Interesting, but vague, like my thoughts, except for the ‘6%’ figure! 😛

Out of curiosity, I opened the 1999 Stanford paper, and found that it avoided crawling duplicates altogether! (More on this in the next post.)

Well, frankly speaking, I am fine with Google’s handling of duplicates most of the time. However, at times like now, I get frustrated, and my peanut brain starts asking such questions. 🙂

Published on November 7, 2010 at 8:00 am. Comments (5)


5 Comments

  1. I think the purpose of the original paper is different in intent from the last one you have mentioned.

    I guess the reason for storing duplicates is to serve at least one link to that doc if some of the links to it are broken (the URN concept that they explained).

    Because they wanted to store the duplicates, they needed to hash the whole doc rather than a few keywords. I don’t think eliminating duplicates entirely is a good idea either, as it leads to a SPOF (single point of failure).

    This is just my understanding/guess; I did not go through that doc entirely. I may be wrong 🙂


  2. @Suryudu: you understood it right, for the first part.
    But I just don’t like indexing duplicates, since they eat up a lot of network bandwidth… and they still show up unnecessarily in search results, to my irritation 🙂

    “Because they wanted to store the duplicates, they need to hash the whole doc rather than few key words”
    – Whole-doc indexing is not necessary, and the keywords need not be ‘few’ either 🙂

  3. I don’t think they will be indexing duplicates; they will form a cluster of them just to satisfy the URN stuff, I guess (again, I guess :))

    I think they have explained some algorithm to identify whether one doc is a duplicate of, or contained in, another, etc. I am not sure you can achieve this by hashing the keywords; that way you can find related articles, but not exact duplicates or docs containing other docs, etc. (not sure again :))

    looks like you are doing some interesting research 🙂


  4. @Suryudu: In this case, in the paper I mentioned in this article, as far as I remember, they are indexed. Whether or not they are indexed, they are still downloaded and read through. I think the one I wrote about in my next post does a better job than this one at the same task.

    Reg keyword matching: your definition of keyword and mine seem to be different. Intuitively, it should be possible to identify two similar documents that differ only in formatting and other such details using keyword matching instead of matching all words. However, doing it on web scale might pose other scalability issues, about which I am not very aware.

    I am not a researcher, yet.
    I don’t work in these areas either 🙂 Just reading out of general interest.

  5. For finding similarity, keyword matching might work. But isn’t order important for identifying duplicates?
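That last comment has a point: any hash over the *set* of a document’s words is blind to word order by construction, as this toy check shows:

```python
import hashlib

def word_set_hash(text):
    """Hash of the sorted, de-duplicated word set: order-blind by design."""
    words = sorted(set(text.lower().split()))
    return hashlib.md5(" ".join(words).encode()).hexdigest()

# Same words, different order -- the set-based hash cannot tell them apart:
assert word_set_hash("dog bites man") == word_set_hash("man bites dog")
```

So a keyword-set hash finds near-duplicate *content*, but catching reorderings as distinct documents needs something order-aware, like the shingles of the original paper.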
