I have gotten into reading papers that are more than a decade old. Admittedly, I do get a feeling of ‘oldness’ as I read them. But these papers provide good overviews of the problems in web information retrieval, and of possible application scenarios for the solutions.
(By reading, I mean a careful reading and understanding of the overall intentions of the paper, not analysing and validating its math etc. I have only a general interest in the subject.)
‘Finding replicated web collections’ is a 25-page paper (:O) from Stanford and Google, by Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina.
In brief: This paper, as the name suggests, deals with finding replicated documents on the web. The approaches proposed scale to tens of millions of pages.
The motivation for finding replicated documents is manifold. As the ‘Introduction’ section summarizes, several web search engine tasks can be performed more effectively once replication is known.
1) A crawler’s job can be finished more quickly by making it skip replicated content. This also saves a lot of bandwidth.
2) A page that has many copies is perhaps more important than others, so search results can be ranked by this factor.
3) For archiving, pages with more duplicates can be given priority in case there are storage constraints, since they are ‘important’ and hence of more archival value.
4) I did not quite get this point, but here it is in their own words: “cache holds web pages that are frequently accessed by some organization. Again knowledge about collection replication can be used to save space. Furthermore, caching collections as a whole may help improve hit ratios (e.g., if a user is accessing a few pages in the LDP, chances are he will also access others in the collection).”
The three main issues involved are:
1) Defining a notion of ‘relatedness’
2) Finding methods to identify such related documents
3) Exploiting this replication
The major difficulties in finding replicated documents are:
1) Sometimes the replicas are updated at different points in time, so at crawl time they might not look similar.
2) Sometimes mirror collections are only partial copies.
3) The data formats of the mirrors and the original may differ.
4) The snapshots compared may come from partial crawls.
What is similarity, anyway? How should we define it?
A major contribution of this paper is that it defines a similarity measure, and then uses that similarity information to improve crawl quality and efficiency.
A major part of the rest of the paper goes into defining, calculating, and evaluating various similarity measures. Though this was what I wanted to read when I began the paper, other things mentioned here began to interest me, and I skipped this part! After this section, they describe how they applied all this to crawling and searching, and it was the application to crawling that caught my interest over the course of reading.
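Since I skipped the paper’s own measures, here is just a flavour of what document-level similarity can look like: a minimal sketch of shingle-based Jaccard similarity. This is a standard near-duplicate-detection technique, not necessarily the exact measure the paper defines, and the two example documents are made up.

```python
def shingles(text, k=4):
    """Split text into the set of overlapping k-word 'shingles'."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=4):
    """Jaccard similarity of two documents' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Two near-duplicate documents differing in a single word:
doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy cat near the river bank"
print(jaccard(doc1, doc2))
```

A one-word edit changes every shingle that overlaps it, so the score drops well below 1.0 while still clearly signalling near-duplication; exact copies score exactly 1.0.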
Well, Google’s crawler back then had crawled 25 million pages, it seems (!! seems like the prehistoric age !!). Analyzing this crawled data for duplicate content, they found that almost 48% of the crawl had duplicate content (‘duplication’ here does not mean 100% duplication). They then precoded information on widely replicated collections and on machines with multiple names into Google’s crawler and repeated the crawl. The duplication was now only 13%, despite the fact that the crawl size had grown to 35 million pages!
[Interesting…. and I can see a sci-fi fantasy where duplicates, which are creating confusion and havoc in the bit world, are being checked by the geeks – a war ensues and, for now, the geeks win, and the duplicate bits vow revenge.. part 2 continues! :P]
Likewise, each crawl accumulates information on replicated content, which won’t be crawled in the future.
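The idea of precoding replica knowledge into the crawler can be sketched very simply: filter the crawl frontier against known replicated collections. The URL prefixes and frontier below are invented for illustration; in the paper, the replica knowledge came from analyzing a previous crawl.

```python
# Known replicated collections, mapping a replica's URL prefix to the
# canonical copy we prefer to crawl. These prefixes are hypothetical.
KNOWN_REPLICAS = {
    "http://mirror.example.org/ldp/": "http://canonical.example.com/ldp/",
}

def should_crawl(url, known_replicas=KNOWN_REPLICAS):
    """Return False if the URL falls under a known replicated collection."""
    return not any(url.startswith(prefix) for prefix in known_replicas)

frontier = [
    "http://canonical.example.com/ldp/intro.html",
    "http://mirror.example.org/ldp/intro.html",   # known replica: skipped
    "http://other.example.net/page.html",
]
to_fetch = [u for u in frontier if should_crawl(u)]
print(to_fetch)
```

Skipping a whole collection by prefix, rather than comparing pages one by one after fetching them, is what saves the crawl time and bandwidth mentioned above.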
Apart from the paragraph quoted above, I found the whole of the introductory sections extremely readable.
Reading ‘history’ is good, anyway! 🙂