Practical Web Crawling Issues by Carlos Castillo.
I don’t remember how I came across this document. I think I was looking for website-copier-style crawlers that I could use to fetch a few hundred or a few thousand web pages from some seed websites, and Google showed me this.
As I browsed through it, it struck me as well written and comprehensive, even though the issues are fairly obvious once you think about how a crawler works. I have never been involved in designing a crawler, though I have used a couple of them over the years. For me, it provided a good overview. Once I began writing about it, I tried to find out who the author is, and here he is 🙂
“Carlos Castillo is a research scientist at Yahoo! Research, Barcelona. He received his Ph.D from the University of Chile in 2004, and was a visiting scientist at Universitat Pompeu Fabra (2005) and Universita di Roma La Sapienza (2006). At age 33 (2010), he is an active researcher with over 30 publications in international conferences and journals in the areas of Web search and mining. He has served in the program committee of most major conferences in his area (WWW, WSDM, SIGIR, CIKM, etc.), and co-organized the Adversarial Information Retrieval Workshop and Web Spam Challenge in 2007 and 2008, and the ECML/PKDD Discovery Challenge 2010. His current research interest is the mining of content, links and usage data from the Web.”
That bio says it all about the credentials of the document 😛
Coming to the point, this is a brief ‘executive’ summary of the document.
Practical web crawling issues can be broadly classified into six categories:
1) Networking issues (variable network quality, server administrators’ concerns, inconsistent firewall settings)
2) Massive DNS resolving (crashing of local DNS servers, temporary DNS failures, malformed DNS records, wrong DNS records, use of the WWW prefix)
3) HTTP implementations (Accept headers not honored, range errors, responses lacking headers, “Found” where you mean “error”, wrong dates in headers)
4) HTML coding (wrong markup, physical over logical content representation)
5) Web content characteristics (blogging, mailing lists and forums, duplicate detection)
6) Server application programming (embedded session IDs, repeated path components, slower or erroneous pages)
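To make the last category a bit more concrete, here is a minimal Python sketch of how a crawler might deal with embedded session IDs and repeated path components. This is my own illustration, not code from the document: the function names, the list of session parameter names, the `max_repeats` threshold and the example.com URLs are all assumptions, and a real crawler would need to tune them.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed (hypothetical) set of session parameter names; not an exhaustive list.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}

# Matches ";jsessionid=..." style parameters embedded in the path.
_PATH_SESSION = re.compile(
    r";(?:phpsessid|jsessionid|sid|sessionid)=[^/?#;]*", re.IGNORECASE
)


def strip_session_ids(url):
    """Return the URL with likely session identifiers removed."""
    parts = urlsplit(url)
    path = _PATH_SESSION.sub("", parts.path)
    # Keep every query parameter whose name is not a known session key.
    # Note: parse_qsl/urlencode may re-encode the remaining values.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment)
    )


def has_repeated_path(url, max_repeats=2):
    """Flag URLs where one path segment occurs more than max_repeats times."""
    segments = [seg for seg in urlsplit(url).path.split("/") if seg]
    return any(count > max_repeats for count in Counter(segments).values())


if __name__ == "__main__":
    print(strip_session_ids(
        "http://example.com/a/b;jsessionid=ABC123?x=1&PHPSESSID=deadbeef"
    ))
    print(has_repeated_path("http://example.com/a/b/a/b/a/b/"))
```

The idea is that two URLs differing only in a session ID collapse to the same string, so the crawler does not fetch the same page twice, and that URLs with suspiciously repetitive paths (often the result of broken relative links) can be skipped before they lead the crawler into a loop.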
The best part about this document is that it is not merely an enumeration of problems. It also suggests solutions wherever possible. Most of us will never need to write a crawler, even when we need to use one. However, if we do use a crawler, I think it helps to understand some of these issues. Taking an extreme stance, I would say that every search engine user ought to know more about crawling than they do now. I would not say that this is the only document on the subject, but it is certainly a good place to learn more.
I think I will read more of this thesis soon, whenever I feel compelled to do so 😉