On Practical Web Crawling issues

Practical Web Crawling Issues by Carlos Castillo.

I don’t remember how I came across this document. I think I was looking for website-copier-like crawlers that I could use to fetch a few hundred or a few thousand webpages from some seed websites, and Google showed me this.

As I browsed through it, it struck me as well written and comprehensive, even though the issues are fairly obvious once you think about how a crawler works. I have never been involved in designing a crawler, though I have used a couple of them over time. To me, it provided a good overview. Once I began writing about it, I tried to find out who the author is, and here he is 🙂

“Carlos Castillo is a research scientist at Yahoo! Research, Barcelona. He received his Ph.D from the University of Chile in 2004, and was a visiting scientist at Universitat Pompeu Fabra (2005) and Universita di Roma La Sapienza (2006). At age 33 (2010), he is an active researcher with over 30 publications in international conferences and journals in the areas of Web search and mining. He has served in the program committee of most major conferences in his area (WWW, WSDM, SIGIR, CIKM, etc.), and co-organized the Adversarial Information Retrieval Workshop and Web Spam Challenge in 2007 and 2008, and the ECML/PKDD Discovery Challenge 2010. His current research interest is the mining of content, links and usage data from the Web.”
That bio says it all about the credentials of the document 😛

Coming to the point, this is a brief ‘executive’ summary of the document.

Practical web crawling issues can be broadly classified into 6 categories:
1) Networking issues (variable network quality, server admins' concerns, inconsistent firewall settings)
2) Massive DNS resolving (crashing of local DNS servers, temporary DNS failures, malformed DNS records, wrong DNS records, use of the WWW prefix)
3) HTTP implementations (Accept headers not honored, range errors, responses lacking headers, "Found" where you mean "error", wrong dates in headers)
4) HTML coding (wrong markup, physical over logical content representation)
5) Web content characteristics (blogging, mailing lists and forums, duplicate detection)
6) Server application programming (embedded session IDs, repeated path components, slower/erroneous pages)
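
To make category 6 a little more concrete, here is a minimal Python sketch of how a crawler might guard against embedded session IDs and repeated path components. This is my own illustration, not code or a method taken from the document: the parameter names in `SESSION_PARAMS`, the helper names, and the `max_repeats` threshold are all assumptions chosen for the example.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed (hypothetical) list of query parameter names that often carry session IDs.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_ids(url):
    """Return the URL without common session-ID parameters.

    Two URLs that differ only in their session ID then map to the same
    string, so the crawler does not download the same page twice.
    """
    parts = urlsplit(url)
    # Remove ";jsessionid=..." style parameters embedded in the path.
    path = re.sub(r";jsessionid=[^/?#]*", "", parts.path, flags=re.IGNORECASE)
    # Keep every query parameter except the ones that look like session IDs.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(query), parts.fragment)
    )


def has_repeated_path_components(url, max_repeats=2):
    """Flag URLs such as /a/b/a/b/a/b/ that usually come from broken relative links.

    max_repeats is an arbitrary threshold picked for this sketch.
    """
    segments = [segment for segment in urlsplit(url).path.split("/") if segment]
    counts = Counter(segments)
    return any(count > max_repeats for count in counts.values())
```

A crawler could call `strip_session_ids` on every extracted link before checking whether the URL was already seen, and skip any URL for which `has_repeated_path_components` returns `True`. Real sites use many other session parameter names, so a fixed list like this is only a starting point.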

The best part about this document is that it is not just an enumeration of problems. It also suggests solutions wherever possible. Most of us will never need to write a crawler, even when we need to use one. However, if we use a crawler, I think it helps to understand some of these issues. Taking an extreme stance, I would say that every search engine user ought to know more about crawling than he or she knows now. I would not say that this is the only document on the subject, but it is certainly a good one to learn more from.

I think I would read more from this thesis soon, whenever I feel compelled to do so 😉

Published on November 12, 2010 at 8:00 am


2 Comments

  1. Useful info…. thanks for the post….
    “A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots,[1] or Web spiders, Web robots, or—especially in the FOAF community—Web scutters.
    This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam”

  2. […] the original: On Practical Web Crawling issues « sowmyawrites …. By admin | category: POMPEU FABRA University | tags: active-researcher, chile, fabra, […]
