Design and Implementation of a High-Performance Distributed Web Crawler
San Jose, California February 26-March 01
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDE.2002.994750
18th International Conference on Data ...
Vladislav Shkapenyuk, Polytechnic University
Torsten Suel, Polytechnic University
Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost.

In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.
Index Terms:
world wide web, WWW, web search, search engines, crawler, distributed crawling, network of workstations, I/O efficiency
Citation:
Vladislav Shkapenyuk, Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler," 18th International Conference on Data Engineering (ICDE'02), p. 357, 2002