With the explosive growth of the Web, focused web crawlers are gaining attention. A focused web crawler aims to find web pages related to a predefined topic. CINDI Robot is a focused web crawler devoted to finding academic documents in computer science and software engineering. We propose a multi-level inspection scheme to discover relevant web pages. In this scheme, the text features of a page's content contribute to the classification, while other web characteristics, such as URL patterns and anchor text, assist the decision process. Experimental results demonstrate that this multi-level inspection method outperforms traditional methods.
Index Terms:
focused web crawler, SVM classifier, Naïve Bayes classifier, multi-level inspection, revised context graph, tunneling
Citation:
Rui Chen, Bipin C. Desai, Cong Zhou, "CINDI Robot: an Intelligent Web Crawler Based on Multi-level Inspection," ideas, pp. 93-101, 11th International Database Engineering and Applications Symposium (IDEAS 2007), 2007.