Title :
CINDI Robot: an Intelligent Web Crawler Based on Multi-level Inspection
Author :
Chen, Rui ; Desai, Bipin C. ; Zhou, Cong
Author_Institution :
Concordia Univ., Montreal
Abstract :
With the explosive growth of the Web, focused Web crawlers are gaining attention. A focused Web crawler aims to find Web pages related to a predefined topic. CINDI Robot is a focused Web crawler devoted to finding academic documents in computer science and software engineering. We propose a multi-level inspection scheme to discover relevant Web pages. In this scheme, text features of the page content drive the classification, while other Web characteristics, such as URL patterns and anchor text, assist the decision process. Experimental results demonstrate that this multi-level inspection method outperforms traditional methods.
Keywords :
Internet; classification; indexing; information retrieval; online front-ends; CINDI robot; URL pattern; Web pages; World Wide Web; computer science documents; focused Web crawler; intelligent Web crawler; multilevel inspection; software engineering academic documents; Computer science; Crawlers; Inspection; Intelligent robots; Search engines; Software engineering; Statistical analysis; Uniform resource locators; Naïve Bayes classifier; SVM classifier; focused web crawler; graph; multi-level inspection; revised context; tunneling
Conference_Title :
Database Engineering and Applications Symposium, 2007. IDEAS 2007. 11th International
Conference_Location :
Banff, Alta.
Print_ISBN :
978-0-7695-2947-9
DOI :
10.1109/IDEAS.2007.4318093