DocumentCode
1175697
Title
Exploiting interclass rules for focused crawling
Author
Altingövde, Ismail Sengör ; Ulusoy, Özgür
Author_Institution
Dept. of Comput. Eng., Bilkent Univ., Ankara, Turkey
Volume
19
Issue
6
fYear
2004
Firstpage
66
Lastpage
73
Abstract
Crawling the Web quickly and entirely is an expensive, unrealistic goal because of the required hardware and network resources. We started with a focused-crawling approach designed by Soumen Chakrabarti, Martin van den Berg, and Byron Dom, and we implemented the underlying philosophy of their approach to derive our baseline crawler. This crawler employs a canonical topic taxonomy to train a naive-Bayesian classifier, which then helps determine the relevancy of crawled pages. The crawler also relies on the assumption of topical locality to decide which URLs to visit next. Building on this crawler, we developed a rule-based crawler, which uses simple rules derived from interclass (topic) linkage patterns to decide its next move. This rule-based crawler also enhances the baseline crawler by supporting tunneling. A focused crawler gathers relevant Web pages on a particular topic. This rule-based Web-crawling approach uses linkage statistics among topics to improve a baseline focused crawler's harvest rate and coverage.
Keywords
Internet; belief networks; data mining; knowledge based systems; online front-ends; pattern classification; search engines; URL; Web pages; baseline focused crawler; canonical topic taxonomy; focused Web crawling; interclass linkage patterns; interclass rules; linkage statistics; naive-Bayesian classifier; network resources; rule-based Web-crawling; Couplings; Crawlers; Hardware; Protocols; Robots; Statistics; Uniform resource locators; Waste materials; Yarn; Web mining; naïve Bayesian classification; rule extraction; tunneling
fLanguage
English
Journal_Title
Intelligent Systems, IEEE
Publisher
IEEE
ISSN
1541-1672
Type
jour
DOI
10.1109/MIS.2004.62
Filename
1363737