DocumentCode :
2929392
Title :
Classifier-Guided Topical Crawler: A Novel Method of Automatically Labeling the Positive URLs
Author :
Chen Li ; Li, Chen ; Yu Zhong-hua ; Han Guo-hui
Author_Institution :
Coll. of Comput. Sci., Sichuan Univ., Chengdu, China
fYear :
2009
fDate :
12-14 Oct. 2009
Firstpage :
270
Lastpage :
273
Abstract :
Obtaining labeled training samples is a key factor in building a classifier-guided topical crawler. Currently, many such classifiers are trained on Web pages that are labeled manually or extracted from the Open Directory Project (ODP), and the classifiers then judge the topical relevance of the Web pages pointed to by hyperlinks in the crawler frontier. Although labeled Web pages are comparatively easy to obtain, training the classifiers on Web pages violates the general machine learning assumption that training and test sets are i.i.d. (independent and identically distributed), because the instances being classified are hyperlinks (URLs) rather than Web pages. For this reason, this paper investigates and proposes a novel template-based method for automatically labeling positive URLs in order to develop classifier-guided topical crawlers. An extensive series of off-line and on-line experiments is performed. The results demonstrate that a classifier-guided topical crawler trained with labeled URLs achieves higher recall than one trained with labeled Web pages. The results also show that a classifier using the immediate vicinity of hyperlinks and the corresponding anchor texts leads the crawler to attain a harvest rate of about 95%.
Keywords :
Web sites; learning (artificial intelligence); pattern classification; URL; Web pages; classifier-guided topical crawler; hyperlinks; independent and identical distribution; machine learning; open directory project; Bayesian methods; Crawlers; Educational institutions; Labeling; Machine learning; Support vector machine classification; Support vector machines; Testing; Uniform resource locators; Web pages; SVM; classifier; link context; topical crawler; training set;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Semantics, Knowledge and Grid, 2009. SKG 2009. Fifth International Conference on
Conference_Location :
Zhuhai
Print_ISBN :
978-0-7695-3810-5
Type :
conf
DOI :
10.1109/SKG.2009.60
Filename :
5370119