DocumentCode :
751148
Title :
Link contexts in classifier-guided topical crawlers
Author :
Pant, Gautam ; Srinivasan, Padmini
Author_Institution :
Sch. of Accounting & Inf. Syst., Utah Univ., Salt Lake City, UT, USA
Volume :
18
Issue :
1
fYear :
2006
Firstpage :
107
Lastpage :
122
Abstract :
The context of a hyperlink, or link context, is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a support vector machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits words both in the immediate vicinity of a hyperlink and in the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We analyze our results along various dimensions such as link context quality, topic difficulty, length of crawl, training data, and topic domain. The study was done using multiple crawls over 100 topics covering millions of pages, allowing us to derive statistically strong results.
Keywords :
Internet; data mining; hypermedia; information retrieval; search engines; support vector machines; Web information retrieval; Web mining; Web search; classifier-guided topical Web crawler; hyperlink context; information categorization; information navigation; support vector machine; tag tree hierarchy; Competitive intelligence; Content based retrieval; Crawlers; Information retrieval; Navigation; Search engines; Support vector machines; Training data; Web mining; Web pages; Index Terms: Web search; Web mining; performance evaluation
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2006.12
Filename :
1549831