DocumentCode :
3099453
Title :
Crawling Strategy of Focused Crawler Based on Niche Genetic Algorithm
Author :
Fan, Huilian ; Zeng, Guangpu ; Li, Xianli
Author_Institution :
Sch. of Math. & Comput. Sci., Yangtze Normal Univ., Chongqing, China
fYear :
2009
fDate :
12-14 Dec. 2009
Firstpage :
591
Lastpage :
594
Abstract :
To improve the search efficiency of a focused crawler, we design a new crawling strategy based on the niche genetic algorithm. Rather than collecting and indexing all accessible hypertext documents in order to answer all possible ad-hoc queries, the new strategy combines the advantages of hyperlink-structure and web-content strategies: it treats each hyperlink as a genetic individual, evaluates individual fitness with a topic-keyword-based vector space model (VSM), imports new URLs to implement crossover and mutation, and regards URLs that share the same prefix as a niche. The niche genetic algorithm guides the crawl direction so that the crawler selectively seeks out pages that are likely to be most relevant to a pre-defined set of topics. Experiments show that, compared with other algorithms, the strategy achieves higher precision and recall in searching for topic pages.
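The two building blocks named in the abstract, a keyword-based VSM fitness score and niches formed from URLs with a common prefix, might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the whitespace tokenizer, the plain term-frequency weighting, and the choice of scheme plus host as the "prefix" are all assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def term_vector(text):
    """Term-frequency vector over lowercase, whitespace-separated tokens (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_fitness(page_text, topic_keywords):
    """Cosine similarity between a page's term frequencies and weighted topic keywords.

    topic_keywords maps keyword -> weight; the weighting scheme is an assumption.
    """
    page = term_vector(page_text)
    dot = sum(page[term] * weight for term, weight in topic_keywords.items())
    norm_page = math.sqrt(sum(count * count for count in page.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_keywords.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_page * norm_topic)


def niche_key(url):
    """Assumed reading of 'same prefix': URLs sharing scheme and host form one niche."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def group_into_niches(urls):
    """Group candidate URLs (the genetic individuals) by their niche key."""
    niches = defaultdict(list)
    for url in urls:
        niches[niche_key(url)].append(url)
    return dict(niches)
```

In a full crawler, scores like these would drive selection within each niche, while crossover and mutation would introduce newly extracted URLs; those steps are omitted here because the abstract gives no detail on them.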
Keywords :
Internet; content management; genetic algorithms; hypermedia; query formulation; URL; Web content strategies; ad-hoc queries; crawling strategy; focused crawler; hyperlinks structure; hypertext documents; niche genetic algorithm; search efficiency; topic-keywords based VSM; vector space model; Algorithm design and analysis; Computer science; Crawlers; Genetic algorithms; Genetic mutations; Indexing; Mathematics; Search engines; Uniform resource locators; Web search; Vector Space Model; focused crawler; niche genetic algorithm; topic relevancy;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable, Autonomic and Secure Computing, 2009. DASC '09. Eighth IEEE International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-0-7695-3929-4
Electronic_ISBN :
978-1-4244-5421-1
Type :
conf
DOI :
10.1109/DASC.2009.49
Filename :
5380633