Title :
Efficient focused crawling strategy using combination of link structure and content similarity
Author :
Cheng, Qu ; Beizhan, Wang ; Pianpian, Wei
Author_Institution :
Software Sch., Xiamen Univ., Xiamen
Abstract :
At present, focused crawlers usually crawl pages using either the link structure or the page content, but both approaches have flaws. We therefore designed an efficient crawling strategy that combines link structure with content similarity. We extract the topic feature vector automatically and judge the topic similarity of a page using a combination of link structure and page content. We also predict URL similarity using the link structure in topic pages. Experiments showed that this strategy effectively increases the precision of fetching topic pages.
Keywords :
Internet; data mining; information filtering; text analysis; URL similarity; content similarity; focused crawling strategy; link structure; page content; page crawling; topic feature vector extraction; topic page fetching; topic similarity; Algorithm design and analysis; Crawlers; Data mining; Feature extraction; IP networks; Information filtering; Information filters; Queueing analysis; Search engines; Uniform resource locators;
Conference_Title :
IT in Medicine and Education, 2008. ITME 2008. IEEE International Symposium on
Conference_Location :
Xiamen
Print_ISBN :
978-1-4244-3616-3
Electronic_ISBN :
978-1-4244-2511-2
DOI :
10.1109/ITME.2008.4744029