DocumentCode :
2118089
Title :
A Generalized Links and Text Properties Based Forum Crawler
Author :
Sachan, Abhishek ; Wee-Yong Lim ; Thing, Vrizlynn L. L.
Author_Institution :
Cryptography & Security Dept., Inst. for Infocomm Res., Singapore, Singapore
Volume :
1
fYear :
2012
fDate :
4-7 Dec. 2012
Firstpage :
113
Lastpage :
120
Abstract :
Web forums have become a major source for information gathering and mining because of their large volume of user-generated content. Crawling Web forums is necessary to gather and mine this information. However, a generic Web crawler cannot crawl Web forums efficiently or effectively because they contain many redundant and duplicate pages. In addition, there is a crawling relationship among the useful pages that needs to be considered. Efficient crawling therefore requires eliminating redundant and duplicate pages and understanding this crawling relationship. Existing work on forum crawling uses methods based on visual pattern recognition, which makes it extremely computationally expensive. In this paper, we propose a novel light-weight crawling method that uses the text and link properties of the pages in Web forums. Theoretical analysis and experimental results show the effectiveness and efficiency of the proposed method.
Keywords :
Web sites; data mining; information retrieval; text analysis; duplicate page elimination; generalized link-based Web forum crawler; generic Web crawler; information gathering; information mining; light-weight crawling method; redundant page elimination; text property-based Web forum crawler; user generated content; clustering; forum crawler
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
Conference_Location :
Macau
Print_ISBN :
978-1-4673-6057-9
Type :
conf
DOI :
10.1109/WI-IAT.2012.213
Filename :
6511873