DocumentCode
2118089
Title
A Generalized Links and Text Properties Based Forum Crawler
Author
Sachan, Abhishek; Lim, Wee-Yong; Thing, Vrizlynn L. L.
Author_Institution
Cryptography & Security Dept., Inst. for Infocomm Res., Singapore, Singapore
Volume
1
fYear
2012
fDate
4-7 Dec. 2012
Firstpage
113
Lastpage
120
Abstract
Web forums have become a major source for information gathering and mining because of their large volume of user-generated content. Crawling Web forums is necessary to gather and mine this information. However, a generic Web crawler cannot crawl Web forums efficiently or effectively because they contain many redundant and duplicate pages. In addition, there is a crawling relationship among the useful pages that needs to be considered. Efficient crawling therefore requires eliminating redundant and duplicate pages and understanding this crawling relationship. Existing work on forum crawling uses methods based on visual pattern recognition, which makes it computationally very expensive. In this paper, we propose a novel light-weight crawling method that uses the text and link properties of the pages in Web forums. Theoretical analysis and experimental results show the effectiveness and efficiency of the proposed method.
Keywords
Web sites; data mining; information retrieval; text analysis; duplicate page elimination; generalized link-based Web forum crawler; generic Web crawler; information gathering; information mining; light-weight crawling method; redundant page elimination; text property-based Web forum crawler; user generated content; clustering; forum crawler
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
Conference_Location
Macau
Print_ISBN
978-1-4673-6057-9
Type
conf
DOI
10.1109/WI-IAT.2012.213
Filename
6511873