  • DocumentCode
    2118089
  • Title
    A Generalized Links and Text Properties Based Forum Crawler
  • Author
    Sachan, Abhishek ; Lim, Wee-Yong ; Thing, Vrizlynn L. L.
  • Author_Institution
    Cryptography & Security Dept., Inst. for Infocomm Res., Singapore, Singapore
  • Volume
    1
  • fYear
    2012
  • fDate
    4-7 Dec. 2012
  • Firstpage
    113
  • Lastpage
    120
  • Abstract
    Web forums have become a major source for information gathering and mining because of their large amount of user-generated content. Crawling Web forums is necessary to gather and mine this information. However, a generic Web crawler cannot crawl Web forums efficiently and effectively because they contain many redundant and duplicate pages. In addition, there is a crawling relationship among the useful pages that needs to be considered. Efficient crawling therefore requires eliminating redundant and duplicate pages and understanding this crawling relationship. Existing work on forum crawling uses methods based on visual pattern recognition, which makes it extremely computationally expensive. In this paper, we propose a novel light-weight crawling method that uses the text and link properties of the pages in Web forums. Theoretical analysis and experimental results show the effectiveness and efficiency of the proposed method.
  • Keywords
    Web sites; data mining; information retrieval; text analysis; duplicate page elimination; generalized link-based Web forum crawler; generic Web crawler; information gathering; information mining; light-weight crawling method; redundant page elimination; text property-based Web forum crawler; user generated content; clustering; forum crawler
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
  • Conference_Location
    Macau
  • Print_ISBN
    978-1-4673-6057-9
  • Type
    conf
  • DOI
    10.1109/WI-IAT.2012.213
  • Filename
    6511873