• DocumentCode
    1607797
  • Title

    InForCE: Forum data crawling with information extraction

  • Author

    Zhang, Can ; Zhang, Jingwei

  • Author_Institution
    Inst. of Massive Comput., East China Normal Univ., Shanghai, China
  • fYear
    2010
  • Firstpage
    367
  • Lastpage
    373
  • Abstract
    Forum data acquisition is a prerequisite for forum data analysis tasks such as opinion analysis and online advertisement. Since the structure of forum data usually has casual relationships with the page structure, effective forum data acquisition requires integrating Web page crawling with information extraction. In this paper, we propose a system, InForCE, for this purpose. The system has two parts. First, it downloads Web pages from different forums and generates HTML documents. Second, it extracts structured data from the HTML documents according to user requirements. For the extraction step, we propose a novel algorithm that automatically transforms user requirements into XSLT. Our experimental results show that structured data extraction is feasible and efficient.
  • Keywords
    Internet; data analysis; hypermedia markup languages; information retrieval; HTML document generation; InForCE; Web pages crawling; XSLT; forum data acquisition; forum data analysis; forum data crawling; information extraction; online advertisement; opinion analysis; Crawlers; Data analysis; Data mining; HTML; Transforms; Web pages; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Universal Communication Symposium (IUCS), 2010 4th International
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-7821-7
  • Type

    conf

  • DOI
    10.1109/IUCS.2010.5666252
  • Filename
    5666252
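
  • Example
    The abstract says that user requirements are transformed into XSLT automatically but gives no details of the algorithm. The following is only a minimal sketch of that general idea, not the authors' method. It assumes a hypothetical requirement format: one XPath that selects each forum post, plus a mapping from output field names to XPaths relative to the post. The function name `requirement_to_xslt`, the output element names `posts` and `post`, and the sample XPaths are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def requirement_to_xslt(record_xpath, fields):
    """Build an XSLT 1.0 stylesheet (as a string) from a requirement spec.

    record_xpath: XPath selecting one node per forum post (hypothetical format).
    fields: dict mapping an output element name to an XPath relative to the post.
    """
    # Serialize the XSLT namespace with the conventional "xsl" prefix.
    ET.register_namespace("xsl", XSL_NS)

    stylesheet = ET.Element(f"{{{XSL_NS}}}stylesheet", {"version": "1.0"})
    template = ET.SubElement(stylesheet, f"{{{XSL_NS}}}template", {"match": "/"})

    # Literal result elements: <posts> wraps one <post> per matched record.
    posts = ET.SubElement(template, "posts")
    loop = ET.SubElement(posts, f"{{{XSL_NS}}}for-each", {"select": record_xpath})
    post = ET.SubElement(loop, "post")

    # One output element per requested field, filled by xsl:value-of.
    for name, xpath in fields.items():
        out = ET.SubElement(post, name)
        ET.SubElement(out, f"{{{XSL_NS}}}value-of", {"select": xpath})

    return ET.tostring(stylesheet, encoding="unicode")


if __name__ == "__main__":
    # Illustrative requirement; these XPaths are assumptions, not from the paper.
    print(requirement_to_xslt(
        "//div[@class='post']",
        {"author": ".//span[@class='author']", "body": ".//div[@class='content']"},
    ))
```

    The sketch only generates the stylesheet. Applying it needs a separate XSLT processor, since the Python standard library does not include one, and the crawled HTML would first have to be converted to well-formed XML.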