• DocumentCode
    1850366
  • Title

    FODEX -- Towards Generic Data Extraction from Web Forums

  • Author

    Pretzsch, Sebastian ; Muthmann, Klemens ; Schill, Alexander

  • Author_Institution
    Fac. of Comput. Sci., Tech. Univ. Dresden, Dresden, Germany
  • fYear
    2012
  • fDate
    26-29 March 2012
  • Firstpage
    821
  • Lastpage
    826
  • Abstract
    The web is a large source of valuable data. Today, this data is provided not only by professional publishers but also by everyone, in the form of user-generated content. A large part of such content is located in web forums. As platforms for sharing knowledge, they are easily accessible to everyone. However, the sheer number of forums makes it hard to find discussions on a specific topic. Automatic systems can filter and point to relevant information. Unfortunately, the content is presented in a human-readable layout and is not intended to be processed by automatic systems. Therefore, it is necessary to separate the content of a web forum discussion from its layout before doing any further information mining. This paper presents FODEX, a system for automatic forum data extraction. It extracts data from any forum and maps it to a unified data schema.
  • Keywords
    Internet; information resources; FODEX; Web forums; World Wide Web; data source; generic data extraction; Accuracy; Data mining; Feature extraction; HTML; Layout; Message systems; User-generated content; Information Extraction; Social Media; Web Scraping;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Information Networking and Applications Workshops (WAINA), 2012 26th International Conference on
  • Conference_Location
    Fukuoka
  • Print_ISBN
    978-1-4673-0867-0
  • Type

    conf

  • DOI
    10.1109/WAINA.2012.134
  • Filename
    6185496