• DocumentCode
    1839999
  • Title
    Distilling Informative Content from HTML News Pages
  • Author
    Ziegler, Cai-Nicolas ; Vögele, Christian ; Viermetz, Maximilian
  • Volume
    1
  • fYear
    2009
  • fDate
    15-18 Sept. 2009
  • Firstpage
    707
  • Lastpage
    712
  • Abstract
    Not only the Web abounds of information overload, but also its component molecules, the Web documents contained therein. In particular HTML news pages have become aggregates of cornucopian information blocks, such as advertisements, link lists, disclaimers and terms of use, or comments from readers. Thus, only a small fraction of all textual content appears dedicated to the actual news article itself. The amalgamation of relevant content with page clutter poses considerable concerns to applications that make use of such news information, such as search engines. We present an approach geared towards dissecting relevant from irrelevant textual content in an automated fashion. Our system extracts linguistic and structural features from merged text segments and applies various classifiers thereafter. We have conducted empirical analyses in order to compare our approach's classification performance with a human gold standard as well as two benchmark systems.
  • Keywords
    Aggregates; Conferences; Content based retrieval; Data mining; Gold; HTML; Humans; Intelligent agent; Performance analysis; Search engines; content extraction; learning; passage retrieval;
  • fLanguage
    English
  • Publisher
    iet
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM International Joint Conferences on
  • Conference_Location
    Milan, Italy
  • Print_ISBN
    978-0-7695-3801-3
  • Electronic_ISBN
    978-1-4244-5331-3
  • Type
    conf
  • DOI
    10.1109/WI-IAT.2009.119
  • Filename
    5284891