• DocumentCode
    3587592
  • Title

    Adaptive Post Recognition

  • Author

    Berger, Philipp ; Hennig, Patrick ; Petrick, Dominic ; Pursche, Marcel ; Meinel, Christoph

  • Author_Institution
    Hasso-Plattner-Inst., Univ. of Potsdam, Potsdam, Germany
  • fYear
    2014
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    Blogs, news portals, and discussion forums are of high interest for today's social interaction research. However, automatically extracting information from the raw HTML pages of these media channels remains a well-known problem. We introduce a novel approach that infers website templates from feeds, the syndication format of blogs and news portals. In contrast to related approaches that infer templates by clustering generic pages, we do not rely on a manually annotated training set. Instead, we use the feeds and their linked articles as the training set to identify characteristic XPaths. These paths identify the exact article content and article properties such as title, author, and publishing date. Furthermore, we can use these paths to detect article pages that are no longer linked from feeds. We show the precision gain by comparing our article content extraction with an alternative approach, e.g. boilerplate.
  • Keywords
    Web sites; hypermedia markup languages; information retrieval; pattern clustering; portals; Website template; adaptive post recognition; article page detection; automatic information extraction; blogs; characteristic XPath identification; discussion forum; feeds; generic page clustering; media channels; news portal; raw HTML page; social interaction research; syndication format; training set; Blogs; Containers; Data mining; Feeds; HTML; Web pages; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on
  • Type

    conf

  • DOI
    10.1109/ASONAM.2014.7092993
  • Filename
    7092993
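
The abstract describes using feed entries and their linked article pages as a training set to find characteristic XPaths. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes that a value already known from the feed (here the title) appears verbatim as a text node in the article page, and it uses simplified tag-only paths without positional indices or attribute predicates. All names (`TextPathCollector`, `paths_matching`, `characteristic_path`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextPathCollector(HTMLParser):
    """Records the simplified tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_paths = []  # list of (path, normalized_text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.text_paths.append(("/" + "/".join(self.stack), text))


def paths_matching(html, value):
    """Return the paths of all text nodes equal to a value taken from the feed."""
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    wanted = " ".join(value.split())
    return [path for path, text in collector.text_paths if text == wanted]


def characteristic_path(samples):
    """Pick the path that matches in the most (article_html, feed_value) pairs."""
    counts = Counter()
    for html, value in samples:
        counts.update(set(paths_matching(html, value)))
    return counts.most_common(1)[0][0] if counts else None
```

Counting across several articles is what separates a stable template location from coincidental matches: a title may also appear in the page's `title` element on some pages, but only the path shared by most samples is kept. A real system would need more robust matching (partial text, attributes, positional indices) than this sketch provides.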