DocumentCode
3587592
Title
Adaptive Post Recognition
Author
Berger, Philipp ; Hennig, Patrick ; Petrick, Dominic ; Pursche, Marcel ; Meinel, Christoph
Author_Institution
Hasso-Plattner-Inst., Univ. of Potsdam, Potsdam, Germany
fYear
2014
Firstpage
1
Lastpage
8
Abstract
Blogs, news portals, and discussion forums are of high interest for today's social interaction research. However, automatically extracting information from the raw HTML pages of these media channels remains a well-known problem. We introduce a novel approach that infers website templates from feeds, the syndication format of blogs and news portals. In contrast to related approaches that infer templates by clustering generic pages, we do not rely on a manually annotated training set. Instead, we use the feeds and their linked articles as the training set to identify characteristic XPaths. These paths identify the exact article content and article properties such as title, author, and publishing date. Furthermore, we can use these paths to detect article pages that are no longer linked from feeds. We show the precision gain by comparing our article content extraction with an alternative approach, e.g. boilerplate.
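The idea of using feed entries as training data for XPaths can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`element_paths`, `path_matching_text`, `characteristic_path`) are hypothetical, the pages are assumed to be well-formed XML so that Python's standard library can parse them, and the only feed property used is the title. Real pages would likely need a parser that tolerates malformed HTML, and the feed itself would be read from RSS or Atom.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(root):
    """Yield (path, element) pairs, using tag names plus the class attribute."""
    def step(node):
        cls = node.get("class")
        return node.tag + (f"[@class='{cls}']" if cls else "")

    stack = [(root, "/" + step(root))]
    while stack:
        node, path = stack.pop()
        yield path, node
        for child in node:
            stack.append((child, path + "/" + step(child)))


def path_matching_text(page_xml, expected_text):
    """Return the deepest path whose text equals a value known from the feed."""
    root = ET.fromstring(page_xml)
    target = " ".join(expected_text.split())
    matches = []
    for path, node in element_paths(root):
        text = " ".join("".join(node.itertext()).split())
        if text == target:
            matches.append(path)
    # Prefer the deepest element so that wrapper elements are not selected.
    return max(matches, key=lambda p: p.count("/")) if matches else None


def characteristic_path(samples):
    """Pick the path that matches most often across (page, feed_title) pairs."""
    counts = Counter()
    for page_xml, feed_title in samples:
        path = path_matching_text(page_xml, feed_title)
        if path is not None:
            counts[path] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Given several article pages from the same site together with the titles reported in the feed, `characteristic_path` returns the path that most often locates the title. The same matching step could be repeated for author, publishing date, and content, under the same assumptions.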
Keywords
Web sites; hypermedia markup languages; information retrieval; pattern clustering; portals; Website template; adaptive post recognition; article page detection; automatic information extraction; blogs; characteristic XPath identification; discussion forum; feeds; generic page clustering; media channels; news portal; raw HTML page; social interaction research; syndication format; training set; Blogs; Containers; Data mining; Feeds; HTML; Web pages; XML;
fLanguage
English
Publisher
IEEE
Conference_Titel
Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on
Type
conf
DOI
10.1109/ASONAM.2014.7092993
Filename
7092993