DocumentCode
1839999
Title
Distilling Informative Content from HTML News Pages
Author
Ziegler, Cai-Nicolas ; Vögele, Christian ; Viermetz, Maximilian
Volume
1
fYear
2009
fDate
15-18 Sept. 2009
Firstpage
707
Lastpage
712
Abstract
Not only the Web abounds of information overload, but also its component molecules, the Web documents contained therein. In particular HTML news pages have become aggregates of cornucopian information blocks, such as advertisements, link lists, disclaimers and terms of use, or comments from readers. Thus, only a small fraction of all textual content appears dedicated to the actual news article itself. The amalgamation of relevant content with page clutter poses considerable concerns to applications that make use of such news information, such as search engines. We present an approach geared towards dissecting relevant from irrelevant textual content in an automated fashion. Our system extracts linguistic and structural features from merged text segments and applies various classifiers thereafter. We have conducted empirical analyses in order to compare our approach's classification performance with a human gold standard as well as two benchmark systems.
Keywords
Aggregates; Conferences; Content based retrieval; Data mining; Gold; HTML; Humans; Intelligent agent; Performance analysis; Search engines; content extraction; learning; passage retrieval;
fLanguage
English
Publisher
iet
Conference_Titel
Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM International Joint Conferences on
Conference_Location
Milan, Italy
Print_ISBN
978-0-7695-3801-3
Electronic_ISBN
978-1-4244-5331-3
Type
conf
DOI
10.1109/WI-IAT.2009.119
Filename
5284891