Title :
Visual Area Classification for Article Identification in Web Documents
Author_Institution :
Fac. of Inf. Technol., Brno Univ. of Technol., Brno, Czech Republic
Date :
Aug. 30 2010-Sept. 3 2010
Abstract :
On the World Wide Web, news and other articles are usually published in complex HTML documents that contain many types of additional information that is not explicitly marked. In this paper, we propose a visual information analysis approach to article discovery in complex HTML documents. We use a classification approach to identify the important parts of the article within the page, and we propose an algorithm for detecting the article bounds within the page. Finally, we provide the results of an experimental evaluation.
Keywords :
Internet; document handling; pattern classification; HTML documents; Web documents; article discovery; article identification; visual area classification; visual information analysis approach; HTML; Image color analysis; Layout; Portals; Training; Visualization; Web pages; article extraction; document cleaning; page segmentation; visual analysis
Conference_Title :
Database and Expert Systems Applications (DEXA), 2010 Workshop on
Conference_Location :
Bilbao
Print_ISBN :
978-1-4244-8049-4
DOI :
10.1109/DEXA.2010.49