DocumentCode :
2424146
Title :
Visual Area Classification for Article Identification in Web Documents
Author :
Burget, Radek
Author_Institution :
Fac. of Inf. Technol., Brno Univ. of Technol., Brno, Czech Republic
fYear :
2010
fDate :
Aug. 30 2010-Sept. 3 2010
Firstpage :
171
Lastpage :
175
Abstract :
On the World Wide Web, news and other articles are usually published in complex HTML documents containing many types of additional information that is not explicitly marked. In this paper, we propose a visual information analysis approach to article discovery in complex HTML documents. We use a classification approach to identify the important parts of the article within the page, and we propose an algorithm for detecting the article bounds within the page. Finally, we provide the results of an experimental evaluation.
Keywords :
Internet; document handling; pattern classification; HTML documents; Web documents; article discovery; article identification; visual area classification; visual information analysis approach; HTML; Image color analysis; Layout; Portals; Training; Visualization; Web pages; article extraction; document cleaning; page segmentation; visual analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Database and Expert Systems Applications (DEXA), 2010 Workshop on
Conference_Location :
Bilbao
ISSN :
1529-4188
Print_ISBN :
978-1-4244-8049-4
Type :
conf
DOI :
10.1109/DEXA.2010.49
Filename :
5592057