DocumentCode :
2013326
Title :
Layout Based Information Extraction from HTML Documents
Author :
Burget, Radek
Author_Institution :
Brno Univ. of Technol., Brno
Volume :
2
fYear :
2007
fDate :
23-26 Sept. 2007
Firstpage :
624
Lastpage :
628
Abstract :
We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm detects the document layout; the extraction process then analyses the mutual positions of the detected blocks and their visual features. This approach is more robust than traditional DOM-based methods and opens new possibilities for specifying the extraction task.
Keywords :
document handling; hypermedia markup languages; information retrieval; HTML document; document layout detection; document visual information modelling; extraction task specification; layout based information extraction; page segmentation algorithm; visual feature; Algorithm design and analysis; Cascading style sheets; Data mining; HTML; Information analysis; Information technology; Page description languages; Robustness; Text analysis; Web sites;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location :
Parana
ISSN :
1520-5363
Print_ISBN :
978-0-7695-2822-9
Type :
conf
DOI :
10.1109/ICDAR.2007.4376990
Filename :
4376990