Title :
Layout Based Information Extraction from HTML Documents
Author_Institution :
Brno Univ. of Technol., Brno
Abstract :
We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm first detects the document layout; the extraction process then analyses the mutual positions of the detected blocks and their visual features. This approach is more robust than traditional DOM-based methods, and it opens new possibilities for specifying the extraction task.
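The idea of extracting data from the mutual positions of detected blocks can be illustrated with a minimal sketch. It is not the authors' algorithm: the `Block` structure, its fields, and the `find_value_right_of` helper are hypothetical names introduced here, and the blocks are assumed to have already been produced by some page segmentation step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical visual block: text plus bounding box and visual features."""
    text: str
    x: float          # left edge in page coordinates
    y: float          # top edge in page coordinates
    width: float
    height: float
    font_size: float
    bold: bool = False


def vertical_overlap(a: Block, b: Block) -> float:
    """Length of the vertical interval shared by two blocks (0 if none)."""
    top = max(a.y, b.y)
    bottom = min(a.y + a.height, b.y + b.height)
    return max(0.0, bottom - top)


def find_value_right_of(label_text: str, blocks: List[Block]) -> Optional[Block]:
    """Return the nearest block to the right of the label, on the same visual line.

    The label is matched by its text (case-insensitive, trailing colon ignored).
    Only block positions are used; the DOM nesting is never consulted.
    """
    wanted = label_text.lower()
    label = next(
        (b for b in blocks if b.text.strip().rstrip(":").lower() == wanted),
        None,
    )
    if label is None:
        return None
    label_right = label.x + label.width
    candidates = [
        b for b in blocks
        if b is not label
        and b.x >= label_right
        and vertical_overlap(label, b) > 0
    ]
    # Closest candidate by horizontal gap; None when nothing qualifies.
    return min(candidates, key=lambda b: b.x - label_right, default=None)
```

Because the rule refers only to rendered positions, it would in principle keep working if the same layout were produced by different markup (for example a table versus positioned `div` elements), which is the kind of robustness the abstract attributes to layout-based extraction.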
Keywords :
document handling; hypermedia markup languages; information retrieval; HTML document; document layout detection; document visual information modelling; extraction task specification; layout based information extraction; page segmentation algorithm; visual feature; Algorithm design and analysis; Cascading style sheets; Data mining; HTML; Information analysis; Information technology; Page description languages; Robustness; Text analysis; Web sites;
Conference_Title :
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location :
Parana
Print_ISBN :
978-0-7695-2822-9
DOI :
10.1109/ICDAR.2007.4376990