Title :
Heading-based sectional hierarchy identification for HTML documents
Author :
Pembe, F.C.; Güngör, Tunga
Author_Institution :
Bogazici Univ., Istanbul
Abstract :
Most documents found on the Web are prepared in HTML, a format designed primarily for the presentation of data. As a result, limitations arise when these documents are processed automatically for a semantic interpretation of their content. One such limitation is the inadequate representation of the sectional hierarchy (i.e., sections and subsections) of these documents and of the headings in this hierarchy. Obtaining this information automatically is difficult because of the underlying format and the cluttered structure of most Web pages. In this paper, we propose a novel approach to extracting heading-based sectional hierarchies from HTML documents. This is the first part of a research effort in which we aim to use this information in automatic summaries to improve the Web search experience of Internet users.
Keywords :
Internet; hypermedia markup languages; information retrieval; HTML documents; Web pages; Web search; automatic summaries; heading-based sectional hierarchy identification; Data engineering; Data mining; Design engineering; HTML; Search engines; Text analysis; XML
Conference_Title :
Computer and Information Sciences, 2007. ISCIS 2007. 22nd International Symposium on
Conference_Location :
Ankara
Print_ISBN :
978-1-4244-1363-8
Electronic_ISBN :
978-1-4244-1364-5
DOI :
10.1109/ISCIS.2007.4456839