Title :
Data-rich section extraction from HTML pages
Author :
Wang, Jiying ; Lochovsky, Fred H.
Author_Institution :
Dept. of Comput. Sci., Univ. of Sci. & Technol., China
Abstract :
We propose a novel algorithm, DSE (data-rich subtree extraction), to recognize and extract the data-rich section of an HTML page. We apply the DSE algorithm as a pre-processing "clean-up" step for two typical Web information retrieval problems: topic distillation and Web information extraction. Our experiments show that, for the test data sets used, the DSE algorithm can correctly identify data-rich sections of HTML pages with 100% accuracy. Therefore, it can effectively reduce the root set size for the topic distillation problem, thereby improving the precision and accuracy of the HITS algorithm. Furthermore, when applied to the Web information extraction problem using the IEPAD algorithm, it can decrease the number of patterns discovered by this algorithm, thus shortening its time cost to generalize a wrapper for HTML pages.
Keywords :
Internet; Web sites; hypermedia markup languages; information retrieval; DSE algorithm; HITS algorithm; HTML pages; Web information extraction; Web information retrieval problems; accuracy; data-rich section extraction; data-rich subtree extraction; pattern discovery; pre-processing clean-up step; precision; root set size; topic distillation; wrapper; Bars; Computer science; Costs; Data mining; HTML; Information retrieval; Navigation; Testing; Web pages
Conference_Title :
Web Information Systems Engineering, 2002. WISE 2002. Proceedings of the Third International Conference on
Print_ISBN :
0-7695-1766-8
DOI :
10.1109/WISE.2002.1181667