DocumentCode :
3211818
Title :
Data-rich section extraction from HTML pages
Author :
Wang, Jiying ; Lochovsky, Fred H.
Author_Institution :
Dept. of Comput. Sci., Hong Kong Univ. of Sci. & Technol., China
fYear :
2002
fDate :
12-14 Dec. 2002
Firstpage :
313
Lastpage :
322
Abstract :
We propose a novel algorithm, DSE (data-rich subtree extraction), to recognize and extract the data-rich section of an HTML page. We apply the DSE algorithm as a pre-processing "clean-up" step for two typical Web information retrieval problems: topic distillation and Web information extraction. Our experiments show that, for the test data sets used, the DSE algorithm can correctly identify data-rich sections of HTML pages with 100% accuracy. Therefore, it can effectively reduce the root set size for the topic distillation problem, thereby improving the precision and accuracy of the HITS algorithm. Furthermore, when applied to the Web information extraction problem using the IEPAD algorithm, it can decrease the number of patterns discovered by this algorithm, thus shortening its time cost to generalize a wrapper for HTML pages.
Keywords :
Internet; Web sites; hypermedia markup languages; information retrieval; DSE algorithm; HITS algorithm; HTML pages; Web information extraction; Web information retrieval problems; accuracy; data-rich section extraction; data-rich subtree extraction; pattern discovery; pre-processing clean-up step; precision; root set size; topic distillation; wrapper; Bars; Computer science; Costs; Data mining; HTML; Information retrieval; Internet; Navigation; Testing; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Information Systems Engineering, 2002. WISE 2002. Proceedings of the Third International Conference on
Print_ISBN :
0-7695-1766-8
Type :
conf
DOI :
10.1109/WISE.2002.1181667
Filename :
1181667