Regional Information Center for Science and Technology (RICEST) - Data-rich section extraction from HTML pages

DocumentCode :

3211818

Title :

Data-rich section extraction from HTML pages

Author :

Wang, Jiying ; Lochovsky, Fred H.

Author_Institution :

Dept. of Comput. Sci., Univ. of Sci. & Technol., China

fYear :

2002

fDate :

12-14 Dec. 2002

Firstpage :

313

Lastpage :

322

Abstract :

We propose a novel algorithm, DSE (data-rich subtree extraction), to recognize and extract the data-rich section of an HTML page. We apply the DSE algorithm as a pre-processing "clean-up" step for two typical Web information retrieval problems: topic distillation and Web information extraction. Our experiments show that, for the test data sets used, the DSE algorithm can correctly identify data-rich sections of HTML pages with 100% accuracy. Therefore, it can effectively reduce the root set size for the topic distillation problem, thereby improving the precision and accuracy of the HITS algorithm. Furthermore, when applied to the Web information extraction problem using the IEPAD algorithm, it can decrease the number of patterns discovered by this algorithm, thus shortening the time it takes to generalize a wrapper for HTML pages.

Keywords :

Internet; Web sites; hypermedia markup languages; information retrieval; DSE algorithm; HITS algorithm; HTML pages; Web information extraction; Web information retrieval problems; accuracy; data-rich section extraction; data-rich subtree extraction; pattern discovery; pre-processing clean-up step; precision; root set size; topic distillation; wrapper; Bars; Computer science; Costs; Data mining; HTML; Information retrieval; Internet; Navigation; Testing; Web pages;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Information Systems Engineering, 2002. WISE 2002. Proceedings of the Third International Conference on

Print_ISBN :

0-7695-1766-8

Type :

conf

DOI :

10.1109/WISE.2002.1181667

Filename :

1181667

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3211818