DocumentCode
2226728
Title
A hybrid method for Web data extraction
Author
Wang, Yu ; Zhou, Lizhu
Author_Institution
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
fYear
2003
fDate
13-17 Oct. 2003
Firstpage
417
Lastpage
420
Abstract
Web data extraction refers to technology that helps people find the information they want on the Web. We first classify existing data extraction algorithms into two classes, top-down and bottom-up, and then analyze their strengths and weaknesses in terms of extraction accuracy. On the basis of this analysis, we present a hybrid algorithm, bi-direction data extraction (BiDDE for short), which combines the strengths of both top-down and bottom-up algorithms while avoiding their weaknesses. The experimental results show that BiDDE not only achieves higher accuracy than the top-down and bottom-up algorithms but also delivers satisfactory performance.
Keywords
Internet; hypermedia markup languages; information retrieval; tree searching; HTML documents; Web data extraction; bi-direction data extraction algorithm; bottom-up algorithms; top-down algorithms; Algorithm design and analysis; Bidirectional control; Computer science; Data mining; Databases; HTML; Particle separators; Web pages; XML
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on
Print_ISBN
0-7695-1932-6
Type
conf
DOI
10.1109/WI.2003.1241229
Filename
1241229