Title :
Extracting Multi-Records from Web Pages
Author_Institution :
Key Lab. of Data Eng. & Knowledge Eng., Renmin Univ. of China, Beijing, China
Abstract :
Extracting multi-records from web pages is useful because it allows information from multiple sources to be integrated into value-added services. Existing techniques remain limited by the restrictions they impose and by their accuracy. This paper proposes a new method that performs the multi-records extraction task automatically. First, the HTML tag tree is built through an embedded browser interface, which solves the AJAX problem. Second, data regions are found by data chunk comparison, and a simple tree matching method is proposed to compute chunk similarity. Finally, the main data region is determined and the multi-records are extracted. Experimental results show that our method dramatically outperforms other existing methods and extracts multi-records from pages very accurately.
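The chunk comparison step can be illustrated with the classic Simple Tree Matching algorithm, which counts the largest number of node pairs that can be matched between two ordered tag trees. The sketch below is only an illustration under stated assumptions: the abstract does not give the paper's exact matching variant or its similarity formula, so the `Node` class, the function names, and the normalization (matched pairs divided by the average tree size) are all hypothetical choices made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tag-tree node: a tag name plus ordered children."""
    tag: str
    children: list[Node] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Classic Simple Tree Matching: size of the maximum order-preserving
    matching between two trees; roots must carry the same tag to match."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best matching between the first i children of `a`
    # and the first j children of `b` (dynamic programming, like LCS).
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 counts the matched roots


def chunk_similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched pairs over the average tree size."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))
```

Under this assumed normalization, two structurally identical chunks score 1.0 and chunks with different root tags score 0.0, so a threshold on `chunk_similarity` could be used to group adjacent chunks into a candidate data region.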
Keywords :
Web sites; data structures; hypermedia markup languages; online front-ends; user interfaces; HTML tag tree; Web pages; chunk similarity; data regions; embedded browser interface; multirecords extraction; tree matching method; value-added services; Cascading style sheets; Conference management; Data engineering; Data mining; HTML; Information resources; Java; Knowledge engineering; Laboratories; Multi-records extraction; Web2.0; tree similarity; wrapper generation
Conference_Title :
Semantics, Knowledge and Grid, 2008. SKG '08. Fourth International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-0-7695-3401-5
Electronic_ISBN :
978-0-7695-3401-5
DOI :
10.1109/SKG.2008.47