Title :
Web Data Extraction Based on Simple Tree Matching
Author :
Wang, Hua ; Zhang, Yang
Author_Institution :
Coll. of Inf. Eng., Northwest A&F Univ., Yangling, China
Abstract :
The amount of information on the Internet has grown exponentially, and Internet users are overwhelmed by it. How to automatically extract useful information from relevant pages, so as to provide users with a convenient and rapid information query platform, is an important issue. In this paper, we present a Web data extraction method based on the simple tree matching algorithm, which analyzes the structure and content of Web documents. Experimental results on Web data from several well-known websites show that the proposed method can effectively extract data records from similar Web pages, with extraction precision reaching about 90%, and can meet the requirement of extracting accurate data in real-life applications.
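The abstract names simple tree matching without describing it. As a rough illustration only, the following Python sketch shows the classic dynamic-programming formulation of simple tree matching on ordered, labeled trees. It is not the authors' implementation: the `Node` class, the function names, and the normalization by average tree size are assumptions made for this example, and the paper's use of DOM and XPath is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching.

    Roots with different tags do not match at all. Otherwise the children
    are aligned with a dynamic program similar to longest common
    subsequence, where the weight of a pair is the recursive match size.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best matching between the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            table[i][j] = max(table[i][j - 1], table[i - 1][j], paired)
    return table[m][n] + 1  # +1 for the matched roots


def tree_size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(tree_size(child) for child in node.children)


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: match size divided by the average tree size."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


if __name__ == "__main__":
    row_a = Node("tr", [Node("td"), Node("td"), Node("td", [Node("a")])])
    row_b = Node("tr", [Node("td"), Node("td", [Node("a")])])
    print(simple_tree_matching(row_a, row_b))
    print(round(similarity(row_a, row_b), 3))
```

For the two table rows above, the matching contains 4 node pairs out of trees of size 5 and 4. In a record-extraction setting, subtrees whose similarity exceeds a chosen threshold would be treated as instances of the same data record template; the threshold itself is a tuning choice not stated in the abstract.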
Keywords :
Web services; data mining; query processing; trees (mathematics); Internet; Web data extraction method; Web documents; Web pages; Web sites; information query platform; simple tree matching algorithm; Artificial intelligence; Books; Data mining; Feature extraction; HTML; Heuristic algorithms; Web pages; DOM; Information Extraction; Simple tree matching; XPath;
Conference_Title :
Information Engineering (ICIE), 2010 WASE International Conference on
Conference_Location :
Beidaihe, Hebei
Print_ISBN :
978-1-4244-7506-3
Electronic_ISBN :
978-1-4244-7507-0
DOI :
10.1109/ICIE.2010.100