Title :
Web Data Extraction System Based on Label Library
Author :
Tan, Shoubiao ; Xu, Chao ; Jiang, Yuan
Author_Institution :
Sch. of Electron. Sci. & Technol., Anhui Univ., Hefei, China
Abstract :
This paper proposes a Web information extraction system based on a label library for extracting information from data-intensive Web pages. The system downloads dynamic Web pages with the help of a knowledge database, converts them to XML documents after preprocessing, and mines data regions using the MDR repeated-pattern discovery algorithm. It then recognizes the structure of these regions and extracts data from them through a novel hierarchical pattern recognition and data extraction algorithm based on the label library, and stores the extracted data in the knowledge database to support further information extraction. Experiments show that the system achieves high precision and adapts to Web pages from different domains and with different structures.
Keywords :
Internet; XML; data mining; document handling; information retrieval; pattern recognition; MDR repeated patterns discovery algorithm; Web data extraction system; XML documents; dynamic Web pages; hierarchic pattern recognition; knowledge database; label library; Chaos; Databases; Fuzzy systems; Information resources; Libraries; Uniform resource locators; Web pages; data intensive web pages; hierarchic pattern recognition and data extraction
Conference_Title :
Fuzzy Systems and Knowledge Discovery, 2009. FSKD '09. Sixth International Conference on
Conference_Location :
Tianjin
Print_ISBN :
978-0-7695-3735-1
DOI :
10.1109/FSKD.2009.208
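
Illustrative sketch :
The abstract outlines a pipeline: convert pages to XML, find repeated data regions, then map fields to known labels. The Python sketch below is only a rough illustration of that idea and is not the authors' algorithm. It assumes the page is already well-formed XML, replaces MDR with a much simpler heuristic (sibling nodes sharing the same tag signature), and uses a hypothetical `LABEL_LIBRARY` dictionary and "label: value" field text, none of which come from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical label library (an assumption for illustration):
# maps label text seen on pages to canonical field names.
LABEL_LIBRARY = {
    "title": "title",
    "name": "title",
    "price": "price",
    "cost": "price",
}


def signature(node):
    """Structural signature: the node's tag plus the tags of its children."""
    return (node.tag, tuple(child.tag for child in node))


def find_data_regions(root, min_repeats=2):
    """Group sibling nodes that share a signature.

    This is a simplified stand-in for MDR, not the MDR algorithm itself.
    """
    regions = []
    for parent in root.iter():
        children = list(parent)
        counts = Counter(signature(c) for c in children)
        for sig, n in counts.items():
            # Keep only repeated nodes that have child fields of their own.
            if n >= min_repeats and sig[1]:
                regions.append([c for c in children if signature(c) == sig])
    return regions


def extract_record(node):
    """Map 'label: value' field text to canonical names via the label library."""
    record = {}
    for field in node:
        text = (field.text or "").strip()
        label, sep, value = text.partition(":")
        key = LABEL_LIBRARY.get(label.strip().lower())
        if sep and key:
            record[key] = value.strip()
    return record


def extract(xml_text):
    """Return one list of extracted records per detected data region."""
    root = ET.fromstring(xml_text)
    return [[extract_record(n) for n in region]
            for region in find_data_regions(root)]


SAMPLE = """
<html><body>
  <div>
    <li><span>Name: Widget</span><span>Price: 10</span></li>
    <li><span>Name: Gadget</span><span>Cost: 12</span></li>
  </div>
  <p>footer</p>
</body></html>
"""

result = extract(SAMPLE)
print(result)
```

In this sketch the two `li` records are detected as one region, and the differing labels "Price" and "Cost" are both normalized to the same field through the label library, which is the role the abstract attributes to that component.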