Title :
Web Data Extraction Based on XBRL-GL Taxonomy
Author :
Hanyang Luo ; Jinling Gao
Author_Institution :
Coll. of Manage., Shenzhen Univ., Shenzhen, China
Abstract :
The Web has become one of the most important means of access to a wide range of information resources. A central challenge is how to extract important data from a large number of Web pages and transform it into more structured, standardized, and semantically meaningful information that can be queried and analyzed with mature techniques from databases, data warehousing, and other fields. We design a wrapper generator that combines data extraction techniques with XBRL technology, based on the XBRL-GL taxonomy. The wrapper transforms HTML documents into XML according to an analysis of the HTML document structure, and then uses XPath to locate the data. In this way, the data can be extracted accurately and stored in a standard form.
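The two wrapper steps named in the abstract (HTML to XML, then XPath lookup) can be illustrated with a minimal sketch. This is not the authors' wrapper generator: the sample page, the table id "ledger", the class name HtmlToXml, the function extract_entries, and the output field names (account, debit, credit) are all assumptions made for illustration, and the mapping onto actual XBRL-GL taxonomy elements is left out. The sketch uses only the Python standard library, whose ElementTree module supports a limited subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never have a closing tag in HTML (partial list, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Hypothetical sample page; the table id "ledger" is an assumption.
SAMPLE_HTML = """
<html><body>
<table id="ledger">
  <tr><th>Account</th><th>Debit</th><th>Credit</th></tr>
  <tr><td>Cash</td><td>100.00</td><td></td></tr>
  <tr><td>Sales</td><td></td><td>100.00</td></tr>
</table>
</body></html>
"""


class HtmlToXml(HTMLParser):
    """Step 1: build a well-formed XML tree from an HTML document."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + text
        else:
            current.text = (current.text or "") + text


def extract_entries(html):
    """Step 2: locate the data with XPath expressions and return plain records."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    table = parser.root.find(".//table[@id='ledger']")
    if table is None:
        return []
    entries = []
    for row in table.findall(".//tr"):
        cells = [td.text or "" for td in row.findall("td")]
        if len(cells) == 3:  # header rows use <th> and are skipped
            entries.append(
                {"account": cells[0], "debit": cells[1], "credit": cells[2]}
            )
    return entries


if __name__ == "__main__":
    for entry in extract_entries(SAMPLE_HTML):
        print(entry)
```

The sketch assumes reasonably well-nested input; real pages with unclosed tags would need a tolerant HTML cleaner before the XPath step, and a full XPath engine would be needed for expressions beyond the ElementTree subset.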
Keywords :
Web sites; XML; information retrieval; HTML document; Web data extraction; Web page; XBRL-GL taxonomy; XML document; XPath; data warehouse; database; information resources; semantic information; wrapper generator; Data mining; Data warehouses; Databases; HTML; Information analysis; Taxonomy; Text analysis; Web pages
Conference_Title :
Information Processing, 2009. APCIP 2009. Asia-Pacific Conference on
Conference_Location :
Shenzhen
Print_ISBN :
978-0-7695-3699-6
DOI :
10.1109/APCIP.2009.97