Title :
Web data extraction using textual anchors
Author :
Ahmad Pouramini;Shahram Nasiri
Author_Institution :
Department of Computer Engineering, Sirjan University of Technology, Sirjan, Iran
Abstract :
In this paper, we present an approach and a visual tool, called ABDES, for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content, simulating the way a human user scans a web page for specific data. To create a wrapper, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call anchors, to build patterns for the target data regions and data records. We offer a polynomial-time data extraction algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure, which is the output of the algorithm. Because the generated wrappers rely on visible text, they are robust and independent of the HTML structure, and can therefore be adapted to multiple websites to gather and integrate information.
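The idea of anchor-based extraction can be illustrated with a minimal Python sketch. This is not the ABDES implementation, and it makes no claim about the paper's actual algorithm or complexity. The sample page, the `FIELDS` patterns, and the helper names `visible_text` and `extract_records` are all hypothetical, chosen only for illustration. The sketch assumes a well-formed page, assumes every record contains every field, and looks for the first anchor only in an element's own leading text. It climbs bottom-up from each anchor to the smallest element whose visible text matches all field patterns, then reads the fields top-down from that element and maps them onto an XML tree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for this sketch).
PAGE = """
<html><body>
  <div id="nav">Home | Products</div>
  <table>
    <tr><td><b>Title:</b> Blue Widget</td><td>Price: $10.50</td></tr>
    <tr><td><b>Title:</b> Red Widget</td><td>Price: $7.25</td></tr>
  </table>
</body></html>
"""

# Hypothetical record pattern: each field is a textual anchor plus one capture group.
FIELDS = {
    "title": re.compile(r"Title:\s*(.+?)\s*(?=Price:|$)"),
    "price": re.compile(r"Price:\s*\$(\d+\.\d{2})"),
}
FIRST_ANCHOR = "Title:"


def visible_text(elem):
    """Return the element's text content with whitespace normalized."""
    return " ".join(" ".join(elem.itertext()).split())


def extract_records(root):
    """Locate records by their anchors and return them as an XML element."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    out = ET.Element("records")
    seen = set()
    for elem in root.iter():
        if FIRST_ANCHOR not in (elem.text or ""):
            continue
        # Bottom-up: climb until one element's text matches every field pattern.
        node, matches = elem, {}
        while node is not None:
            text = visible_text(node)
            matches = {name: pat.search(text) for name, pat in FIELDS.items()}
            if all(matches.values()):
                break
            node = parent.get(node)
        if node is None or id(node) in seen:
            continue
        seen.add(id(node))
        # Top-down: read the captured fields and map them onto the XML output.
        record = ET.SubElement(out, "record")
        for name, match in matches.items():
            ET.SubElement(record, name).text = match.group(1)
    return out


if __name__ == "__main__":
    result = extract_records(ET.fromstring(PAGE))
    print(ET.tostring(result, encoding="unicode"))
```

Because the patterns refer only to visible text, the same `FIELDS` definition would still apply if the rows were marked up as `div` elements instead of table rows, which is the kind of structural independence the abstract describes.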
Keywords :
"Decision support systems","Data mining","Information retrieval","Time complexity"
Conference_Title :
Knowledge-Based Engineering and Innovation (KBEI), 2015 2nd International Conference on
DOI :
10.1109/KBEI.2015.7436204