Regional Information Center for Science and Technology - Data Extraction using Content-Based Handles

Title of article :

Data Extraction using Content-Based Handles

Author/Authors :

Pouramini, A. - University of Sirjan Technology; Khaje Hassani, S. - University of Sirjan Technology; Nasiri, Sh. - University of Sirjan Technology

Pages :

From page :

399

To page :

407

Abstract :

In this paper, we present an approach and a visual tool, called Handle-based Wrapper (HWrap), for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content to identify the data regions on a web page. The extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call handles, to construct patterns for the target data regions and data records. We offer a polynomial-time algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure, which forms the output of the wrapper. The wrappers generated by this method are robust and independent of the HTML structure. Therefore, they can be adapted to similar websites to gather and integrate information.
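
The idea of handles can be illustrated with a minimal sketch. It is not the HWrap implementation, whose details are not given in this record. It assumes the page is well-formed XML (real HTML usually needs a tolerant parser), and the sample page, the `HANDLES` table, and the function names are all hypothetical. Each handle is a regular expression over visible text. The sketch climbs bottom-up from the deepest element matching one handle to the nearest ancestor matching all of them, treats that ancestor as a data record, then descends top-down to read each field into an XML output.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page; assumed well-formed so ElementTree can parse it.
PAGE = """
<html><body>
  <div id="nav">Home | About</div>
  <table>
    <tr><td><b>Title:</b> Dune</td><td><i>Price:</i> $9.99</td></tr>
    <tr><td><b>Title:</b> Emma</td><td><i>Price:</i> $4.50</td></tr>
  </table>
</body></html>
"""

# Hypothetical handles: output field name -> pattern over visible text.
HANDLES = {
    "title": re.compile(r"Title:\s*(.+)"),
    "price": re.compile(r"Price:\s*(\S+)"),
}

def visible_text(elem):
    """Whitespace-normalized text of an element and all its descendants."""
    return " ".join(" ".join(elem.itertext()).split())

def deepest_match(elem, rx):
    """Top-down: descend while some child's visible text still matches rx."""
    for child in elem:
        if rx.search(visible_text(child)):
            return deepest_match(child, rx)
    return elem

def wrap(page, handles=HANDLES):
    root = ET.fromstring(page.strip())
    parent = {child: p for p in root.iter() for child in p}
    seed_rx = next(iter(handles.values()))
    out = ET.Element("records")
    seen = set()
    for elem in root.iter():
        # Seeds are the deepest elements whose text matches the first handle.
        if not seed_rx.search(visible_text(elem)):
            continue
        if deepest_match(elem, seed_rx) is not elem:
            continue
        # Bottom-up: climb to the nearest ancestor matching every handle.
        node = elem
        while node is not None and not all(
            rx.search(visible_text(node)) for rx in handles.values()
        ):
            node = parent.get(node)
        if node is None or id(node) in seen:
            continue
        seen.add(id(node))
        # Top-down: read each field from its deepest matching element.
        record = ET.SubElement(out, "record")
        for name, rx in handles.items():
            field = deepest_match(node, rx)
            ET.SubElement(record, name).text = rx.search(visible_text(field)).group(1)
    return ET.tostring(out, encoding="unicode")

print(wrap(PAGE))
```

Because the patterns refer only to visible text, the same handles would still apply if the table rows were replaced by `div` elements, which is the kind of independence from HTML structure the abstract describes.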

Keywords :

Web Data Record Extraction, Web Wrapper Generation, Web Information Extraction

Journal title :

Journal of Artificial Intelligence Data Mining

Serial Year :

2018

Record number :

2449355

Link To Document :

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=2449355