DocumentCode
1786316
Title
A Rendering-Based Method for Selecting the Main Data Region in Web Pages
Author
Neiva Lopes Figueiredo, Leandro ; Almeida Ferreira, Anderson ; Tavares de Assis, Guilherme
Author_Institution
Dept. de Comput., Univ. Fed. de Ouro Preto, Ouro Preto, Brazil
fYear
2014
fDate
22-24 Oct. 2014
Firstpage
24
Lastpage
32
Abstract
Extracting data from web pages is an important task for several applications, such as comparison shopping and data mining. Much of that data is provided by search result pages, in which each result, called a search result record, represents a record from a database. One of the most important steps in extracting such records is identifying, among the different data regions of a page, the one that contains the records to be extracted. An incorrect identification of this region may lead to an incorrect extraction of the search result records. In this paper, we propose a simple but efficient method that generates a path expression to select the main data region of a given page, based on the rendering area information of its elements. The generated path expression may be used by wrappers to extract the search result records and their data units, reducing the wrappers' complexity and increasing their accuracy. Experimental results using web pages from several domains show that the method is highly effective.
Keywords
Internet; information retrieval; rendering (computer graphics); Web pages; comparison shopping; data extraction; data mining; data region; data units; path expression; rendering area information; rendering-based method; search result record; Accuracy; Browsers; Data mining; HTML; Rendering (computer graphics); Visualization; Web pages; main data region; path expression; rendering information; visual information; wrapper;
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Congress (LA-WEB), 2014 9th Latin American
Conference_Location
Ouro Preto
Print_ISBN
978-1-4799-6952-4
Type
conf
DOI
10.1109/LAWeb.2014.14
Filename
7000168