• DocumentCode
    1786316
  • Title

    A Rendering-Based Method for Selecting the Main Data Region in Web Pages

  • Author

    Neiva Lopes Figueiredo, Leandro ; Almeida Ferreira, Anderson ; Tavares de Assis, Guilherme

  • Author_Institution
    Dept. de Comput., Univ. Fed. de Ouro Preto, Ouro Preto, Brazil
  • fYear
    2014
  • fDate
    22-24 Oct. 2014
  • Firstpage
    24
  • Lastpage
    32
  • Abstract
    Extracting data from web pages is an important task for several applications, such as comparison shopping and data mining. Much of that data is provided by search result pages, in which each result, called a search result record, represents a record from a database. One of the most important steps in extracting such records is identifying, among the different data regions of a page, the one that contains the records to be extracted. An incorrect identification of this region may lead to an incorrect extraction of the search result records. In this paper, we propose a simple but efficient method that generates a path expression to select the main data region of a given page, based on the rendering area information of its elements. The generated path expression may be used by wrappers for extracting the search result records and their data units, reducing the wrappers' complexity and increasing their accuracy. Experimental results using web pages from several domains show that the method is highly effective.
  • Keywords
    Internet; information retrieval; rendering (computer graphics); Web pages; comparison shopping; data extraction; data mining; data region; data units; path expression; rendering area information; rendering-based method; search result record; Accuracy; Browsers; Data mining; HTML; Rendering (computer graphics); Visualization; main data region; rendering information; visual information; wrapper
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Congress (LA-WEB), 2014 9th Latin American
  • Conference_Location
    Ouro Preto
  • Print_ISBN
    978-1-4799-6952-4
  • Type

    conf

  • DOI
    10.1109/LAWeb.2014.14
  • Filename
    7000168