• DocumentCode
    658363
  • Title

    GRABEX: A Graph-Based Method for Web Site Block Classification and Its Application on Mining Breadcrumb Trails

  • Author

    Keller, Matthias ; Hartenstein, Hannes

  • Author_Institution
    Steinbuch Centre for Computing, Karlsruhe Institute of Technology, Karlsruhe, Germany
  • Volume
    1
  • fYear
    2013
  • fDate
    17-20 Nov. 2013
  • Firstpage
    290
  • Lastpage
    297
  • Abstract
    To interact with a Web site, humans must be able to distinguish and understand the purposes of different page blocks, e.g., header, navigation bar, or content area. In the case of navigational blocks, the block type determines the function of the hyperlinks it contains. For example, the hyperlinks in the main menu block represent the main topics of a site, while the hyperlinks in a breadcrumb trail show the location of the current page in the content hierarchy. Hence, mining navigational blocks of specific types can provide valuable input for applications in crawling, ranking, or presenting search results. However, analyzing visual features to identify specific navigational blocks as humans do is a difficult, resource-consuming task, and no general solution exists yet. In this paper, we propose a novel approach to the problem and present the graph-based block extraction method (GRABEX), which can be adapted to classify different types of navigational blocks. The fundamental concept is that a separate graph-based link analysis is conducted for each group of blocks. Each block group consists of blocks from different pages that have similar CSS class attributes. This allows discovering navigational blocks of specific types, e.g., breadcrumb trails, without analyzing any presentational features. We apply our method to mine breadcrumb trails and are the first to describe an applicable solution to this problem. In an extensive evaluation including 700 different sites, the GRABEX method performed with perfect precision and high recall.
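    The grouping idea in the abstract (blocks from different pages with similar CSS class attributes form one group, and each group gets its own link graph) can be illustrated with a minimal sketch. This is not the authors' implementation: the input layout, the function names `group_blocks_by_class` and `group_link_graph`, and the use of exact class-set equality as the "similarity" test are all assumptions made for illustration. The abstract does not specify the classification rules applied to each graph, so they are omitted here.

```python
from collections import defaultdict

def group_blocks_by_class(pages):
    """Group blocks from different pages by their CSS class attribute.

    pages: dict mapping page URL -> list of (css_class, [link targets]).
    Assumption: two class attributes are "similar" when they contain the
    same class tokens, regardless of order.
    """
    groups = defaultdict(dict)
    for page_url, blocks in pages.items():
        for css_class, links in blocks:
            key = " ".join(sorted(css_class.split()))
            groups[key].setdefault(page_url, []).extend(links)
    return dict(groups)

def group_link_graph(group):
    """Build a directed graph (page -> set of link targets) for one group."""
    graph = defaultdict(set)
    for page_url, links in group.items():
        graph[page_url].update(links)
    return dict(graph)

# Hypothetical three-page site used only to exercise the sketch.
pages = {
    "/": [("nav main", ["/a", "/b"])],
    "/a": [("main nav", ["/a", "/b"]), ("breadcrumb", ["/"])],
    "/a/x": [("breadcrumb", ["/", "/a"])],
}

groups = group_blocks_by_class(pages)
breadcrumb_graph = group_link_graph(groups["breadcrumb"])
```

    In this toy input, the per-group graph for the `breadcrumb` class links each page only to its ancestors, while the `main nav` group links every page to the same targets. A structural difference of this kind is what a link analysis per group could exploit, under the assumptions stated above.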
  • Keywords
    Web sites; data mining; graph theory; pattern classification; user interfaces; CSS class attributes; GRABEX; Web site block classification; breadcrumb trail mining; content area; content hierarchy; graph-based block extraction method; graph-based link-analysis; header; hyperlink functionality; menu block; navigation bar; navigational block classification; navigational block mining; page blocks; precision; recall; visual feature analysis; Data mining; Feature extraction; HTML; Navigation; Vegetation; Visualization; Web sites; Web site block classification; breadcrumb mining; page segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on
  • Conference_Location
    Atlanta, GA
  • Print_ISBN
    978-1-4799-2902-3
  • Type

    conf

  • DOI
    10.1109/WI-IAT.2013.42
  • Filename
    6690028