• Title of article

    Automatic sitemaps generation: Exploring website structures using block extraction and hyperlink analysis

  • Author/Authors

    Lin, Shian-Hua; Chu, Kuan-Pak; Chiu, Chun-Ming

  • Issue Information
    Journal with serial number, year 2011
  • Pages
    15
  • From page
    3944
  • To page
    3958
  • Abstract
    Sitemaps designed by webmasters not only present the main usage flows for users but also organize the hierarchical concepts of the website. However, websites seldom provide sitemap pages that help users browse pages easily. Even when provided, these sitemaps are not designed for machine understanding, although a few websites provide sitemaps in XML format. In this paper, we develop a system, SiteMap Generator (SMG), that automatically generates the hierarchical sitemap for a website. SMG consists of five components. Sequence Translator translates a page’s HTML source into a long sequence, and Page Partitioner then splits the page into blocks by analyzing the sequence complexity. Block Identifier categorizes each block into one of three block types: content, structure, or redundant. Using the popular sequence searching tool BLAST, Block Cluster calculates similarities between blocks so that blocks with similar functionality are grouped and considered as candidate blocks for the sitemap. Finally, Hyperlink Analyzer transforms page-to-page links into block-to-block links and applies Kleinberg’s HITS algorithm to estimate the authority and hub values of each block. A block entropy value derived from feature entropies is also used to improve HITS. Experiments on three websites (Mozilla, CNN, and Yahoo! News) show that SMG is useful for partitioning a page into blocks (F1 = 86%), identifying the block type (F1 = 85%), and generating the sitemap for the website (F1 = 63%).
  • Keywords
    Block extraction, Web mining, Hyperlink analysis, Sitemap, Sequence analysis
  • Journal title
    Expert Systems with Applications
  • Serial Year
    2011
  • Record number

    2349054
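
The abstract's final step, applying Kleinberg's HITS algorithm to block-to-block links, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hits`, the dictionary-based link format, the block identifiers, and the fixed iteration count are all assumptions made for illustration, and the entropy-based refinement mentioned in the abstract is omitted because the record does not describe how it is computed.

```python
import math


def hits(links, iterations=50):
    """Plain HITS over a directed graph of blocks.

    links: dict mapping a source block id to a list of target block ids
           (a hypothetical representation of block-to-block links).
    Returns (authority, hub) dicts with L2-normalized scores.
    """
    # Collect every block that appears as a source or a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum of hub scores of blocks linking to this block.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub update: sum of authority scores of blocks this block links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return auth, hub


# Hypothetical example: a navigation block and a footer block link to content blocks.
example_links = {"nav": ["news", "sports"], "footer": ["news"]}
authority, hub_scores = hits(example_links)
```

In this toy graph the navigation block, which links to the most blocks, receives the highest hub score; this is the kind of signal the abstract associates with candidate sitemap blocks.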