Title of article :
Automatic sitemaps generation: Exploring website structures using block extraction and hyperlink analysis
Author/Authors :
Lin, Shian-Hua and Chu, Kuan-Pak and Chiu, Chun-Ming
Issue Information :
Journal with serial number, year 2011
Abstract :
Sitemaps designed by webmasters not only present the main usage flows for users, but also organize the hierarchical concepts of the website. However, websites seldom provide sitemap pages to help users browse pages easily. Even when provided, these sitemaps are not designed for machine understanding, although a few websites provide sitemaps in XML format. In this paper, we develop a system, SiteMap Generator (SMG), to automatically generate the hierarchical sitemap for a website. SMG consists of five components. Sequence Translator translates a page’s HTML source into a long sequence, and Page Partitioner then splits the page into blocks by analyzing the sequence complexity. Block Identifier categorizes each block into one of three block types: content, structure or redundant. Using the popular sequence searching tool BLAST, Block Cluster calculates similarities between blocks so that blocks with similar functionality are grouped and considered as candidate blocks for the sitemap. Finally, Hyperlink Analyzer transforms page-to-page links into block-to-block links and applies Kleinberg’s HITS algorithm to estimate the authority and hub values of each block. A block entropy value derived from feature entropies is also used to improve HITS. Several experiments on three websites (Mozilla, CNN and Yahoo! News) show that SMG is useful for partitioning a page into blocks (F1 = 86%), identifying the block type (F1 = 85%), and generating the sitemap for the website (F1 = 63%).
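The final step described in the abstract, scoring blocks with Kleinberg's HITS over block-to-block links, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hits`, the block identifiers in `links`, and the fixed iteration count are assumptions made for illustration, and the entropy-based improvement mentioned in the abstract is left out.

```python
import math

def hits(links, iterations=50):
    """Plain HITS on a directed graph.

    links: dict mapping a block id to the list of block ids it links to
           (hypothetical block-to-block links, not data from the paper).
    Returns (authority, hub) dicts, each L2-normalized.
    """
    nodes = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority of a block: sum of hub scores of blocks linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub of a block: sum of authority scores of the blocks it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return auth, hub

# Hypothetical example: a navigation block that links to most other blocks.
links = {
    "nav": ["news", "sports", "about"],
    "footer": ["about"],
    "news": ["sports"],
}
authority, hub = hits(links)
best_hub = max(hub, key=hub.get)
```

In this toy graph the navigation block receives the highest hub score, which matches the intuition that structure blocks linking to many other blocks are natural sitemap candidates.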
Keywords :
Block extraction, Web mining, Hyperlink analysis, Sitemap, Sequence analysis
Journal title :
Expert Systems with Applications