• DocumentCode
    3108745
  • Title
    A Direct Web Page Templates Detection Method
  • Author
    Xie Su-bin ; Liang Bin ; Shi Wen-chang ; Liang Zhao-hui ; Yu Xiu-mei ; Zhang Lei
  • Author_Institution
    Sch. of Inf., Renmin Univ. of China, Beijing, China
  • fYear
    2011
  • fDate
    16-18 Aug. 2011
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    Currently, a large number of web sites are generated from web templates to improve the productivity of web site construction. However, the prevalence of web templates has a negative impact on the efficiency of search engines in many respects, including the relevance judgment of web IR and the resource usage of analysis tools. In this paper, we present a direct and fast method to detect pages generated from the same template using DOM tree characteristics. After analyzing and compressing the DOM tree nodes of an HTML page, our method generates a hash value digest, also called a fingerprint, for each page to identify its DOM structure. In addition, we introduce some other page features to aid in judging the page template type. Through experimental evaluations over thirty thousand sub-domains, we show that our approach obtains analysis results rapidly while keeping an accuracy rate above 95 percent.
  • Keywords
    Web sites; hypermedia markup languages; search engines; DOM tree characteristics; HTML page; direct web page templates detection method; hash value; search engine; web sites; Accuracy; Compression algorithms; Fingerprint recognition; HTML; Search engines; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Internet Technology and Applications (iTAP), 2011 International Conference on
  • Conference_Location
    Wuhan
  • Print_ISBN
    978-1-4244-7253-6
  • Type
    conf
  • DOI
    10.1109/ITAP.2011.6006435
  • Filename
    6006435
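
The abstract describes reducing a page's DOM tree to a compressed form and hashing it into a fingerprint, so that pages built from the same template share a digest. The sketch below illustrates that general idea only; it is not the authors' algorithm. The abstract does not specify the compression rule, the hash function, or the additional page features, so everything here is an assumption made for illustration: the names `TagPathCollector`, `compress`, and `dom_fingerprint` are hypothetical, the compression rule (collapsing runs of identical adjacent elements at the same depth) is a guess, and SHA-256 is an arbitrary choice.

```python
import hashlib
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records one 'depth:tag' token per element; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"{self.depth}:{tag}")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def compress(tokens):
    """Collapse runs of identical adjacent tokens (assumed compression rule)."""
    out = []
    for token in tokens:
        if not out or out[-1] != token:
            out.append(token)
    return out


def dom_fingerprint(html):
    """Return a hex digest that depends only on the compressed tag structure."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    skeleton = "/".join(compress(collector.tokens))
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()


# Two pages with the same layout but different text and list lengths,
# and a third page with a different layout.
page_a = "<html><body><div><h1>News</h1><ul><li>a</li><li>b</li></ul></div></body></html>"
page_b = "<html><body><div><h1>Sport</h1><ul><li>x</li><li>y</li><li>z</li></ul></div></body></html>"
page_c = "<html><body><table><tr><td>1</td></tr></table></body></html>"

print(dom_fingerprint(page_a) == dom_fingerprint(page_b))
print(dom_fingerprint(page_a) == dom_fingerprint(page_c))
```

Under this assumed rule, list items that themselves contain nested elements would not be collapsed, so a practical detector would likely need a stronger compression step and the extra page features the abstract mentions.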