• DocumentCode
    441630
  • Title
    A Novel Content and Style Based Measurement of Web Pages Distance
  • Author
    Zhang, Q.P. ; Liang, M. ; Lai, L.L.
  • Author_Institution
    Dept. of Computer Science and Engineering, Fudan University, Shanghai 200433, China; E-MAIL: qpzhang@fudan.edu.cn
  • Volume
    1
  • fYear
    2005
  • fDate
    18-21 Aug. 2005
  • Firstpage
    429
  • Lastpage
    435
  • Abstract
    Nowadays, many web-based systems use machine learning techniques to design more intelligent mechanisms for organizing, indexing, and retrieving web content, and a growing number of research efforts and applications need to calculate the distance between web pages in a rational way. Commonly proposed methods are suited to extracting the differences between the HTML documents of web pages, but their results cannot be used to tell the actual distance between the content of web pages or between their appearance as displayed in web browsers. Based on these observations, this paper proposes content distance, style distance, and hybrid distance to make the measurement results more practical. The efficiency is demonstrated through some classical experiments.
  • Keywords
    Web mining; Web page; cluster; distance function; Computer science; Content based retrieval; Distance measurement; HTML; Internet; Machine learning; Markup languages; Multimedia databases; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
  • Conference_Location
    Guangzhou, China
  • Print_ISBN
    0-7803-9091-1
  • Type
    conf
  • DOI
    10.1109/ICMLC.2005.1526985
  • Filename
    1526985