• DocumentCode
    2508429
  • Title
    A new method on the detection of near-replicas of web pages
  • Author
    Jia-heng Zheng ; Li-xia Wei ; Hong-ye Tan
  • Author_Institution
    Dept. of Comput. & Inf. Technol., Shanxi Univ., Taiyuan
  • fYear
    2008
  • fDate
    8-11 July 2008
  • Firstpage
    473
  • Lastpage
    478
  • Abstract
    Near-replicas of web pages seriously reduce the efficiency of search engines (SEs). In this paper, we present a new method for detecting near-replicas of web pages. First, the styles of text structure in web pages are analyzed and classified. Then, depending on the style of the text, different methods are used to extract the text structure, which is represented as a matrix. Finally, the similarity is calculated by dynamically extracting features from the matrix. Experiments show that this method not only improves computing efficiency but also ensures high precision and recall.
  • Keywords
    Internet; classification; text analysis; Web pages near-replicas; text structure analysis; text structure classification; Blogs; Data mining; Feature extraction; HTML; Indexing; Information analysis; Information technology; Navigation; Search engines; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    978-1-4244-2357-6
  • Electronic_ISBN
    978-1-4244-2358-3
  • Type
    conf
  • DOI
    10.1109/CIT.2008.4594721
  • Filename
    4594721