• DocumentCode
    3346374
  • Title
    An Improved Algorithm of STC for the Deletion of Duplicated Web pages Based on Repeated Strings
  • Author
    Wang Huijiao ; Yin Bo ; Hou Jie
  • Author_Institution
    Sch. of Comput. & Control of Comput. Sci., Guilin Univ. of Electron. Technol., Guilin, China
  • fYear
    2009
  • fDate
    14-17 Oct. 2009
  • Firstpage
    414
  • Lastpage
    417
  • Abstract
    This paper proposes an improved STC algorithm for deleting duplicated Web pages based on repeated strings. The algorithm first extracts repeated character strings and uses them to mark each phrase when building the suffix tree. The suffix tree is then mapped onto an inverse index so that the STC algorithm can delete duplicates. The algorithm also aims to reduce the errors made by existing deletion algorithms. Experimental results indicate that the improved algorithm achieves better accuracy and has good time and space characteristics.
  • Keywords
    Web sites; document handling; string matching; STC algorithm; duplicated Web page deletion; inverse index; repeated character string extraction; suffix tree; Algorithm design and analysis; Clustering algorithms; Computer science; Data mining; Fingerprint recognition; Genetics; Internet; Paper technology; Search engines; Web pages; deletion of duplicated Web pages; repeated string; the algorithm of STC;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Genetic and Evolutionary Computing, 2009. WGEC '09. 3rd International Conference on
  • Conference_Location
    Guilin
  • Print_ISBN
    978-0-7695-3899-0
  • Type
    conf
  • DOI
    10.1109/WGEC.2009.97
  • Filename
    5402860
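
The abstract outlines a pipeline: extract repeated strings, index them, and use the index to find duplicated pages. The sketch below is only a rough illustration of that idea and not the paper's algorithm. It swaps the suffix tree for plain word n-gram counting and uses a Jaccard-style overlap test in place of the STC step. All names (`repeated_strings`, `build_inverted_index`, `find_duplicates`) and the parameters `n` and `threshold` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def repeated_strings(text, n=3):
    """Return the word n-grams that occur more than once in `text`.

    Assumption: a "repeated string" is approximated by a repeated word
    n-gram; the paper itself works with a suffix tree instead.
    """
    words = text.lower().split()
    counts = defaultdict(int)
    for i in range(len(words) - n + 1):
        counts[tuple(words[i:i + n])] += 1
    return {gram for gram, count in counts.items() if count > 1}


def build_inverted_index(pages, n=3):
    """Map each repeated string to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for gram in repeated_strings(text, n):
            index[gram].add(page_id)
    return index


def find_duplicates(pages, n=3, threshold=0.5):
    """Return sorted page-id pairs whose repeated strings overlap strongly.

    Assumption: two pages count as duplicates when the Jaccard overlap of
    their repeated-string sets is at least `threshold` (an arbitrary value).
    """
    index = build_inverted_index(pages, n)

    # Recover each page's repeated-string set from the index.
    features = defaultdict(set)
    for gram, page_ids in index.items():
        for page_id in page_ids:
            features[page_id].add(gram)

    # Count shared repeated strings only for pages that co-occur in a posting.
    shared = defaultdict(int)
    for page_ids in index.values():
        for a, b in combinations(sorted(page_ids), 2):
            shared[(a, b)] += 1

    duplicates = []
    for (a, b), count in shared.items():
        union = len(features[a] | features[b])
        if count / union >= threshold:
            duplicates.append((a, b))
    return sorted(duplicates)
```

Going through the index means only pages that share at least one repeated string are ever compared, which avoids a full pairwise comparison of all pages.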