DocumentCode
3346374
Title
An Improved Algorithm of STC for the Deletion of Duplicated Web pages Based on Repeated Strings
Author
Wang Huijiao ; Yin Bo ; Hou Jie
Author_Institution
Sch. of Comput. & Control of Comput. Sci., Guilin Univ. of Electron. Technol., Guilin, China
fYear
2009
fDate
14-17 Oct. 2009
Firstpage
414
Lastpage
417
Abstract
This paper proposes an improved STC algorithm for deleting duplicated Web pages based on repeated strings. The algorithm first extracts repeated character strings and uses them as the mark of each phrase when building the suffix tree. The suffix tree is then mapped onto an inverse index so that the STC algorithm can detect and delete duplicates. The algorithm also aims to reduce the errors made by existing deletion algorithms. Experimental results indicate that the improved algorithm achieves better accuracy and good time and space characteristics.
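The pipeline outlined in the abstract (extract repeated strings, index them, compare pages) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it substitutes word n-grams for suffix-tree phrases, treats a string as "repeated" when it occurs in at least two pages, and the names `shingles`, `build_inverse_index`, `find_duplicates`, and the `threshold` value are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams in text (a stand-in for suffix-tree phrases)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_inverse_index(pages, n=3):
    """Map each string found in at least two pages to the ids of the pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for s in shingles(text, n):
            index[s].add(page_id)
    # Assumption: only strings repeated across pages are kept as marks.
    return {s: ids for s, ids in index.items() if len(ids) > 1}


def find_duplicates(pages, n=3, threshold=0.8):
    """Return page-id pairs whose shared repeated strings cover
    at least `threshold` of the smaller page's strings."""
    index = build_inverse_index(pages, n)
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    sizes = {pid: len(shingles(text, n)) for pid, text in pages.items()}
    result = []
    for (a, b), count in shared.items():
        smaller = min(sizes[a], sizes[b])
        if smaller and count / smaller >= threshold:
            result.append((a, b))
    return sorted(result)


pages = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog today",
    "c": "completely different content about suffix trees and clustering",
}
duplicates = find_duplicates(pages)
```

Here pages "a" and "b" share every trigram of the shorter page, so they are reported as a duplicate pair, while "c" shares none. A real suffix tree would find repeated strings of arbitrary length instead of fixed-size n-grams.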
Keywords
Web sites; document handling; string matching; STC algorithm; duplicated Web page deletion; inverse index; repeated character string extraction; suffix tree; Algorithm design and analysis; Clustering algorithms; Computer science; Data mining; Fingerprint recognition; Genetics; Internet; Paper technology; Search engines; Web pages; deletion of duplicated Web pages; repeated string; the algorithm of STC;
fLanguage
English
Publisher
IEEE
Conference_Titel
Genetic and Evolutionary Computing, 2009. WGEC '09. 3rd International Conference on
Conference_Location
Guilin
Print_ISBN
978-0-7695-3899-0
Type
conf
DOI
10.1109/WGEC.2009.97
Filename
5402860