Title :
Near-duplicate web page detection: A comparative study of two contrary approaches
Author :
Narayana, V.A. ; Govardhan, A. ; Premchand, P.
Author_Institution :
Dept. of CSE, CMR Coll. of Eng. & Tech, Hyderabad, India
fDate :
Nov. 29 2011-Dec. 1 2011
Abstract :
Detection of duplicate and near-duplicate web pages has attracted extensive research in the web crawling community. A considerable number of significant approaches to near-duplicate detection are available in the literature, but none has been accepted as a universal solution. The fingerprint-based approach proposed by G.S. Manku et al. in 2007 was considered one of the "state-of-the-art" algorithms for finding near-duplicate web pages. In our earlier work, we devised an efficient similarity-score-based approach for near-duplicate web page detection. Experiments on that approach showed that it achieves detection accuracy almost identical to that of G.S. Manku et al.'s fingerprint-based approach. Hence, in this paper, we conduct an extensive comparative study of our similarity-score-based approach and G.S. Manku et al.'s fingerprint-based approach in terms of two computational factors: 1) time and 2) storage space. Because the detection performance of the two approaches is taken to be essentially the same, we use these complexity measures, time and memory space, to determine the better of the two. The comparison identifies the better (less complex) of the two approaches with respect to the factors considered.
Keywords :
Web sites; fingerprint identification; search engines; storage management; Web crawling research community; Web search engine; complexity measures; detection accuracy; fingerprint based approach; memory space; near-duplicate Web page detection; similarity score; storage space; Fingerprint; Near-duplicate; Similarity score; Storage space; Time; Web crawling; permutation;
Conference_Titel :
Computer Sciences and Convergence Information Technology (ICCIT), 2011 6th International Conference on
Conference_Location :
Seogwipo
Print_ISBN :
978-1-4577-0472-7