Title :
Detection and optimized disposal of near-duplicate pages
Author :
Qiu, Junping ; Zeng, Qian
Author_Institution :
Coll. of Inf. Manage., Wuhan Univ., Wuhan, China
Abstract :
Search engines are an important tool for users to access network information resources. However, the large number of duplicate and near-duplicate pages adds to users' burden. Currently, search engines remove only duplicate pages and do not yet have effective strategies for detecting and disposing of near-duplicate pages. This paper analyzes existing algorithms to select an appropriate one for detecting near-duplicate pages, and optimizes the disposal strategy so that near-duplicate pages do not take up too much space in search results while still being used effectively. This allows users to retrieve the information they need more easily.
Keywords :
search engines; near-duplicate pages detection; near-duplicate pages disposal; search engine; Algorithm design and analysis; Clustering algorithms; Educational institutions; Frequency; Information management; Information resources; Information retrieval; Uniform resource locators; Web pages; Duplicate Detection; Near-Duplicate; Ranking algorithm
Conference_Title :
Future Computer and Communication (ICFCC), 2010 2nd International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-5821-9
DOI :
10.1109/ICFCC.2010.5497544