Title :
Detection of near duplicate web pages using four stage algorithm
Author :
Nirmalrani V; Eldhose P Sim; Arun PR
Author_Institution :
Department of Information Technology, Sathyabama University, Chennai, Tamilnadu, India
fDate :
4/1/2015 12:00:00 AM
Abstract :
In recent years the number of web pages has grown massively: search engines now index billions of pages, which reduces the efficiency and effectiveness of their search results. Indexed pages may be exact duplicates or near duplicates of other pages. This paper deals with the detection of near-duplicate web pages. Near duplicates arise from replicas of an original site, mirrored sites, versioned sites, multiple representations of the same object, plagiarized documents, etc. Such near-duplicate pages decrease the efficiency of search and return irrelevant results to the user. Several methods exist for finding near-duplicate web pages. In this paper, we propose a four-stage algorithm for the detection of near-duplicate web pages; its stages are pre-processing, minimum weighting, filtering, and verification.
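The abstract names the four stages but does not define them, so the following is only a minimal Python sketch of how such a pipeline could be arranged, not the authors' method. Everything specific in it is an assumption made for illustration: the stopword list, relative term frequency as the weight, the `min_weight` cutoff, the size-ratio filter, Jaccard similarity for verification, and all function names and thresholds.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"})


def preprocess(html):
    """Stage 1 (pre-processing): strip tags, lowercase, tokenize, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", html)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def weight(tokens, min_weight=0.05):
    """Stage 2 (assumed reading of "minimum weighting"): keep only terms whose
    relative frequency reaches min_weight."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: c / total for t, c in counts.items() if c / total >= min_weight}


def passes_filter(a, b, threshold):
    """Stage 3 (assumed filter): cheap size check. The Jaccard similarity of two
    sets can never exceed min(|A|, |B|) / max(|A|, |B|), so pairs below the
    threshold on that ratio are skipped without comparing their contents."""
    if not a or not b:
        return False
    small, large = sorted((len(a), len(b)))
    return small / large >= threshold


def jaccard(a, b):
    """Stage 4 helper: exact Jaccard similarity of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(pages, threshold=0.8, min_weight=0.05):
    """Run all four stages over a {url: html} mapping and return the
    (url, url, score) triples whose similarity reaches the threshold."""
    terms = {url: set(weight(preprocess(html), min_weight))
             for url, html in pages.items()}
    result = []
    for u, v in combinations(sorted(terms), 2):
        if not passes_filter(terms[u], terms[v], threshold):
            continue  # pruned by the filtering stage
        score = jaccard(terms[u], terms[v])  # verification stage
        if score >= threshold:
            result.append((u, v, score))
    return result
```

In this sketch the filtering stage exists only to avoid the exact comparison for pairs that cannot reach the threshold; a real system would more likely use an index over terms than a loop over all pairs.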
Keywords :
"Filtering","Indexes","Uniform resource locators"
Conference_Title :
Communications and Signal Processing (ICCSP), 2015 International Conference on
DOI :
10.1109/ICCSP.2015.7322567