DocumentCode :
3687214
Title :
Detection of near duplicate web pages using four stage algorithm
Author :
Nirmalrani V; Eldhose P Sim; Arun PR
Author_Institution :
Department of Information Technology, Sathyabama University, Chennai, Tamilnadu, India
fYear :
2015
fDate :
4/1/2015 12:00:00 AM
Firstpage :
644
Lastpage :
648
Abstract :
In recent years the number of web pages has grown massively. Search engines now index billions of pages, which decreases the efficiency and effectiveness of their search results. Some of these pages are exact duplicates or near duplicates of other pages. This paper deals with the detection of near duplicate web pages. Near duplicates arise from replicas of an original site, mirrored sites, versioned sites, multiple representations of the same object, plagiarized documents, etc. Such near duplicate web pages decrease the efficiency of search and return irrelevant results to the user. Several methods exist for finding near duplicate web pages. In this paper, we propose a four stage algorithm for the detection of near duplicate web pages, consisting of pre-processing, minimum weighting, filtering and verification.
Keywords :
"Filtering","Indexes","Uniform resource locators"
Publisher :
ieee
Conference_Title :
Communications and Signal Processing (ICCSP), 2015 International Conference on
Type :
conf
DOI :
10.1109/ICCSP.2015.7322567
Filename :
7322567