Regional Information Center for Science and Technology - Detection of near duplicate web pages using four stage algorithm

DocumentCode :

3687214

Title :

Detection of near duplicate web pages using four stage algorithm

Author :

Nirmalrani V; Eldhose P Sim; Arun PR

Author_Institution :

Department of Information Technology, Sathyabama University, Chennai, Tamilnadu, India

fYear :

2015

fDate :

4/1/2015 12:00:00 AM

Firstpage :

644

Lastpage :

648

Abstract :

In recent years the number of web pages has grown massively; search engines now index billions of pages, which reduces the efficiency and effectiveness of their search results. Some of these pages are duplicates or near duplicates of other pages. This paper deals with the detection of near duplicate web pages. Near duplicates arise from replicas of an original site, mirrored sites, versioned sites, multiple representations of the same object, plagiarized documents, and similar sources. Such pages decrease the efficiency of search and return irrelevant results to the user. Several methods exist for finding near duplicate web pages. We propose a four-stage algorithm for detecting them, consisting of pre-processing, minimum weighting, filtering, and verification.

Keywords :

"Filtering","Indexes","Uniform resource locators"

Publisher :

IEEE

Conference_Title :

Communications and Signal Processing (ICCSP), 2015 International Conference on

Type :

conf

DOI :

10.1109/ICCSP.2015.7322567

Filename :

7322567

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3687214