Regional Information Center for Science and Technology (RICEST) - Analysis of Duplicated Web Pages Identification Methods in Search Engine

DocumentCode :

3455041

Title :

Analysis of Duplicated Web Pages Identification Methods in Search Engine

Author :

Duan, Fei ; Zheng, Yan

Author_Institution :

Sch. of Comput. Sci., Beijing Univ. of Posts & Telecommun., Beijing, China

fYear :

2010

fDate :

27-28 Nov. 2010

Firstpage :

Lastpage :

Abstract :

Identifying duplicated web pages is one of the processing steps in a search engine, and how well it is done affects the search engine's performance. This article studies and summarizes the basic processing steps and key technologies of duplicated web page identification in search engines. On the basis of some experiments, we analyze and compare the performance of several basic algorithms, then summarize their advantages and disadvantages. Finally, we propose using distributed computing, such as Hadoop, to identify duplicated web pages so that the massive volume of Internet information handled by a search engine can be processed more efficiently.

Keywords :

Internet; distributed processing; search engines; Hadoop distributed computing; duplicated Web page identification method; search engine; Arrays; Electronic mail; Feature extraction; Presses; Web pages

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Database Technology and Applications (DBTA), 2010 2nd International Workshop on

Conference_Location :

Wuhan

Print_ISBN :

978-1-4244-6975-8

Electronic_ISBN :

978-1-4244-6977-2

Type :

conf

DOI :

10.1109/DBTA.2010.5659105

Filename :

5659105

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3455041