DocumentCode :
3455041
Title :
Analysis of Duplicated Web Pages Identification Methods in Search Engine
Author :
Duan, Fei ; Zheng, Yan
Author_Institution :
Sch. of Comput. Sci., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear :
2010
fDate :
27-28 Nov. 2010
Firstpage :
1
Lastpage :
5
Abstract :
Identifying duplicated web pages is one of the processing steps in a search engine, and how well it is done affects the search engine's performance. This article studies and summarizes the basic processing steps and key technologies of duplicated web page identification in search engines. On the basis of some experiments, we analyze and compare the performance of some basic algorithms, then summarize their advantages and disadvantages. Finally, we propose using distributed computing, such as Hadoop, to identify duplicated web pages more efficiently when processing massive amounts of Internet information in a search engine.
Keywords :
Internet; distributed processing; search engines; Hadoop distributed computing; duplicated Web page identification method; search engine; Arrays; Electronic mail; Feature extraction; Internet; Presses; Search engines; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Database Technology and Applications (DBTA), 2010 2nd International Workshop on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-6975-8
Electronic_ISBN :
978-1-4244-6977-2
Type :
conf
DOI :
10.1109/DBTA.2010.5659105
Filename :
5659105