Title :
Analysis of Duplicated Web Pages Identification Methods in Search Engine
Author :
Duan, Fei ; Zheng, Yan
Author_Institution :
Sch. of Comput. Sci., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
The identification of duplicated web pages is one of the processing steps in a search engine, and how well it works affects the search engine's performance. This article studies and summarizes the basic processing steps and key technologies of duplicated web page identification in search engines. On the basis of some experiments, we analyze and compare the performance of several basic algorithms, then summarize their advantages and disadvantages. Finally, we propose using distributed computing, such as Hadoop, to identify duplicated web pages more efficiently when processing massive amounts of Internet information in a search engine.
Keywords :
Internet; distributed processing; search engines; Hadoop distributed computing; duplicated Web page identification method; search engine; Arrays; Electronic mail; Feature extraction; Presses; Web pages
Conference_Title :
Database Technology and Applications (DBTA), 2010 2nd International Workshop on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-6975-8
Electronic_ISBN :
978-1-4244-6977-2
DOI :
10.1109/DBTA.2010.5659105