DocumentCode :
2450578
Title :
A Web Page De-duplication Algorithm Based on Data Clearing
Author :
Lin, Jian-ming ; Liu, Dong-sheng ; Gao, Shi-wen ; Chen, Wei
Author_Institution :
Sch. of Bus. Adm., Zhejiang Gongshang Univ., Hangzhou, China
fYear :
2009
fDate :
25-26 April 2009
Firstpage :
544
Lastpage :
547
Abstract :
Duplicate web pages returned by search engines not only waste valuable storage but also add to the burden on users when browsing. Web page de-duplication can effectively improve information retrieval. This paper proposes preprocessing web pages, following the principle of data clearing, to improve the effectiveness and efficiency of feature-code-based web page de-duplication. Its distinguishing feature is that the feature codes are ranked, which reduces the number of comparisons the system performs as well as its space and time complexity. Experiments show that the method is promising for eliminating duplicate web pages at large scale.
Keywords :
Internet; Web sites; information retrieval; search engines; Web page deduplication; data clearing; Cleaning; Computer science; Data engineering; Data mining; Educational institutions; Feature extraction; Web pages; data cleaning; feature codes; reshipment statement; web page de-duplication;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Artificial Intelligence, 2009. JCAI '09. International Joint Conference on
Conference_Location :
Hainan Island
Print_ISBN :
978-0-7695-3615-6
Type :
conf
DOI :
10.1109/JCAI.2009.181
Filename :
5159062