Title :
A Web Page De-duplication Algorithm Based on Data Clearing
Author :
Lin, Jian-ming ; Liu, Dong-sheng ; Gao, Shi-wen ; Chen, Wei
Author_Institution :
Sch. of Bus. Adm., Zhejiang Gongshang Univ., Hangzhou, China
Abstract :
Duplicated web pages returned by search engines not only waste valuable storage but also add to the burden of users' browsing. Web page de-duplication can effectively improve information retrieval. Following the principle of data cleaning, this paper proposes pretreating web pages to improve the effectiveness and efficiency of feature-code-based web page de-duplication. The method also ranks the feature codes to reduce the number of comparisons the system performs, along with its space and time complexity. Experiments show that the method is promising for eliminating duplicated web pages at large scale.
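The abstract does not spell out the pretreatment or the feature code, so the following is only a minimal sketch of the general idea under stated assumptions: pretreatment is taken to mean stripping tags and normalizing whitespace and case, the feature code is taken to be an MD5 digest of the cleaned text, and "ranking" is read as sorting the codes so that equal codes become adjacent. The function names `pretreat`, `feature_code`, and `deduplicate` are hypothetical and do not come from the paper.

```python
import hashlib
import re
from itertools import groupby


def pretreat(html: str) -> str:
    """Assumed pretreatment: drop tags, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip().lower()


def feature_code(text: str) -> str:
    """Hypothetical feature code: an MD5 digest of the cleaned text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose pages share a feature code.

    Sorting the (code, url) pairs places equal codes next to each other,
    so a single linear pass finds duplicates instead of comparing every
    pair of pages.
    """
    coded = sorted((feature_code(pretreat(html)), url) for url, html in pages.items())
    return [[url for _, url in group] for _, group in groupby(coded, key=lambda p: p[0])]


if __name__ == "__main__":
    sample = {
        "a": "<p>Hello  World</p>",
        "b": "<div>hello world</div>",
        "c": "<p>Other</p>",
    }
    for group in deduplicate(sample):
        print(group)
```

With sorting, finding duplicates among n pages costs O(n log n) for the sort plus one linear scan, rather than the O(n^2) of pairwise comparison. An exact digest only catches pages that are identical after cleaning; detecting near-duplicates would need a different feature code than the one assumed here.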
Keywords :
Internet; Web sites; information retrieval; search engines; Web page deduplication; data clearing; Cleaning; Computer science; Data engineering; Data mining; Educational institutions; Feature extraction; Web pages; data cleaning; feature codes; reshipment statement; web page de-duplication
Conference_Title :
Artificial Intelligence, 2009. JCAI '09. International Joint Conference on
Conference_Location :
Hainan Island
Print_ISBN :
978-0-7695-3615-6
DOI :
10.1109/JCAI.2009.181