Regional Information Center for Science and Technology (RICEST) - A Web Page De-duplication Algorithm Based on Data Clearing

DocumentCode :

2450578

Title :

A Web Page De-duplication Algorithm Based on Data Clearing

Author :

Lin, Jian-ming ; Liu, Dong-sheng ; Gao, Shi-wen ; Chen, Wei

Author_Institution :

Sch. of Bus. Adm., Zhejiang Gongshang Univ., Hangzhou, China

fYear :

2009

fDate :

25-26 April 2009

Firstpage :

544

Lastpage :

547

Abstract :

Duplicated web pages returned by search engines not only waste valuable storage but also add to the burden of users' browsing. Web page de-duplication can effectively improve information retrieval. Following the principle of data clearing, this paper proposes pretreating web pages to improve the effectiveness and efficiency of feature-code-based web page de-duplication. Its distinguishing feature is ranking the feature codes, which reduces the number of comparisons the system makes as well as its space and time complexity. Experiments show that the method is promising for eliminating duplicated web pages at large scale.
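
The abstract does not say how the feature code is built or how it is ranked, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes that pretreatment means stripping markup and normalizing whitespace and case, that a feature code can be approximated by the first few characters of each sentence, and that "ranking" means sorting the codes so that pages with equal codes become adjacent. The function names `clean`, `feature_code`, and `find_duplicates` and the `width` parameter are hypothetical.

```python
import re
from itertools import groupby


def clean(html):
    """Assumed pretreatment: drop script/style blocks and tags, collapse whitespace, lowercase."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def feature_code(text, width=5):
    """Hypothetical feature code: the first `width` characters of each sentence, joined by '|'."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    return "|".join(s[:width] for s in sentences)


def find_duplicates(pages):
    """Group URLs whose cleaned pages share a feature code.

    `pages` maps URL -> raw HTML. Sorting the (code, url) pairs places equal
    codes next to each other, so one linear pass replaces pairwise comparison.
    """
    coded = sorted((feature_code(clean(html)), url) for url, html in pages.items())
    groups = []
    for _, group in groupby(coded, key=lambda pair: pair[0]):
        urls = [url for _, url in group]
        if len(urls) > 1:
            groups.append(urls)
    return groups
```

Sorting costs O(n log n) comparisons for n pages, against O(n²) for comparing every pair, which is one plausible reading of the reduction in comparisons the abstract claims. This sketch only catches pages whose codes match exactly; detecting near-duplicates would need a similarity measure over the codes.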

Keywords :

Internet; Web sites; information retrieval; search engines; Web page deduplication; data clearing; Cleaning; Computer science; Data engineering; Data mining; Educational institutions; Feature extraction; Web pages; data cleaning; feature codes; reshipment statement; web page de-duplication

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Artificial Intelligence, 2009. JCAI '09. International Joint Conference on

Conference_Location :

Hainan Island

Print_ISBN :

978-0-7695-3615-6

Type :

conf

DOI :

10.1109/JCAI.2009.181

Filename :

5159062

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2450578