Regional Information Center for Science and Technology - A priority-based method of near-duplicated text information of web pages deletion

DocumentCode :

1635814

Title :

A priority-based method of near-duplicated text information of web pages deletion

Author :

Ling, Yun ; Tao, Xiaobo ; Lv, Hexin

Author_Institution :

Coll. of Comput. Sci. & Inf. Eng., Zhejiang Gongshang Univ., Hangzhou, China

fYear :

2010

Firstpage :

495

Lastpage :

499

Abstract :

Duplicated web pages returned by search engines not only waste storage resources but also increase the burden on web users. Motivated by the near-duplication observed among professional web pages, such as those in the field of employment, a new priority-based method for detecting and deleting near-duplicated web pages on the basis of their text information is proposed. The method implements an algorithm that extracts the text information of web pages via the DOM tree and a priority-based algorithm that detects near-duplicated text information, so as to reduce the noise of web pages and improve the efficiency of detecting near-duplicated text information. The experimental results indicate that completely and partly duplicated web pages are detected accurately.
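
The two stages named in the abstract (text extraction, then near-duplicate detection) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the priority-based scheme, so word-shingle Jaccard similarity is swapped in as a generic stand-in, and Python's streaming html.parser replaces an explicit DOM tree. All names here (TextExtractor, extract_text, shingles, jaccard, similarity) and the shingle size k=3 are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content.

    Assumption: a streaming parser stands in for the DOM-tree
    extraction mentioned in the abstract.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (case-insensitive)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(html_a, html_b, k=3):
    """Shingle-based similarity of the visible text of two pages.

    Assumption: this generic measure replaces the paper's
    priority-based detection, which the abstract does not describe.
    """
    return jaccard(shingles(extract_text(html_a), k),
                   shingles(extract_text(html_b), k))
```

Two pages whose markup differs but whose visible text is identical score 1.0 under this measure, while pages that share no three-word sequence score 0.0. A threshold between those extremes would have to be chosen to flag partly duplicated pages; the abstract gives no value for one.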

Keywords :

Internet; text analysis; Web page deletion; near-duplicated text information; priority-based method; Algorithm design and analysis; Containers; Data mining; Employment; HTML; Noise; Web pages; DOM tree; detect and delete near-duplicated web pages; information extraction; search engine;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Software Engineering and Service Sciences (ICSESS), 2010 IEEE International Conference on

Conference_Location :

Beijing

Print_ISBN :

978-1-4244-6054-0

Type :

conf

DOI :

10.1109/ICSESS.2010.5552319

Filename :

5552319

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1635814