DocumentCode :
1635814
Title :
A priority-based method of near-duplicated text information of web pages deletion
Author :
Ling, Yun ; Tao, Xiaobo ; Lv, Hexin
Author_Institution :
Coll. of Comput. Sci. & Inf. Eng., Zhejiang Gongshang Univ., Hangzhou, China
fYear :
2010
Firstpage :
495
Lastpage :
499
Abstract :
Duplicated web pages returned by search engines not only waste storage resources but also increase the burden on web users. Motivated by the near-duplication phenomenon in professional web pages, such as those in the field of employment, a new priority-based method for detecting and deleting near-duplicated web pages based on text information is proposed. The method implements an algorithm that extracts the text information of web pages via the DOM tree and a priority-based algorithm that detects near-duplicated text information, so as to reduce the noise of web pages and improve the efficiency of detecting near-duplicated text information. The experimental results indicate that both completely and partly duplicated web pages are detected accurately.
Keywords :
Internet; text analysis; Web page deletion; near-duplicated text information; priority-based method; Algorithm design and analysis; Containers; Data mining; Employment; HTML; Noise; Web pages; DOM tree; detect and delete near-duplicated web pages; information extraction; search engine;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Software Engineering and Service Sciences (ICSESS), 2010 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6054-0
Type :
conf
DOI :
10.1109/ICSESS.2010.5552319
Filename :
5552319