Title :
A priority-based method of near-duplicated text information of web pages deletion
Author :
Ling, Yun ; Tao, Xiaobo ; Lv, Hexin
Author_Institution :
Coll. of Comput. Sci. & Inf. Eng., Zhejiang Gongshang Univ., Hangzhou, China
Abstract :
Duplicate web pages returned by search engines not only waste storage resources but also increase the burden on web users. To address the near-duplication seen on professional web pages in fields such as employment, a new priority-based method is proposed that detects and deletes near-duplicated web pages based on their text information. The method implements an algorithm that extracts the text information of web pages using the DOM tree, together with a priority-based algorithm that detects near-duplicated text information, so as to reduce web page noise and improve the efficiency of near-duplicate detection. The experimental results indicate that completely and partly duplicated web pages are detected accurately.
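
The two stages named in the abstract (text extraction, then near-duplicate detection on the extracted text) can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify the priority scheme, so that scheme is not reproduced here. As assumptions made only for illustration, Python's event-based `html.parser` stands in for a DOM-tree walk, and word-shingle Jaccard similarity stands in for the detection step. The names `extract_text`, `shingles`, `jaccard`, `is_near_duplicate`, and the 0.8 threshold are all hypothetical.

```python
from html.parser import HTMLParser

# Tags whose content is treated as noise (an assumption, not from the paper).
SKIP_TAGS = {"script", "style", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping the content of noise tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped tags.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(html_a, html_b, threshold=0.8):
    """Compare two pages on extracted text only (threshold is an assumption)."""
    sim = jaccard(shingles(extract_text(html_a)), shingles(extract_text(html_b)))
    return sim >= threshold
```

Because the comparison runs on extracted text rather than raw HTML, two pages that carry the same job posting under different markup, scripts, or styles compare as duplicates in this sketch.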
Keywords :
Internet; text analysis; Web page deletion; near-duplicated text information; priority-based method; Algorithm design and analysis; Containers; Data mining; Employment; HTML; Noise; Web pages; DOM tree; detect and delete near-duplicated web pages; information extraction; search engine;
Conference_Title :
Software Engineering and Service Sciences (ICSESS), 2010 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6054-0
DOI :
10.1109/ICSESS.2010.5552319