Regional Information Center for Science and Technology - Web document duplicate removal algorithm based on keyword sequences

DocumentCode :

3317979

Title :

Web document duplicate removal algorithm based on keyword sequences

Author :

Li, Wei ; Liu, Jian-Yi ; Wang, Cong

Author_Institution :

CISTR, Beijing Univ. of Posts & Telecommun., China

fYear :

2005

fDate :

30 Oct.-1 Nov. 2005

Firstpage :

511

Lastpage :

516

Abstract :

There are many identical documents across the Web. Effective duplicate removal has become one of the most important techniques for improving search engines. In this paper, we take both syntax information and semantic information into account and put forward a Web document duplicate removal algorithm based on keyword sequences, called KSM (keyword sequences method). The main intuition behind KSM is as follows. The keyword sequences of a Web document can be used to depict its structure feature (syntax) and intension feature (semantics). By comparing the keyword sequences of similar documents, we can judge whether there is information redundancy. The experimental results show that KSM can greatly reduce the probability of mistaking similar documents for identical documents while remarkably improving resistance to document noise.
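The abstract gives only the intuition behind KSM, not its procedure, so the following is a minimal illustrative sketch of the general idea of comparing keyword sequences, not the authors' algorithm. Everything in it is an assumption made for illustration: the function names (`keyword_sequence`, `sequence_similarity`, `is_duplicate`), the tiny stopword list, choosing keywords by raw term frequency, the `top_k` and `threshold` parameters, and the use of Python's `difflib.SequenceMatcher` as the sequence comparison.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical stopword list, far smaller than a real one.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"})


def keyword_sequence(text, top_k=5):
    """Return the document's keywords in their order of occurrence.

    Assumption: a "keyword" is one of the top_k most frequent
    non-stopword tokens. The paper may select keywords differently.
    """
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOPWORDS]
    keywords = {word for word, _ in Counter(tokens).most_common(top_k)}
    # Keeping every occurrence, in order, preserves a rough trace of
    # both what the document is about and how it is laid out.
    return [t for t in tokens if t in keywords]


def sequence_similarity(text_a, text_b, top_k=5):
    """Similarity in [0, 1] between the two documents' keyword sequences."""
    seq_a = keyword_sequence(text_a, top_k)
    seq_b = keyword_sequence(text_b, top_k)
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def is_duplicate(text_a, text_b, threshold=0.9, top_k=5):
    """Flag two documents as duplicates when their sequences nearly match.

    The 0.9 threshold is an arbitrary illustrative value.
    """
    return sequence_similarity(text_a, text_b, top_k) >= threshold
```

In this sketch, low-frequency page noise such as navigation links or a copyright footer rarely enters the keyword set, so it leaves the keyword sequence unchanged. That is one plausible reading of the noise resistance the abstract mentions, under the assumptions stated above.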

Keywords :

Internet; document handling; information filtering; Web document duplicate removal algorithm; keyword sequence method; search engines; semantic information; syntax information; Fingerprint recognition; HTML; Internet; Mirrors; Noise reduction; Search engines; Uniform resource locators; Web pages; Web search; World Wide Web; Duplicate Removal; Keyword Sequences; Semantic Information; Syntax Information;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Natural Language Processing and Knowledge Engineering, 2005. IEEE NLP-KE '05. Proceedings of 2005 IEEE International Conference on

Print_ISBN :

0-7803-9361-9

Type :

conf

DOI :

10.1109/NLPKE.2005.1598791

Filename :

1598791

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3317979