Title :
Web document duplicate removal algorithm based on keyword sequences
Author :
Li, Wei ; Liu, Jian-Yi ; Wang, Cong
Author_Institution :
CISTR, Beijing Univ. of Posts & Telecommun., China
Date :
30 Oct.-1 Nov. 2005
Abstract :
Many identical documents exist across the Web, and effective duplicate removal has become one of the most important techniques for improving search engines. In this paper, we take both syntactic and semantic information into account and propose a Web document duplicate removal algorithm based on keyword sequences, called KSM (keyword sequences method). The main intuition behind KSM is as follows: the keyword sequences of a Web document can be used to depict its structural features (syntax) and its intension features (semantics). By comparing the keyword sequences of similar documents, we can judge whether they contain redundant information. The experimental results show that KSM greatly reduces the probability of mistaking similar documents for identical ones while remarkably improving resistance to document noise.
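The abstract gives only the intuition behind KSM, not its details, so the following is a minimal illustrative sketch rather than the authors' algorithm. It assumes that keywords are the most frequent non-stopword tokens, that a keyword sequence is those keywords in their order of occurrence, and that two sequences are compared with a generic sequence-similarity ratio. The names `keyword_sequence`, `sequence_similarity`, `is_duplicate`, the stopword list, `top_k`, and `threshold` are all hypothetical.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical stopword list; the abstract does not say how keywords are chosen.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "by"}


def keyword_sequence(text, top_k=5):
    """Return occurrences of the top_k most frequent non-stopword tokens,
    kept in the order in which they appear in the text (assumed definition)."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOPWORDS]
    keywords = {word for word, _ in Counter(tokens).most_common(top_k)}
    return [t for t in tokens if t in keywords]


def sequence_similarity(seq_a, seq_b):
    """Similarity in [0, 1] between two keyword sequences.
    Uses difflib's ratio as a stand-in for whatever comparison KSM uses."""
    if not seq_a or not seq_b:
        # Treat a document with no keywords as matching nothing.
        return 0.0
    return SequenceMatcher(None, seq_a, seq_b).ratio()


def is_duplicate(text_a, text_b, threshold=0.9, top_k=5):
    """Flag two documents as duplicates when their keyword sequences
    are at least `threshold` similar (threshold is an assumed value)."""
    similarity = sequence_similarity(keyword_sequence(text_a, top_k),
                                     keyword_sequence(text_b, top_k))
    return similarity >= threshold
```

Because only keyword occurrences are compared, text that contributes no keywords (for example, navigation boilerplate made of rare or stopword tokens) leaves the sequence unchanged, which is one plausible reading of the noise resistance the abstract claims.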
Keywords :
Internet; document handling; information filtering; Web document duplicate removal algorithm; keyword sequence method; search engines; semantic information; syntax information; Fingerprint recognition; HTML; Mirrors; Noise reduction; Uniform resource locators; Web pages; Web search; World Wide Web; Duplicate Removal; Keyword Sequences
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2005. IEEE NLP-KE '05. Proceedings of 2005 IEEE International Conference on
Print_ISBN :
0-7803-9361-9
DOI :
10.1109/NLPKE.2005.1598791