DocumentCode
2508429
Title
A new method on the detection of near-replicas of web pages
Author
Jia-heng Zheng ; Li-xia Wei ; Hong-ye Tan
Author_Institution
Dept. of Comput. & Inf. Technol., Shanxi Univ., Taiyuan
fYear
2008
fDate
8-11 July 2008
Firstpage
473
Lastpage
478
Abstract
Near-replicas of web pages seriously reduce the efficiency of search engines (SEs). In this paper, we present a new method to detect near-replicas of web pages. First, the styles of text structures in web pages are analyzed and classified; then, according to the style of the text, different methods are used to obtain the text structure, which is represented as a matrix; finally, the similarity is calculated by dynamically extracting features from the matrix. Experiments show that this method not only improves computing efficiency but also ensures high precision and recall.
Keywords
Internet; classification; text analysis; Web pages near-replicas; text structure analysis; text structure classification; Blogs; Data mining; Feature extraction; HTML; Indexing; Information analysis; Information technology; Navigation; Search engines; Web pages;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on
Conference_Location
Sydney, NSW
Print_ISBN
978-1-4244-2357-6
Electronic_ISBN
978-1-4244-2358-3
Type
conf
DOI
10.1109/CIT.2008.4594721
Filename
4594721