A new method on the detection of near-replicas of web pages

Author

Jia-heng Zheng ; Li-xia Wei ; Hong-ye Tan

Author_Institution

Dept. of Comput. & Inf. Technol., Shanxi Univ., Taiyuan

fYear

2008

fDate

8-11 July 2008

Firstpage

473

Lastpage

478

Abstract

Near-replicas of web pages seriously reduce the efficiency of search engines (SEs). In this paper, we present a new method for detecting near-replicas of web pages. First, the styles of text structure in web pages are analyzed and classified. Then, according to the style of the text, different methods are used to obtain the text structure, which is represented as a matrix. Finally, similarity is calculated by dynamically extracting features from the matrix. Experiments show that this method not only improves computing efficiency but also ensures high precision and recall.
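
The abstract does not say how the matrix is built or which features are extracted, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes, hypothetically, that each matrix row is a paragraph and each entry is a sentence length in characters, that the features are the k longest sentences together with their positions, and that similarity is the Jaccard overlap of two feature sets. The names `structure_matrix`, `features`, `similarity`, and the parameter `k` are invented for this example.

```python
import re


def structure_matrix(text):
    """Assumed representation: one row per paragraph, one entry per sentence length."""
    rows = []
    for para in re.split(r"\n\s*\n", text.strip()):
        # Split on sentence-ending punctuation and drop empty fragments.
        sentences = [s.strip() for s in re.split(r"[.!?]+", para) if s.strip()]
        if sentences:
            rows.append([len(s) for s in sentences])
    return rows


def features(matrix, k=3):
    """Assumed feature choice: the k longest sentences as (row, column, length)."""
    cells = [
        (length, r, c)
        for r, row in enumerate(matrix)
        for c, length in enumerate(row)
    ]
    # Longest first; ties are broken by position so the result is deterministic.
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    return {(r, c, length) for length, r, c in cells[:k]}


def similarity(text_a, text_b, k=3):
    """Jaccard overlap of the two feature sets (an assumed similarity measure)."""
    fa = features(structure_matrix(text_a), k)
    fb = features(structure_matrix(text_b), k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


page_a = (
    "Search engines index pages. Duplicates waste space.\n\n"
    "Near-replicas differ only slightly."
)
page_c = "Short one.\n\nTiny."

print(structure_matrix(page_a))
print(similarity(page_a, page_a))
print(similarity(page_a, page_c))
```

Comparing a handful of features instead of full page text is one plausible way to obtain the efficiency gain the abstract reports. Here `k` is fixed for simplicity, whereas the abstract describes the feature extraction as dynamic without giving the selection rule.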

Keywords

Internet; classification; text analysis; Web pages near-replicas; text structure analysis; text structure classification; Blogs; Data mining; Feature extraction; HTML; Indexing; Information analysis; Information technology; Navigation; Search engines; Web pages

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on

Conference_Location

Sydney, NSW

Print_ISBN

978-1-4244-2357-6

Electronic_ISBN

978-1-4244-2358-3

Type

conf

DOI

10.1109/CIT.2008.4594721

Filename

4594721

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2508429