DocumentCode :
2041267
Title :
Measuring similarity of web pages on maximum isomorphic subtree
Author :
Hu, Zhenyu ; Sun, Fuchun
Author_Institution :
Nat. Lab. of Inf. Sci. & Technol., Tsinghua Univ., Beijing, China
Volume :
5
fYear :
2010
fDate :
10-12 Aug. 2010
Firstpage :
2469
Lastpage :
2473
Abstract :
This paper studies the problem of comparing or searching for structured data in DOM trees. The proposed notion of a structure descriptor of an ordered tree fully represents the structural information of a DOM tree in serialized form, yielding an efficient method for converting a DOM tree into its node sequence. Based on this notion, the paper presents an algorithm that measures the similarity of two web pages by searching for maximum isomorphic subtrees in the serialized node sequences. When used to compare two web pages, the algorithm has a time complexity of O(n^2); when used to search for a given structured object in a web page, its complexity is O(n). Experimental results on a number of well-known web pages from diverse domains show that the proposed technique identifies similar structured objects very accurately.
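The serialization idea in the abstract can be illustrated with a minimal sketch. The abstract does not define the structure descriptor, so the code below assumes a stand-in: a preorder sequence of (tag, subtree size) pairs, which is enough to rebuild an ordered tree. Under that assumption every complete subtree is a contiguous slice of the sequence, and two subtrees are isomorphic as ordered labeled trees exactly when their slices are equal. The names `serialize`, `largest_common_subtree`, and `similarity`, as well as the normalization of the score, are hypothetical and not taken from the paper, and this naive version makes no claim to the O(n^2) or O(n) bounds stated above.

```python
from typing import List, Tuple

# Assumed tree encoding: (tag, children), children kept in document order.
Node = Tuple[str, list]
Descriptor = List[Tuple[str, int]]


def serialize(node: Node) -> Descriptor:
    """Return the preorder sequence of (tag, subtree_size) pairs."""
    tag, children = node
    seq = [(tag, 1)]
    for child in children:
        seq.extend(serialize(child))
    # The root's subtree size is the length of its own serialized slice.
    seq[0] = (tag, len(seq))
    return seq


def largest_common_subtree(a: Descriptor, b: Descriptor) -> Descriptor:
    """Return the largest slice that is a complete subtree in both sequences."""
    # The subtree rooted at index j occupies b[j : j + size_j].
    subtrees_b = {tuple(b[j:j + b[j][1]]) for j in range(len(b))}
    best: Descriptor = []
    for i in range(len(a)):
        size = a[i][1]
        if size > len(best) and tuple(a[i:i + size]) in subtrees_b:
            best = a[i:i + size]
    return best


def similarity(a: Descriptor, b: Descriptor) -> float:
    """Assumed score: shared subtree size over the larger page size."""
    return len(largest_common_subtree(a, b)) / max(len(a), len(b))


# Two toy pages that share a <ul> with two <li> children.
page_a = ("html", [("body", [("ul", [("li", []), ("li", [])]), ("p", [])])])
page_b = ("html", [("body", [("div", []), ("ul", [("li", []), ("li", [])])])])

seq_a = serialize(page_a)
seq_b = serialize(page_b)
print(largest_common_subtree(seq_a, seq_b))  # [('ul', 3), ('li', 1), ('li', 1)]
print(similarity(seq_a, seq_b))              # 0.5
```

Storing the subtree size next to each tag is what keeps the comparison purely sequential: no tree pointers are followed once the pages have been serialized.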
Keywords :
Internet; computational complexity; tree data structures; trees (mathematics); DOM tree; Web page; maximum isomorphic subtree; serialized data representation; Complexity theory; Discrete Fourier transforms; HTML; Low earth orbit satellites; Web pages; XML; isomorphic subtree; serialization; similarity
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on
Conference_Location :
Yantai, Shandong
Print_ISBN :
978-1-4244-5931-5
Type :
conf
DOI :
10.1109/FSKD.2010.5569792
Filename :
5569792