DocumentCode :
3098469
Title :
All common embedded subtrees for clustering XML documents by structure
Author :
Lin, Zhiwei ; Wang, Hui ; McClean, Sally ; Wang, Haiying
Author_Institution :
Fac. of Comput. & Eng., Univ. of Ulster, Coleraine, UK
Volume :
1
fYear :
2009
fDate :
12-15 July 2009
Firstpage :
13
Lastpage :
18
Abstract :
XML documents are tree-structured, and measuring the similarity of such tree structures plays a key role in XML clustering. To capture as much common information as possible for XML clustering, this paper investigates a novel similarity measure, the count of all common embedded subtrees of two trees, and its use for discovering latent hierarchical information. An efficient dynamic programming algorithm for counting all common embedded subtrees is proposed and analyzed theoretically. The all common embedded subtrees similarity is then used to define a dissimilarity measure for XML documents, which is evaluated in the standard hierarchical clustering framework on real XML documents. Experimental results show that the all common embedded subtrees measure outperforms the tree edit distance in clustering XML documents under the standard performance measures for clustering.
Keywords :
XML; dynamic programming; pattern clustering; trees (mathematics); XML documents clustering; dynamic programming algorithm; embedded subtrees; tree edit distance; tree structures; Cybernetics; Machine learning; all common embedded subtrees; clustering; tree similarity
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Cybernetics, 2009 International Conference on
Conference_Location :
Baoding
Print_ISBN :
978-1-4244-3702-3
Electronic_ISBN :
978-1-4244-3703-0
Type :
conf
DOI :
10.1109/ICMLC.2009.5212557
Filename :
5212557