DocumentCode
3098469
Title
All common embedded subtrees for clustering XML documents by structure
Author
Lin, Zhiwei ; Wang, Hui ; McClean, Sally ; Wang, Haiying
Author_Institution
Fac. of Comput. & Eng., Univ. of Ulster, Coleraine, UK
Volume
1
fYear
2009
fDate
12-15 July 2009
Firstpage
13
Lastpage
18
Abstract
XML documents are tree-structured, and measuring the similarity of such tree structures plays a key role in XML clustering. To capture as much common information as possible for XML clustering, this paper investigates a novel similarity measure, the count of all common embedded subtrees of two trees, and its use in discovering latent hierarchical information for XML clustering. An efficient dynamic programming algorithm for counting all common embedded subtrees is proposed and studied theoretically. The all common embedded subtrees similarity is then used to define a dissimilarity measure for XML documents. This dissimilarity measure is evaluated in the standard hierarchical clustering framework on real XML documents. Experimental results show that the all common embedded subtrees measure outperforms the tree edit distance in clustering XML documents under the standard performance measures for clustering.
Keywords
XML; dynamic programming; pattern clustering; trees (mathematics); XML documents clustering; dynamic programming algorithm; embedded subtrees; tree edit distance; tree structures; Cybernetics; Machine learning; all common embedded subtrees; clustering; tree similarity
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Cybernetics, 2009 International Conference on
Conference_Location
Baoding
Print_ISBN
978-1-4244-3702-3
Electronic_ISBN
978-1-4244-3703-0
Type
conf
DOI
10.1109/ICMLC.2009.5212557
Filename
5212557