  • DocumentCode
    3098469
  • Title

    All common embedded subtrees for clustering XML documents by structure

  • Author

    Lin, Zhiwei ; Wang, Hui ; McClean, Sally ; Wang, Haiying

  • Author_Institution
    Fac. of Comput. & Eng., Univ. of Ulster, Coleraine, UK
  • Volume
    1
  • fYear
    2009
  • fDate
    12-15 July 2009
  • Firstpage
    13
  • Lastpage
    18
  • Abstract
    XML documents are tree-structured, and measuring the similarity of such tree structures plays a key role in XML clustering. To capture as much common information as possible for XML clustering, this paper investigates a novel similarity measure, the count of all common embedded subtrees of two trees, and its use for discovering latent hierarchical information for XML clustering. An efficient dynamic programming algorithm for counting all common embedded subtrees is proposed and studied theoretically. The all common embedded subtrees similarity is employed in the definition of a dissimilarity measure for XML documents. This dissimilarity measure is evaluated in the standard hierarchical clustering framework on real XML documents. Experimental results show that the all common embedded subtrees measure outperforms the tree edit distance in clustering XML documents under the standard performance measures for clustering.
  • Keywords
    XML; dynamic programming; pattern clustering; trees (mathematics); XML documents clustering; dynamic programming algorithm; embedded subtrees; tree edit distance; tree structures; Cybernetics; Machine learning; all common embedded subtrees; clustering; tree similarity
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2009 International Conference on
  • Conference_Location
    Baoding
  • Print_ISBN
    978-1-4244-3702-3
  • Electronic_ISBN
    978-1-4244-3703-0
  • Type

    conf

  • DOI
    10.1109/ICMLC.2009.5212557
  • Filename
    5212557