• DocumentCode
    3034868
  • Title

    Towards Compromising Structural and Bag of Words Approaches for Clustering Heterogeneous XML Documents

  • Author

    Zerida, Nadia ; Yao, Jin

  • Author_Institution
    GREYC Lab., Univ. of Caen - Ensicaen, Caen
  • fYear
    2008
  • fDate
    Sept. 29 2008-Oct. 4 2008
  • Firstpage
    69
  • Lastpage
    72
  • Abstract
    The quantity of unlabeled documents on the web keeps increasing, and organizing related heterogeneous XML documents into clusters by using their structural and conceptual properties has become a great need. In this paper, we treat pre-processing as a key step for improving clustering quality, and we propose a new pre-processing method based on combining Hapax words and path-based descriptors. A constrained agglomerative clustering method is used, and different document representations are compared. The effectiveness of the method is evaluated on the INEX corpus, and clustering quality is measured using micro-average and macro-average purity measures.
  • Keywords
    XML; pattern clustering; text analysis; Hapax words; bag of words approach; clustering quality; conceptual properties; constrained agglomerative clustering; documents clustering; heterogeneous XML documents; macro average purity measure; micro average purity measure; path based descriptors; structural approach; structural properties; Clustering algorithms; Clustering methods; Computer applications; Fourier transforms; Information management; Information retrieval; Laboratories; Organizing; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Engineering Computing and Applications in Sciences, 2008. ADVCOMP '08. The Second International Conference on
  • Conference_Location
    Valencia
  • Print_ISBN
    978-0-7695-3369-8
  • Electronic_ISBN
    978-0-7695-3369-8
  • Type

    conf

  • DOI
    10.1109/ADVCOMP.2008.28
  • Filename
    4640995