• DocumentCode
    2770214
  • Title

    A Self-Organising Map Approach for Clustering of XML Documents

  • Author

    Trentini, Fabricio ; Hagenbuchner, M. ; Sperduti, Alessandro ; Scarselli, Franco

  • Author_Institution
    Siena Univ., Siena
  • fYear
    0
  • fDate
    0-0 0
  • Firstpage
    1805
  • Lastpage
    1812
  • Abstract
    The number of XML documents produced and available on the Internet is steadily increasing. It is thus important to devise automatic procedures that extract useful information from them with little or no intervention by a human operator. In this paper, we investigate the efficacy of an unsupervised learning approach, namely self-organising maps (SOMs), for the automatic clustering of XML documents. Specifically, we consider a relatively large corpus of XML-formatted data from the INEX initiative and evaluate it using two different self-organising map models. The first model is the classical SOM, which requires the XML documents to be represented by real-valued vectors obtained using a "bag of words" (or, better, a "bag of tags") approach. The other model is the SOM for structured data (SOM-SD), which can cluster structured data directly: the model can be fed tree-structured representations of the XML documents, thus explicitly preserving their structural information. The experimental results show that the classical SOM performs quite poorly on this problem domain, which requires the ability to encode structural properties of the data. The SOM-SD model, on the other hand, produces good clustering and generalization performance.
  • Keywords
    XML; document handling; self-organising feature maps; Internet; XML documents clustering; cluster structured data; self-organising map approach; unsupervised learning approach; Data mining; Feeds; HTML; Humans; Machine learning; Neural networks; Search engines; Unsupervised learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Neural Networks, 2006. IJCNN '06. International Joint Conference on
  • Conference_Location
    Vancouver, BC
  • Print_ISBN
    0-7803-9490-9
  • Type

    conf

  • DOI
    10.1109/IJCNN.2006.246898
  • Filename
    1716328