• DocumentCode
    2483983
  • Title

    Combining content and structure similarity for XML document classification using composite SVM kernels

  • Author

    Ghosh, Saptarshi ; Mitra, Pabitra

  • Author_Institution
    Comput. Sci. & Eng., IIT Kharagpur, Kharagpur
  • fYear
    2008
  • fDate
    8-11 Dec. 2008
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    Combining structure and content features is necessary for effective retrieval and classification of XML documents. Composite kernels provide a way to fuse content and structure information. In this paper, we demonstrate that a linear combination of simple, low-cost kernels, such as cosine similarity on terms and on selective paths, provides good classification performance. We also propose a corpus-driven, entropy-based heuristic for determining the optimal combination weights. Classification experiments on the INEX 1.3 XML corpus demonstrate that the composite kernel classifier achieves significantly better performance than complex and time-consuming approaches.
  • Keywords
    XML; classification; entropy; information retrieval; support vector machines; INEX 1.3 XML corpus; XML document classification; XML document retrieval; composite SVM kernel classifier; corpus-driven entropy-based heuristic; Classification tree analysis; Content based retrieval; Fourier transforms; HTML; Indexing; Kernel; Support vector machine classification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 2008. ICPR 2008. 19th International Conference on
  • Conference_Location
    Tampa, FL
  • ISSN
    1051-4651
  • Print_ISBN
    978-1-4244-2174-9
  • Electronic_ISBN
    1051-4651
  • Type

    conf

  • DOI
    10.1109/ICPR.2008.4761539
  • Filename
    4761539
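
The abstract describes a composite kernel formed as a linear combination of cosine similarity on terms and cosine similarity on selective paths. The sketch below illustrates only that general idea and is not the authors' implementation: the function names, the toy feature matrices, and the single mixing weight `alpha` are assumptions made for illustration, and the paper's entropy-based heuristic for choosing the weights is not reproduced here (the record does not specify it), so `alpha` is set by hand.

```python
import numpy as np


def cosine_kernel(A, B):
    """Cosine similarity between every row of A and every row of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    a_norm = np.linalg.norm(A, axis=1, keepdims=True)
    b_norm = np.linalg.norm(B, axis=1, keepdims=True)
    # Guard against all-zero rows (documents with no terms or no paths).
    a_norm[a_norm == 0.0] = 1.0
    b_norm[b_norm == 0.0] = 1.0
    return (A / a_norm) @ (B / b_norm).T


def composite_kernel(terms_a, paths_a, terms_b, paths_b, alpha):
    """Convex combination of a content kernel and a structure kernel.

    alpha is a hypothetical hand-set weight in [0, 1]; the paper instead
    derives its weights from a corpus-driven entropy heuristic.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    k_content = cosine_kernel(terms_a, terms_b)
    k_structure = cosine_kernel(paths_a, paths_b)
    return alpha * k_content + (1.0 - alpha) * k_structure


# Toy data (made up): 3 documents, 3 term features, 2 path features.
terms = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 1]])
paths = np.array([[1, 1], [1, 0], [0, 2]])

# Gram matrix over the toy documents with equal weighting.
K = composite_kernel(terms, paths, terms, paths, alpha=0.5)
```

A Gram matrix such as `K` could then be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Because both component kernels are positive semidefinite and the weights are non-negative, the combination remains a valid kernel.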