On Effective XML Clustering by Path Commonality: An Efficient and Scalable Algorithm

Author

Costa, Gianni ; Ortale, Riccardo

Author_Institution

ICAR, Rende, Italy

Volume

1

fYear

2012

fDate

7-9 Nov. 2012

Firstpage

389

Lastpage

396

Abstract

XML clustering by structure is, in its most general form, the process of partitioning a corpus of XML documents into disjoint clusters, such that intra-cluster structural homogeneity is high and inter-cluster structural homogeneity is low. In this paper, we propose an algorithm that implements a partitioning strategy, in which root-to-leaf paths are used to separate the XML documents. Paths are discriminatory substructures and, thus, the effectiveness of our algorithm is accordingly high. Moreover, a suitable encoding is adopted for representing and testing the occurrence of the individual paths within each XML document independently of the length of such paths. Not only this expedites clustering, but it also makes our algorithm scalable to process large-scale corpora of XML documents. A comparative evaluation over several standard (real-word and synthetic) XML corpora reveals that our algorithm outperforms all of its competitors in efficiency and scalability, while being as effective as the top-notch competitors. One especially appealing property of the proposed algorithm is that it achieves these levels of performance by automatically establishing a natural number of clusters to be discovered in the underlying XML corpus.

Keywords

XML; document handling; pattern clustering; XML clustering; XML corpora; XML documents; inter-cluster structural homogeneity; intra-cluster structural homogeneity; large-scale corpora; path commonality; root-to-leaf paths; top-notch competitors; Clustering algorithms; Complexity theory; Partitioning algorithms; Scalability; Standards; Vegetation; XML; Data Mining; XML clustering;

fLanguage

English

Publisher

ieee

Conference_Titel

Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on

Conference_Location

Athens

ISSN

1082-3409

Print_ISBN

978-1-4799-0227-9

Type

conf

DOI

10.1109/ICTAI.2012.60

Filename

6495072