Abstract :
XML co-clustering is a promising way to improve on the effectiveness of traditional XML clustering approaches, because it exploits the mutual relationships between XML documents and their XML features while clustering both simultaneously. To shed light on this so far unexplored research direction, we conduct a systematic study of the effectiveness of XML co-clustering, treating the task as parametric with respect to the XML features. We define and exploit three distinct types of XML features, informative respectively of the content, the structure, and both aspects of the XML documents. This allows an in-depth investigation of the three corresponding instances of the XML co-clustering task, i.e., XML co-clustering by content alone, by structure alone, and by both structure and content. XML co-clustering relies on a non-negative matrix trifactorization technique that efficiently processes large-scale input data, which is especially useful for large corpora of text-centric XML documents. The relevance of the structural and content features of the XML documents is assessed through a new weighting scheme. An extensive experimental evaluation on real-world benchmark XML corpora shows that XML co-clustering is more effective than state-of-the-art approaches to XML clustering. Insights are also provided on the effectiveness of XML feature clustering.
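To make the role of the factorization concrete, the following is a minimal sketch of co-clustering a document-by-feature matrix with a generic non-negative matrix trifactorization, X ≈ F S Gᵀ. It is an illustration under stated assumptions, not the algorithm evaluated in this work: the abstract does not give the update rules, so the sketch uses the standard unconstrained multiplicative updates for the Frobenius-norm objective, it omits the weighting scheme, and the function name `nmtf`, its parameters, and the toy matrix are all hypothetical.

```python
import numpy as np


def nmtf(X, k, l, n_iter=200, eps=1e-9, seed=0):
    """Generic NMTF sketch: X (docs x features) ~ F @ S @ G.T.

    F (docs x k) holds document-cluster memberships, G (features x l)
    holds feature-cluster memberships, S (k x l) links the two.
    Assumption: plain multiplicative updates, no orthogonality constraints.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k)) + eps
    S = rng.random((k, l)) + eps
    G = rng.random((m, l)) + eps
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled element-wise, so it stays non-negative.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors


# Hypothetical toy input: 6 documents x 8 features with two blocks,
# plus a little non-negative noise.
rng = np.random.default_rng(1)
X = np.zeros((6, 8))
X[:3, :4] = 1.0
X[3:, 4:] = 1.0
X += 0.05 * rng.random(X.shape)

F, S, G, errors = nmtf(X, k=2, l=2)

# Hard assignments: each document / feature goes to its strongest cluster.
doc_labels = F.argmax(axis=1)
feat_labels = G.argmax(axis=1)
print("document clusters:", doc_labels)
print("feature clusters:", feat_labels)
print("final reconstruction error:", errors[-1])
```

Reading the cluster labels off the row-wise maxima of F and G is one simple choice among several; the point of the sketch is only that documents and features are grouped by the same factorization, which is what distinguishes co-clustering from clustering the documents alone.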
Keywords :
XML; document handling; matrix decomposition; pattern clustering; XML clustering approach; XML co-clustering; XML coclustering; XML corpora; XML document coclustering; XML feature clustering; content feature; nonnegative matrix trifactorization; structural feature; text-centric XML document; Context; Electronic publishing; Encyclopedias; Vegetation; Semistructured Data Mining; XML Analysis