DocumentCode :
1803678
Title :
Semi-Supervised Clustering of XML Documents: Getting the Most from Structural Information
Author :
Bezerra, Eduardo ; Mattoso, Marta ; Xexéo, Geraldo
Author_Institution :
CEFET/RJ & COPPE/UFRJ, Brazil
fYear :
2006
fDate :
2006
Firstpage :
88
Lastpage :
88
Abstract :
As document providers can express more contextualized and complex information, semi-structured documents are becoming a major source of information in many areas, e.g., in digital libraries, e-commerce or Web applications. A particular characteristic of such document collections is the existence of some structure or metadata along with the data. In this scenario, clustering methods that can take advantage of such structural information to better organize such collections are highly relevant. Semi-structured documents pose new challenges to document clustering methods, however, since it is not clear how this structural information can be used to improve the quality of the generated clustering models. On the other hand, there has recently been growing interest in the semi-supervised clustering task, in which a small amount of prior knowledge is provided to guide the algorithm to a better clustering model. A particular type of semi-supervision takes the form of user-provided constraints defined over pairs of objects, where each pair indicates whether its objects must be in the same cluster or in different clusters. In this paper, we consider the problem of constrained clustering of documents that present some form of structural information. We consider a particular form of information to be clustered: textual documents that present a logical structure represented in XML format. We define and extend methods to improve the quality of clustering results by using such structural information to guide the execution of the constrained clustering algorithm. Experimental results on the OHSUMED document collection show the effectiveness of our approach.
Keywords :
Bellows; Clustering algorithms; Clustering methods; Conferences; Data engineering; Information resources; Partitioning algorithms; Software libraries; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering Workshops, 2006. Proceedings. 22nd International Conference on
Conference_Location :
Atlanta, GA, USA
Print_ISBN :
0-7695-2571-7
Type :
conf
DOI :
10.1109/ICDEW.2006.136
Filename :
1623883