Title :
Extracting Partitional Clusters from Heterogeneous Datasets using Mutual Entropy
Author :
Hossain, Mahmood ; Bridges, Susan ; Wang, Yong ; Hodges, Julia
Author_Institution :
Fairmont State Univ., Fairmont
Abstract :
Clustering has traditionally been used for partitioning the objects of a single dataset. Some applications may require the clustering of multiple related heterogeneous datasets where it may not be easy to compute a useful and effective integrated feature space. In this paper, we present an algorithm called CEMENT (Cluster Ensemble using Mutual ENTropy) to address the problem of clustering two related datasets where the datasets represent the same or overlapping sets of objects but use different feature sets. The algorithm takes the partitional clusters generated from two datasets as input and uses a constraint-based approach to generate a single set of clusters. CEMENT is an EM (expectation maximization) approach where the objective function is the mutual entropy between the two sets of clusters. The algorithm was applied to the problem of clustering a document collection consisting of journal abstracts from ten different Library of Congress categories. These documents were pre-processed using several NLP (natural language processing) steps to extract syntactic and semantic feature sets. We present empirical results and statistical tests showing that CEMENT yields higher quality clusters with this dataset than several baseline clustering approaches.
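The abstract names the mutual entropy between two sets of clusters as the objective function but does not define it. As an illustration only, the sketch below assumes the quantity corresponds to the Shannon mutual information between two cluster-label assignments over the same objects. The function name `mutual_information` and the toy labels are hypothetical; this is not the CEMENT algorithm itself, only one way such a score might be computed.

```python
from collections import Counter
from math import log2


def mutual_information(labels_a, labels_b):
    """Shannon mutual information (in bits) between two partitions.

    Assumption: labels_a[i] and labels_b[i] are the cluster labels that
    two different clusterings assign to the same object i.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    if n == 0:
        return 0.0

    count_a = Counter(labels_a)                   # cluster sizes in partition A
    count_b = Counter(labels_b)                   # cluster sizes in partition B
    count_ab = Counter(zip(labels_a, labels_b))   # co-occurrence counts

    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) )
        mi += (n_ab / n) * log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy labels for four documents under two feature sets.
syntactic_clusters = [0, 0, 1, 1]
semantic_clusters = ["x", "x", "y", "y"]
score = mutual_information(syntactic_clusters, semantic_clusters)
```

Under this assumption, two partitions that agree up to a renaming of cluster labels share all of their information, while two statistically independent partitions score zero.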
Keywords :
computational linguistics; document handling; expectation-maximisation algorithm; maximum entropy methods; natural language processing; pattern clustering; constraint-based approach; document collection; expectation maximization approach; heterogeneous dataset clustering; mutual entropy; objective function; semantic feature set extraction; statistical test; syntactic feature set extraction; Abstracts; Application software; Bridges; Clustering algorithms; Computer science; Data engineering; Entropy; Libraries; Natural language processing; Partitioning algorithms
Conference_Title :
Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
1-4244-1500-4
Electronic_ISBN :
1-4244-1500-4
DOI :
10.1109/IRI.2007.4296661