Extracting Partitional Clusters from Heterogeneous Datasets using Mutual Entropy

Author

Hossain, Mahmood ; Bridges, Susan ; Wang, Yong ; Hodges, Julia

Author_Institution

Fairmont State Univ., Fairmont

fYear

2007

fDate

13-15 Aug. 2007

Firstpage

447

Lastpage

454

Abstract

Clustering has traditionally been used for partitioning the objects of a single dataset. Some applications may require the clustering of multiple related heterogeneous datasets where it may not be easy to compute a useful and effective integrated feature space. In this paper, we present an algorithm called CEMENT (Cluster Ensemble using Mutual ENTropy) to address the problem of clustering two related datasets where the datasets represent the same or overlapping sets of objects but use different feature sets. The algorithm takes the partitional clusters generated from two datasets as input and uses a constraint-based approach to generate a single set of clusters. CEMENT is an EM (expectation maximization) approach where the objective function is the mutual entropy between the two sets of clusters. The algorithm was applied to the problem of clustering a document collection consisting of journal abstracts from ten different Library of Congress categories. These documents were pre-processed using several NLP (natural language processing) steps to extract syntactic and semantic feature sets. We present empirical results and statistical tests showing that CEMENT yields higher quality clusters with this dataset than several baseline clustering approaches.
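
As an illustration only (not taken from the paper), the objective named in the abstract can be sketched by assuming that "mutual entropy" corresponds to the standard mutual information I(A;B) between two cluster labelings of the same objects; the paper's exact definition may differ. The function name `mutual_information` and the toy labelings are hypothetical, and the sketch does not implement CEMENT's EM procedure or its constraints.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information, in bits, between two clusterings of the same objects.

    ASSUMPTION: this is the textbook definition
    I(A;B) = sum_{a,b} p(a,b) * log2(p(a,b) / (p(a) * p(b))),
    used here as a stand-in for the paper's "mutual entropy".
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    n = len(labels_a)
    if n == 0:
        raise ValueError("labelings must not be empty")

    count_a = Counter(labels_a)                  # cluster sizes in clustering A
    count_b = Counter(labels_b)                  # cluster sizes in clustering B
    count_ab = Counter(zip(labels_a, labels_b))  # sizes of pairwise overlaps

    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a,b) / (p(a) p(b)) simplifies to n_ab * n / (n_a * n_b)
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy data: four objects clustered twice, from two feature sets.
syntactic = [0, 0, 1, 1]
semantic_agree = ["x", "x", "y", "y"]   # same grouping, different label names
semantic_indep = [0, 1, 0, 1]           # grouping unrelated to the first one

print(mutual_information(syntactic, semantic_agree))  # 1.0
print(mutual_information(syntactic, semantic_indep))  # 0.0
```

Two clusterings that agree up to label names share the most information, while unrelated clusterings share none, which is why a quantity of this kind is a natural score for reconciling the cluster sets produced from two feature spaces.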

Keywords

computational linguistics; document handling; expectation-maximisation algorithm; maximum entropy methods; natural language processing; pattern clustering; constraint-based approach; document collection; expectation maximization approach; heterogeneous dataset clustering; mutual entropy; natural language processing; objective function; semantic feature set extraction; statistical test; syntactic feature set extraction; Abstracts; Application software; Bridges; Clustering algorithms; Computer science; Data engineering; Entropy; Libraries; Natural language processing; Partitioning algorithms

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on

Conference_Location

Las Vegas, NV

Print_ISBN

1-4244-1500-4

Electronic_ISBN

1-4244-1500-4

Type

conf

DOI

10.1109/IRI.2007.4296661

Filename

4296661