• DocumentCode
    3334862
  • Title

    Extracting Partitional Clusters from Heterogeneous Datasets using Mutual Entropy

  • Author

    Hossain, Mahmood ; Bridges, Susan ; Wang, Yong ; Hodges, Julia

  • Author_Institution
    Fairmont State Univ., Fairmont
  • fYear
    2007
  • fDate
    13-15 Aug. 2007
  • Firstpage
    447
  • Lastpage
    454
  • Abstract
    Clustering has traditionally been used for partitioning the objects of a single dataset. Some applications may require the clustering of multiple related heterogeneous datasets where it may not be easy to compute a useful and effective integrated feature space. In this paper, we present an algorithm called CEMENT (Cluster Ensemble using Mutual ENTropy) to address the problem of clustering two related datasets where the datasets represent the same or overlapping sets of objects but use different feature sets. The algorithm takes the partitional clusters generated from two datasets as input and uses a constraint-based approach to generate a single set of clusters. CEMENT is an EM (expectation maximization) approach where the objective function is the mutual entropy between the two sets of clusters. The algorithm was applied to the problem of clustering a document collection consisting of journal abstracts from ten different Library of Congress categories. These documents were pre-processed using several NLP (natural language processing) steps to extract syntactic and semantic feature sets. We present empirical results and statistical tests showing that CEMENT yields higher quality clusters with this dataset than several baseline clustering approaches.
  • Keywords
    computational linguistics; document handling; expectation-maximisation algorithm; maximum entropy methods; natural language processing; pattern clustering; constraint-based approach; document collection; expectation maximization approach; heterogeneous dataset clustering; mutual entropy; objective function; semantic feature set extraction; statistical test; syntactic feature set extraction; Abstracts; Application software; Bridges; Clustering algorithms; Computer science; Data engineering; Entropy; Libraries; Natural language processing; Partitioning algorithms
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on
  • Conference_Location
    Las Vegas, NV
  • Print_ISBN
    1-4244-1500-4
  • Electronic_ISBN
    1-4244-1500-4
  • Type

    conf

  • DOI
    10.1109/IRI.2007.4296661
  • Filename
    4296661