  • DocumentCode
    5670
  • Title
    Constrained Text Coclustering with Supervised and Unsupervised Constraints
  • Author
    Yangqiu Song; Shimei Pan; Shixia Liu; Furu Wei; Michelle X. Zhou; Weihong Qian
  • Author_Institution
    Microsoft Research Asia, Beijing, China
  • Volume
    25
  • Issue
    6
  • fYear
    2013
  • fDate
    June 2013
  • Firstpage
    1227
  • Lastpage
    1239
  • Abstract
    In this paper, we propose a novel constrained coclustering method to achieve two goals. First, we combine information-theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve our first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints. We then use an alternating expectation-maximization (EM) algorithm to optimize the model. To achieve our second goal, we propose two novel methods to automatically construct and incorporate document and word constraints to support unsupervised constrained clustering: 1) document constraints are constructed automatically from overlapping named entities (NEs) extracted by an NE extractor; 2) word constraints are constructed automatically from the semantic distance between words inferred from WordNet. The results of our evaluation over two benchmark data sets show that our approaches outperform a number of existing approaches.
  • Keywords
    constraint handling; document handling; expectation-maximisation algorithm; hidden Markov models; pattern clustering; EM; HMRF; NE extractor; WordNet; constrained coclustering method; constrained text coclustering; document constraints; expectation maximization algorithm; information-theoretic coclustering; knowledge sources; manually labeled constraints; overlapping named entities; semantic distance; supervised constraints; two-sided hidden Markov random field model; unsupervised constraints; word constraints; Clustering algorithms; Clustering methods; Computational modeling; Hidden Markov models; Humans; Semantics; Sparse matrices; Constrained clustering; coclustering; text clustering; unsupervised constraints
  • fLanguage
    English
  • Journal_Title
    IEEE Transactions on Knowledge and Data Engineering
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2012.45
  • Filename
    6165284
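
The abstract's first unsupervised constraint source, document constraints built from overlapping named entities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `must_links_from_entities`, the `min_overlap` threshold, and the toy entity sets are all assumptions made for illustration, and the entities are taken as already extracted by some NE extractor. The sketch only shows the general idea of proposing a must-link between two documents that share enough named entities.

```python
from itertools import combinations


def must_links_from_entities(doc_entities, min_overlap=2):
    """Propose must-link constraints between documents sharing named entities.

    doc_entities: list of sets, one set of entity strings per document
                  (assumed to come from an external NE extractor).
    min_overlap:  hypothetical threshold on the number of shared entities.
    Returns a list of (i, j, shared_count) tuples with i < j.
    """
    links = []
    for i, j in combinations(range(len(doc_entities)), 2):
        shared = doc_entities[i] & doc_entities[j]
        if len(shared) >= min_overlap:
            links.append((i, j, len(shared)))
    return links


# Toy input: three documents with hand-written entity sets.
docs = [
    {"Microsoft", "Beijing", "Bill Gates"},
    {"Microsoft", "Bill Gates", "Seattle"},
    {"WordNet", "Princeton"},
]

print(must_links_from_entities(docs))  # [(0, 1, 2)]
```

The shared-entity count is returned alongside each pair so that it could serve as a constraint weight; how constraints are actually weighted and incorporated into the HMRF model is described in the paper itself, not in this record.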