• DocumentCode
    8309
  • Title
    A Novel Variable-order Markov Model for Clustering Categorical Sequences
  • Author
    Tengke Xiong ; Shengrui Wang ; Qingshan Jiang ; Joshua Zhexue Huang
  • Author_Institution
    Shenzhen Inst. of Adv. Technol., Shenzhen, China
  • Volume
    26
  • Issue
    10
  • fYear
    2014
  • fDate
    Oct. 2014
  • Firstpage
    2339
  • Lastpage
    2353
  • Abstract
    Clustering categorical sequences is an important and difficult data mining task. Despite recent efforts, the challenge remains, due to the lack of an inherently meaningful measure of pairwise similarity. In this paper, we propose a novel variable-order Markov framework, named weighted conditional probability distribution (WCPD), to model clusters of categorical sequences. We propose an efficient and effective approach to solve the challenging problem of model initialization. To initialize the WCPD model, we propose to use a first-order Markov model built on a weighted fuzzy indicator vector representation of categorical sequences, which we call the WFI Markov model. Based on a cascade optimization framework that combines the WCPD and WFI models, we design a new divisive hierarchical clustering algorithm for clustering categorical sequences. Experimental results on data sets from three different domains demonstrate the promising performance of our models and clustering algorithm.
  • Keywords
    Markov processes; data mining; fuzzy set theory; optimisation; pattern clustering; WCPD; WFI Markov model; cascade optimization framework; clustering algorithm; clustering categorical sequences; novel variable-order Markov model; weighted conditional probability distribution; weighted fuzzy indicator vector representation; Clustering algorithms; Data models; Hidden Markov models; Numerical models; Probability; Silicon; Clustering; Computing Methodologies; Data mining; Database Applications; Database Management; Information Technology and Systems; Models; Pattern Recognition; Statistical; Statistical model; categorical sequence; clustering; similarity measure
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2013.104
  • Filename
    6547142