• DocumentCode
    1680897
  • Title
    Using Correlation Based Subspace Clustering for Multi-label Text Data Classification
  • Author
    Ahmed, Mohammad Salim ; Khan, Latifur ; Rajeswari, Mandava
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Texas at Dallas, Dallas, TX, USA
  • Volume
    2
  • fYear
    2010
  • Firstpage
    296
  • Lastpage
    303
  • Abstract
    With the boom of the web and social networking, the amount of generated text data has increased enormously. Much of this data can be considered and modeled as a stream, and the volume of such data necessitates the application of automated text classification strategies. Although streaming data classification is not new, text data streams have been extensively researched for classification purposes only recently. Before applying any classification method to text data streams, it is imperative that we apply it to existing well-known non-stream text data sets and evaluate its performance. One of the many characteristics of text data that has been pursued for research is its multi-labelity: a single text document may cover multiple class labels at the same time. From a classification perspective, an immediate drawback of this characteristic is that traditional binary or multi-class classification techniques perform poorly on multi-label text data. In this paper, we extend our previously formulated SISC (Semi-supervised Impurity based Subspace Clustering) [1] approach and its multi-label variation SISC-ML [2]. We call this new algorithm H-SISC (Hierarchical SISC). H-SISC captures the underlying correlation that exists between each pair of class labels in a multi-label environment. Developing a robust multi-label classifier will allow us to apply such a model to classifying streaming text data more effectively. We have experimented with well-known text data sets, and empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System), Reuters, and 20 Newsgroups data sets reveals that our proposed approach outperforms other state-of-the-art text classification and subspace clustering algorithms.
  • Keywords
    correlation methods; pattern classification; pattern clustering; aviation safety reporting system; correlation; hierarchical SISC; multilabel text data classification; semisupervised impurity based subspace clustering; state of the art text classification; streaming data classification; Aircraft; Clustering algorithms; Correlation; Impurities; Indexing; Speech recognition; Support vector machines; Cluster Impurity; Fuzzy Clustering; Subspace Clustering; Text Data Classification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Tools with Artificial Intelligence (ICTAI), 2010 22nd IEEE International Conference on
  • Conference_Location
    Arras
  • ISSN
    1082-3409
  • Print_ISBN
    978-1-4244-8817-9
  • Type
    conf
  • DOI
    10.1109/ICTAI.2010.115
  • Filename
    5670092