Title :
Using Correlation Based Subspace Clustering for Multi-label Text Data Classification
Author :
Ahmed, Mohammad Salim ; Khan, Latifur ; Rajeswari, Mandava
Author_Institution :
Dept. of Comput. Sci., Univ. of Texas at Dallas, Dallas, TX, USA
Abstract :
With the boom of the web and social networking, the amount of generated text data has increased enormously. Much of this data can be modeled as a stream, and its volume necessitates automated text classification strategies. Although streaming data classification is not new, classification of text data streams has been researched extensively only recently. Before applying any classification method to text data streams, it is imperative to apply it to existing well-known non-stream text data sets and evaluate its performance. One of the many characteristics of text data that has been pursued in research is its multi-labelity: a single text document may cover multiple class labels at the same time. From a classification perspective, an immediate drawback of this characteristic is that traditional binary or multi-class classification techniques perform poorly on multi-label text data. In this paper, we extend our previously formulated SISC (Semi-supervised Impurity based Subspace Clustering) [1] approach and its multi-label variation SISC-ML [2]. We call the new algorithm H-SISC (Hierarchical SISC). H-SISC captures the underlying correlation that exists between each pair of class labels in a multi-label environment. Developing a robust multi-label classifier will allow us to apply such a model to classify streaming text data more effectively. Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System), Reuters and 20 Newsgroups data sets reveals that our proposed approach outperforms other state-of-the-art text classification and subspace clustering algorithms.
Keywords :
correlation methods; pattern classification; pattern clustering; aviation safety reporting system; correlation; hierarchical SISC; multilabel text data classification; semisupervised impurity based subspace clustering; state of the art text classification; streaming data classification; Aircraft; Clustering algorithms; Correlation; Impurities; Indexing; Speech recognition; Support vector machines; Cluster Impurity; Fuzzy Clustering; Subspace Clustering; Text Data Classification;
Conference_Title :
Tools with Artificial Intelligence (ICTAI), 2010 22nd IEEE International Conference on
Conference_Location :
Arras
Print_ISBN :
978-1-4244-8817-9
DOI :
10.1109/ICTAI.2010.115