Title of article

Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering

Author/Authors

Bharti, Kusum Kumari and Singh, Pramod Kumar

Issue Information

Journal with serial issue number, year 2015

Pages

From page

3105

To page

3114

Abstract

High dimensionality of the feature space is a major concern in text clustering because it affects both computational complexity and accuracy. Various dimension reduction methods have therefore been introduced in the literature to select an informative subset (or sublist) of features. Because each dimension reduction method uses a different strategy (aspect) to select features, different methods produce different feature sublists for the same dataset. Hence, hybrid approaches, which combine different aspects of feature relevance for feature subset selection, have received considerable attention. Traditionally, union or intersection is used to merge feature sublists selected with different methods. The union approach selects all features from the considered sublists, which increases the total number of features; the intersection approach selects only the features common to them, which loses some important features. Therefore, to exploit the advantages of one method and lessen the drawbacks of the other, a novel integration approach named modified union is proposed. This approach applies union to the selected top-ranked features and intersection to the remaining feature sublists. It thus ensures the selection of top-ranked as well as common features without greatly increasing the dimensionality of the feature space. In this study, the feature selection methods term variance (TV) and document frequency (DF) are used to compute feature relevance scores. Next, a feature extraction method, principal component analysis (PCA), is applied to further reduce the dimensionality of the feature space without losing much information. The effectiveness of the proposed method is tested on three benchmark datasets, namely Reuters-21578, Classic4, and WebKB. The obtained results are compared with TV, DF, and variants of the proposed hybrid dimension reduction method. The experimental studies demonstrate that the proposed method improves clustering accuracy compared with the competing methods.
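
The merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how many features count as "top ranked" or how large each sublist is, so the parameters `size` and `top_k`, the function names, and the toy document-term matrix below are all assumptions made for illustration. Term variance is taken here as the sum of squared deviations of a term's frequency from its mean over documents, and document frequency as the number of documents containing the term; both are common definitions, assumed rather than quoted from the paper.

```python
from statistics import mean


def term_variance(column):
    """Sum of squared deviations of a term's frequency from its mean (assumed TV definition)."""
    m = mean(column)
    return sum((f - m) ** 2 for f in column)


def document_frequency(column):
    """Number of documents in which the term occurs at least once."""
    return sum(1 for f in column if f > 0)


def ranked_sublist(scores, size):
    """Indices of the `size` highest-scoring features, best first (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:size]


def modified_union(sublist_a, sublist_b, top_k):
    """Union of the top_k features of each sublist, plus the intersection of the remainders.

    `top_k` is a hypothetical cut-off; the abstract does not specify how it is chosen.
    """
    top = set(sublist_a[:top_k]) | set(sublist_b[:top_k])
    common_rest = set(sublist_a[top_k:]) & set(sublist_b[top_k:])
    return sorted(top | common_rest)


# Toy document-term matrix (rows = documents, columns = terms); made-up data.
docs = [
    [3, 0, 1, 0, 2, 1],
    [0, 0, 1, 5, 2, 0],
    [4, 1, 1, 0, 2, 0],
    [0, 0, 1, 0, 2, 1],
]
columns = list(zip(*docs))  # one tuple of frequencies per term

tv_sublist = ranked_sublist([term_variance(c) for c in columns], size=4)
df_sublist = ranked_sublist([document_frequency(c) for c in columns], size=4)

plain_union = sorted(set(tv_sublist) | set(df_sublist))
plain_intersection = sorted(set(tv_sublist) & set(df_sublist))
selected = modified_union(tv_sublist, df_sublist, top_k=1)

print("TV sublist:", tv_sublist)
print("DF sublist:", df_sublist)
print("union:", plain_union)
print("intersection:", plain_intersection)
print("modified union:", selected)
```

On this toy data the modified union keeps four features, between the six kept by plain union and the two kept by plain intersection, which matches the trade-off the abstract describes. In the described pipeline, the columns selected this way would then be passed to PCA; that step is omitted from the sketch.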

Keywords

Document frequency, Principal component analysis, Feature extraction, Feature selection, Term variance, Text clustering

Journal title

Expert Systems with Applications

Serial Year

2015

Record number

2355750

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=2355750