Mining the Big Data: The Critical Feature Dimension Problem

Author

Liu, Qingzhong; Sung, Andrew H.; Ribeiro, Bernardete; Suryakumar, Divya

Author_Institution

Dept. of Comput. Sci., Sam Houston State Univ., Huntsville, TX, USA

fYear

2014

fDate

Aug. 31 2014-Sept. 4 2014

Firstpage

499

Lastpage

504

Abstract

In mining massive datasets, two of the most important and immediate problems are often sampling and feature selection. Proper sampling and feature selection contribute to reducing the size of the dataset while still obtaining satisfactory results in model building. Theoretically, therefore, it is interesting to investigate whether a given dataset possesses a critical feature dimension, that is, a minimum number of features required for a given learning machine to achieve "satisfactory" performance. (Likewise, the critical sampling size problem concerns whether, for a given dataset, there is a minimum number of data points that must be included in any sample for a learning machine to achieve satisfactory performance.) Here the specific meaning of "satisfactory" performance is to be defined by the user. This paper addresses the complexity of both problems in one general theoretical setting and shows that they have the same complexity and are highly intractable. Next, an empirical method is applied in an attempt to find the approximate critical feature dimension of datasets. It is demonstrated that, under generally reasonable assumptions pertaining to feature ranking algorithms, the critical feature dimension is successfully discovered by the empirical method for a number of datasets of various sizes. The results are encouraging in that they achieve significant feature size reduction, and they point to a promising way of dealing with big data. The significance of the existence of a critical dimension in datasets is also explained.
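The abstract does not spell out the empirical method, so the following is only a minimal sketch of one plausible reading: given features already ordered by some ranking algorithm, scan prefixes of the ranking and report the smallest prefix whose performance meets the user-defined threshold. The function name `critical_feature_dimension`, the `evaluate` callback, and the toy scoring table are all hypothetical and are not taken from the paper; in practice `evaluate` would train and test a real learning machine.

```python
from typing import Callable, Optional, Sequence


def critical_feature_dimension(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    threshold: float,
) -> Optional[int]:
    """Return the smallest k such that the top-k ranked features reach
    the user-defined "satisfactory" threshold, or None if no prefix does.

    Assumption (not from the paper): features arrive sorted from most to
    least important, and only prefixes of that ranking are examined.
    """
    for k in range(1, len(ranked_features) + 1):
        if evaluate(ranked_features[:k]) >= threshold:
            return k
    return None


# Hypothetical stand-in for "train a learning machine and measure accuracy":
# each feature contributes a fixed, made-up amount to the score.
TOY_GAIN = {"f1": 0.5, "f2": 0.25, "f3": 0.125, "f4": 0.0625}


def toy_accuracy(features: Sequence[str]) -> float:
    """Sum the made-up contributions of the selected features."""
    return sum(TOY_GAIN[f] for f in features)


ranking = ["f1", "f2", "f3", "f4"]  # assumed output of a feature ranker
print(critical_feature_dimension(ranking, toy_accuracy, threshold=0.85))
```

A linear scan keeps the sketch simple and does not require performance to grow monotonically with the number of features. It costs one evaluation per prefix, so a real experiment on a large dataset would likely need a coarser search.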

Keywords

Big Data; data mining; feature selection; learning (artificial intelligence); Big Data mining; feature dimension problem; feature ranking algorithms; learning machine; Accuracy; Classification algorithms; Complexity theory; Educational institutions; Electromagnetic interference; Vectors; critical dimension; dimension reduction; feature ranking; machine learning

fLanguage

English

Publisher

IEEE

Conference_Titel

Advanced Applied Informatics (IIAI-AAI), 2014 IIAI 3rd International Conference on

Conference_Location

Kitakyushu

Print_ISBN

978-1-4799-4174-2

Type

conf

DOI

10.1109/IIAI-AAI.2014.105

Filename

6913349