DocumentCode
259297
Title
Mining the Big Data: The Critical Feature Dimension Problem
Author
Qingzhong Liu ; Sung, Andrew H. ; Ribeiro, Bernardete ; Suryakumar, Divya
Author_Institution
Dept. of Comput. Sci., Sam Houston State Univ., Huntsville, TX, USA
fYear
2014
fDate
Aug. 31 2014-Sept. 4 2014
Firstpage
499
Lastpage
504
Abstract
In mining massive datasets, two of the most important and immediate problems are often sampling and feature selection. Proper sampling and feature selection contribute to reducing the size of the dataset while obtaining satisfactory results in model building. Theoretically, therefore, it is interesting to investigate whether a given dataset possesses a critical feature dimension, that is, the minimum number of features required for a given learning machine to achieve "satisfactory" performance. (Likewise, the critical sampling size problem concerns whether, for a given dataset, there is a minimum number of data points that must be included in any sample for a learning machine to achieve satisfactory performance.) Here the specific meaning of "satisfactory" performance is to be defined by the user. This paper addresses the complexity of both problems in one general theoretical setting and shows that they have the same complexity and are highly intractable. Next, an empirical method is applied in an attempt to find the approximate critical feature dimension of datasets. It is demonstrated that, under generally reasonable assumptions pertaining to feature ranking algorithms, the critical feature dimension is successfully discovered by the empirical method for a number of datasets of various sizes. The results are encouraging in achieving significant feature size reduction and point to a promising way of dealing with big data. The significance of the existence of a critical dimension in datasets is also explained.
Keywords
Big Data; data mining; feature selection; learning (artificial intelligence); Big Data mining; feature dimension problem; feature ranking algorithms; feature selection; learning machine; Accuracy; Classification algorithms; Complexity theory; Data mining; Educational institutions; Electromagnetic interference; Vectors; critical dimension; data mining; dimension reduction; feature ranking; machine learning;
fLanguage
English
Publisher
IEEE
Conference_Titel
Advanced Applied Informatics (IIAIAAI), 2014 IIAI 3rd International Conference on
Conference_Location
Kitakyushu
Print_ISBN
978-1-4799-4174-2
Type
conf
DOI
10.1109/IIAI-AAI.2014.105
Filename
6913349
Link To Document