• DocumentCode
    259297
  • Title
    Mining the Big Data: The Critical Feature Dimension Problem
  • Author
    Qingzhong Liu ; Sung, Andrew H. ; Ribeiro, Bernardete ; Suryakumar, Divya
  • Author_Institution
    Dept. of Comput. Sci., Sam Houston State Univ., Huntsville, TX, USA
  • fYear
    2014
  • fDate
    Aug. 31 2014-Sept. 4 2014
  • Firstpage
    499
  • Lastpage
    504
  • Abstract
    In mining massive datasets, two of the most important and immediate problems are often sampling and feature selection. Proper sampling and feature selection contribute to reducing the size of the dataset while obtaining satisfactory results in model building. Theoretically, therefore, it is interesting to investigate whether a given dataset possesses a critical feature dimension, that is, the minimum number of features required for a given learning machine to achieve "satisfactory" performance. (Likewise, the critical sampling size problem concerns whether, for a given dataset, there is a minimum number of data points that must be included in any sample for a learning machine to achieve satisfactory performance.) Here the specific meaning of "satisfactory" performance is to be defined by the user. This paper addresses the complexity of both problems in one general theoretical setting and shows that they have the same complexity and are highly intractable. Next, an empirical method is applied in an attempt to find the approximate critical feature dimension of datasets. It is demonstrated that, under generally reasonable assumptions pertaining to feature ranking algorithms, the critical feature dimension is successfully discovered by the empirical method for a number of datasets of various sizes. The results are encouraging in achieving significant feature size reduction and point to a promising way of dealing with big data. The significance of the existence of a critical dimension in datasets is also explained.
  • Keywords
    Big Data; data mining; feature selection; learning (artificial intelligence); Big Data mining; feature dimension problem; feature ranking algorithms; learning machine; Accuracy; Classification algorithms; Complexity theory; Educational institutions; Electromagnetic interference; Vectors; critical dimension; dimension reduction; feature ranking; machine learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Applied Informatics (IIAI-AAI), 2014 IIAI 3rd International Conference on
  • Conference_Location
    Kitakyushu
  • Print_ISBN
    978-1-4799-4174-2
  • Type
    conf
  • DOI
    10.1109/IIAI-AAI.2014.105
  • Filename
    6913349
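
The empirical search described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes the features have already been ranked by some feature ranking algorithm, and that a user-supplied scoring function and threshold define "satisfactory" performance. The names `critical_feature_dimension`, `evaluate`, `threshold`, and the toy weights are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def critical_feature_dimension(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    threshold: float,
) -> Optional[int]:
    """Return the smallest k such that the top-k ranked features score
    at least `threshold`, or None if no prefix of the ranking does.

    Assumption: `ranked_features` is ordered from most to least important,
    and `evaluate` returns a performance score for a feature subset
    (for example, cross-validated accuracy of a learning machine).
    """
    for k in range(1, len(ranked_features) + 1):
        if evaluate(ranked_features[:k]) >= threshold:
            return k
    return None


# Toy stand-in for training and testing a model: each feature contributes
# a fixed, made-up amount to the score. These numbers are illustrative only.
TOY_WEIGHTS = {"f1": 0.5, "f2": 0.3, "f3": 0.15, "f4": 0.05}


def toy_evaluate(features: Sequence[str]) -> float:
    return sum(TOY_WEIGHTS[name] for name in features)


ranking = ["f1", "f2", "f3", "f4"]
k_found = critical_feature_dimension(ranking, toy_evaluate, threshold=0.9)
k_missing = critical_feature_dimension(ranking, toy_evaluate, threshold=2.0)
```

With the toy weights, the first two features fall short of 0.9 and the third crosses it, so the search stops at three features; an unreachable threshold yields `None`. A linear scan only examines prefixes of one ranking, so the result is an approximation that depends on the quality of that ranking, which is consistent with the abstract's remark that the exact problem is intractable.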