• DocumentCode
    659583
  • Title

    How data partitioning strategies and subset size influence the performance of an ensemble?

  • Author

    Farrash, Majed ; Wenjia Wang

  • Author_Institution
    Sch. of Comput. Sci., Univ. of East Anglia, Norwich, UK
  • fYear
    2013
  • fDate
    6-9 Oct. 2013
  • Firstpage
    42
  • Lastpage
    49
  • Abstract
    When dealing with big data, “divide and conquer” is the strategy most commonly used in practice: a big dataset is partitioned into subsets small enough that each can be handled by a single computer or by one node of a cluster or cloud computing system. However, among the many existing partitioning and sampling techniques, it is not clear which one is suitable, or how the subset size may affect the performance of further analysis. In this paper, after presenting a generic framework of an ensemble approach for learning from big data, we focus on systematically evaluating the effect of partitioning strategies and subset size on ensemble performance. The experimental results demonstrate that the three investigated partitioning/sampling strategies behaved statistically similarly, but that subset size may affect ensemble performance in drastically different ways, which are grouped into three patterns, rather than following the single default perception that bigger is better.
  • Keywords
    cloud computing; data handling; pattern clustering; big dataset; cloud computing systems; cluster systems; data partitioning strategies; subset size; Accuracy; Partitioning algorithms; Radiation detectors; Round robin; Support vector machines; Testing; Training; Big data; ensemble learning; partitioning; subset size
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data, 2013 IEEE International Conference on
  • Conference_Location
    Silicon Valley, CA
  • Type

    conf

  • DOI
    10.1109/BigData.2013.6691732
  • Filename
    6691732