Title :
How data partitioning strategies and subset size influence the performance of an ensemble?
Author :
Farrash, Majed; Wang, Wenjia
Author_Institution :
Sch. of Comput. Sci., Univ. of East Anglia, Norwich, UK
Abstract :
When dealing with big data, “divide and conquer” is the strategy most commonly used in practice: a big dataset is partitioned into smaller subsets so that each subset can be handled by a single computer or by a node of a cluster or cloud computing system. However, among the many existing partitioning and sampling techniques, it is not clear which one is suitable or how subset size may affect the performance of subsequent analysis. In this paper, after presenting a generic framework for an ensemble approach to learning from big data, we focus on systematically evaluating the effect of partitioning strategies and subset size on ensemble performance. The experimental results demonstrate that the three investigated partitioning/sampling strategies behaved statistically similarly, but that subset size may affect ensemble performance in drastically different ways. These effects fall into three patterns, rather than following the single default perception that bigger is better.
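As an illustration only, the divide-and-conquer idea in the abstract can be sketched as follows. The abstract does not name the three strategies that were evaluated; round robin appears in the keywords, and random partitioning is included here purely as an assumed point of comparison. The function names (`round_robin_partition`, `random_partition`, `majority_vote`) and the majority-vote combiner are hypothetical choices for this sketch, not the authors' implementation.

```python
import random
from collections import Counter


def round_robin_partition(items, n_subsets):
    """Deal items out in turn: item i goes to subset i % n_subsets."""
    if n_subsets < 1:
        raise ValueError("n_subsets must be >= 1")
    return [items[i::n_subsets] for i in range(n_subsets)]


def random_partition(items, n_subsets, seed=0):
    """Shuffle a copy of items, then deal it out so the subsets stay disjoint."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return round_robin_partition(shuffled, n_subsets)


def majority_vote(predictions):
    """Combine one predicted label per ensemble member into a single label."""
    return Counter(predictions).most_common(1)[0][0]


data = list(range(10))
print(round_robin_partition(data, 3))  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(majority_vote(["a", "b", "a"]))  # a
```

In such a setup, subset size is controlled through `n_subsets`: more subsets mean smaller training sets per ensemble member, which is the variable whose effect the paper reports as falling into three patterns.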
Keywords :
cloud computing; data handling; pattern clustering; big dataset; cloud computing systems; cluster systems; data partitioning strategies; subset size; Accuracy; Partitioning algorithms; Radiation detectors; Round robin; Support vector machines; Testing; Training; Big data; ensemble learning; partitioning; subset size
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
DOI :
10.1109/BigData.2013.6691732