Title :
Optimising Bootstrapping Algorithms Using R and Hadoop
Author :
Shicai Wang ; Mares, Mihaela A. ; Yike Guo
Author_Institution :
Data Sci. Inst., Imperial Coll. London, London, UK
fDate :
June 29 2015-July 2 2015
Abstract :
A key research problem in machine learning and statistics today is feature or variable selection when the number of samples is small relative to the number of features. Resampling methods such as the Bootstrap are used in this context to mimic the availability of multiple datasets by resampling from the same unique dataset. On one hand, some algorithms based on resampling, such as Bolasso, have been shown to decrease error as the number of bootstrap replicates increases. On the other hand, we expect an increase in dataset size in most research domains. Therefore there is a demand for a large number of algorithm runs on several data replicates, and with the expected increase in dataset sizes, high-performance parallel optimisation becomes mandatory. In this paper, we introduce an efficient data distribution and load-balanced parallel calculation for the Bolasso algorithm based on R and HDFS. We study the performance on a large dataset consisting of 300 samples and 10000 features. The performance evaluation found that the new R on HDFS approach, implemented with Snowfall and RHDFS, outperforms the conventional algorithm on Linux EXT4. We conclude that R on HDFS holds great promise for methods based on resampling or bootstrapping, in particular when increasing the number of dataset replications decreases the algorithm error, as demonstrated in the performance evaluation in this paper.
Keywords :
computer bootstrapping; data handling; parallel processing; Bolasso algorithm; HDFS; Hadoop; Linux EXT4; bootstrapping algorithms; data distribution; dataset replications; feature selection; load balanced parallel calculation; machine learning; performance evaluation; resampling methods; variable selection; Distributed databases; Dynamic scheduling; Performance evaluation; Prediction algorithms; Schedules; Snow; Sparks;
Conference_Titel :
Distributed Computing Systems Workshops (ICDCSW), 2015 IEEE 35th International Conference on
Conference_Location :
Columbus, OH
DOI :
10.1109/ICDCSW.2015.34