DocumentCode :
3781772
Title :
An Empirical Comparison of Three Ensemble Methods for Medical Data Mining with Apache Spark
Author :
Yiang Hua;Jian Pan;Zhaofeng Yan;Yunwei Qiu
Author_Institution :
Coll. of Comput. Sci. &
fYear :
2015
Firstpage :
917
Lastpage :
922
Abstract :
Medical data come in many organizational forms and are voluminous and heterogeneous, so parallel computing platforms are valuable for speeding up the data mining procedure. In addition, ensemble methods, which combine different weak classifiers, can improve classification accuracy on parallel computing platforms. In this paper, three ensemble methods (Bagging, AdaBoost and LogitBoost) with logistic regression as the weak classifier are implemented with Apache Spark to achieve better parallel computing performance and take full advantage of RDDs (Resilient Distributed Datasets). A series of experiments is carried out in different execution modes to evaluate and compare the classification performance and the parallelism of these ensemble methods. Experimental results indicate that although Bagging is slightly inferior to AdaBoost and LogitBoost in classification accuracy, it achieves better parallelism than the other two methods. Finally, selection criteria for these ensemble methods are presented according to the specific medical application scenario.
Keywords :
"Sparks","Bagging","Parallel processing","Logistics","Training","Data mining","Classification algorithms"
Publisher :
IEEE
Conference_Title :
2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom)
Type :
conf
DOI :
10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.175
Filename :
7518354