An Empirical Comparison of Three Ensemble Methods for Medical Data Mining with Apache Spark

Author

Yiang Hua; Jian Pan; Zhaofeng Yan; Yunwei Qiu

Author_Institution

Coll. of Comput. Sci. &

fYear

2015

Firstpage

917

Lastpage

922

Abstract

Medical data come in many organizational forms and are both voluminous and heterogeneous, so parallel computing platforms are a natural way to speed up the data mining procedure. In addition, ensemble methods, which combine multiple weak classifiers, can improve classification accuracy on such platforms. In this paper, three ensemble methods (Bagging, AdaBoost, and LogitBoost), each using logistic regression as the weak classifier, are implemented on Apache Spark to achieve better parallel computing performance and to take full advantage of RDDs (Resilient Distributed Datasets). A series of experiments is carried out in different execution modes to evaluate and compare the classification performance and the parallelism of these ensemble methods. The experimental results indicate that although Bagging is slightly inferior to AdaBoost and LogitBoost in classification accuracy, it achieves better parallelism than the other two methods. Finally, criteria for selecting among these ensemble methods are presented for specific medical application scenarios.
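
As an illustration of the Bagging scheme the abstract describes, the following is a minimal single-process sketch, not the paper's Spark implementation. It trains several logistic regression models on bootstrap resamples and combines them by majority vote. All names (`train_logistic`, `bagging_fit`, `bagging_predict`, `ROWS`, `LABELS`), the toy data, and the hyperparameters are assumptions made for this example only. Each bootstrap model is trained independently of the others, which is consistent with the abstract's observation that Bagging parallelizes better than the boosting methods.

```python
import math
import random

# Toy one-feature dataset (assumed for illustration; not from the paper).
ROWS = [[-3.0], [-2.5], [-2.0], [-1.5], [1.5], [2.0], [2.5], [3.0]]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, epochs=200, lr=0.1):
    """Fit logistic regression with plain batch gradient descent."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_one(model, x):
    """Predict class 1 when the model's probability is at least 0.5."""
    w, b = model
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 1 if p >= 0.5 else 0


def bagging_fit(rows, labels, n_models=5, seed=0):
    """Train one weak classifier per bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        # Sample n indices with replacement (the bootstrap step).
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(
            train_logistic([rows[i] for i in idx], [labels[i] for i in idx])
        )
    return models


def bagging_predict(models, x):
    """Majority vote over the ensemble; a tie is resolved toward class 1."""
    votes = sum(predict_one(m, x) for m in models)
    return 1 if 2 * votes >= len(models) else 0
```

A call such as `bagging_predict(bagging_fit(ROWS, LABELS), [2.0])` returns the ensemble's vote for a single feature vector. In a Spark setting the per-model training loop would presumably be distributed across workers, but that mapping is not shown here.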

Keywords

"Sparks","Bagging","Parallel processing","Logistics","Training","Data mining","Classification algorithms"

Publisher

IEEE

Conference_Title

2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom)

Type

conf

DOI

10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.175

Filename

7518354