• DocumentCode
    3781772
  • Title
    An Empirical Comparison of Three Ensemble Methods for Medical Data Mining with Apache Spark
  • Author
    Yiang Hua;Jian Pan;Zhaofeng Yan;Yunwei Qiu
  • Author_Institution
    Coll. of Comput. Sci. &
  • fYear
    2015
  • Firstpage
    917
  • Lastpage
    922
  • Abstract
    Medical data come in various organizational forms and are voluminous and heterogeneous, so parallel computing platforms are valuable for speeding up the data mining procedure. In addition, ensemble methods, which combine different weak classifiers, can improve classification accuracy on parallel computing platforms. In this paper, three ensemble methods (Bagging, AdaBoost and LogitBoost) with logistic regression as the weak classifier are implemented on Apache Spark to achieve better parallel computing performance and take full advantage of resilient distributed datasets (RDDs). A series of experiments is carried out in different execution modes to evaluate and compare the classification performance and the parallelism of these ensemble methods. Experimental results indicate that although Bagging is slightly inferior to AdaBoost and LogitBoost in classification accuracy, it achieves better parallelism than the other two methods. Finally, selection criteria for these ensemble methods are presented according to the specific medical application scenario.
  • Keywords
    "Sparks","Bagging","Parallel processing","Logistics","Training","Data mining","Classification algorithms"
  • Publisher
    ieee
  • Conference_Titel
    2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom)
  • Type
    conf
  • DOI
    10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.175
  • Filename
    7518354