• DocumentCode
    659603
  • Title

    Colon cancer survival prediction using ensemble data mining on SEER data

  • Author

    Al-Bahrani, Reda ; Agrawal, Ankit ; Choudhary, Alok

  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., Northwestern Univ., Evanston, IL, USA
  • fYear
    2013
  • fDate
    6-9 Oct. 2013
  • Firstpage
    9
  • Lastpage
    16
  • Abstract
    We analyze the colon cancer data available from the SEER program with the aim of developing accurate survival prediction models for colon cancer. Carefully designed preprocessing steps resulted in the removal of several attributes, after which we applied several supervised classification methods. We also adopt the synthetic minority over-sampling technique (SMOTE) to balance the survival and non-survival classes. In our experiments, ensemble voting of three of the top-performing classifiers was found to give the best prediction performance in terms of prediction accuracy and area under the ROC curve. We evaluated multiple classification schemes to estimate the risk of mortality 1 year, 2 years, and 5 years after diagnosis on three data sets: a subset of 65 attributes remaining after the data cleanup process, 13 attributes carefully selected using attribute selection techniques, and a SMOTE-balanced set of the same 13 attributes, while trying to retain the predictive power of the original set of attributes. Moreover, we demonstrate the importance of balancing the classes of the data set to yield better results.
  • Keywords
    cancer; data analysis; data mining; medical computing; pattern classification; ROC curve; SEER data; SEER program; SMOTE balanced set; Surveillance, Epidemiology, and End Results Program; attribute selection techniques; colon cancer data; colon cancer survival prediction; ensemble data mining; multiple classification schemes; prediction accuracy; supervised classification methods; synthetic minority over-sampling technique; Accuracy; Cancer; Colon; Data mining; Decision trees; Logistics; Predictive models; Colon Cancer; Ensemble; Machine Learning; Prediction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data, 2013 IEEE International Conference on
  • Conference_Location
    Silicon Valley, CA
  • Type

    conf

  • DOI
    10.1109/BigData.2013.6691752
  • Filename
    6691752
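
The abstract names two techniques, SMOTE balancing and ensemble voting over three classifiers, without giving details in this record. The following is a minimal sketch of both ideas in plain Python, not the authors' implementation: the function names (`majority_vote`, `ensemble_predict`, `smote_like_oversample`), the toy threshold classifiers, and the choice of `k=2` neighbours are all assumptions made for illustration, and the oversampler is a simplified form of SMOTE that interpolates between a minority point and one of its nearest minority neighbours.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, sample):
    """Ask every classifier for a label and combine them by majority vote.

    `classifiers` is a list of callables (hypothetical stand-ins for
    trained models) that each map a sample to a class label.
    """
    return majority_vote([clf(sample) for clf in classifiers])


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic minority points (simplified SMOTE).

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + t * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy usage: three threshold "classifiers" voting on a 2-feature sample.
toy_classifiers = [
    lambda s: 1 if s[0] > 0.5 else 0,
    lambda s: 1 if s[1] > 0.5 else 0,
    lambda s: 1 if s[0] + s[1] > 1.0 else 0,
]
vote = ensemble_predict(toy_classifiers, (0.9, 0.4))

# Toy usage: add five synthetic points to a three-point minority class.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like_oversample(minority_points, n_new=5)
```

With three voters and two classes a tie cannot occur, which is one practical reason to combine an odd number of classifiers. In real use the oversampling would be applied to the training split only, so that synthetic points do not leak into the evaluation data.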