DocumentCode :
680732
Title :
Maximizing Classification Performance for Patient Response Datasets
Author :
Dittman, David J. ; Khoshgoftaar, Taghi M. ; Wald, Randall ; Napolitano, Antonio
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2013
fDate :
4-6 Nov. 2013
Firstpage :
454
Lastpage :
462
Abstract :
The ability to predict a patient's response to a treatment has long been a goal in the fields of medicine and pharmacology. This is especially true for cancer treatments, as many of these incur extreme side effects as a consequence of destroying healthy cells along with cancerous ones. Gene profiles such as DNA microarrays could potentially contain information on which treatments are most likely to work with minimal side effects. However, DNA microarray datasets can be challenging due to the large number of features (genes) per sample, many of which are irrelevant or redundant. Techniques from the domain of data mining may help both identify the most important features and build classification models using those features. This paper is a comprehensive study on the relative performance of many different feature selection approaches and classification models when applied to fifteen patient response datasets. We use six classifiers along with twelve feature subset sizes and twenty-five feature selection techniques. Our results show that the Random Forest classifier is the top performing classifier in terms of both average results across all feature selection techniques and when using the best-performing feature selection technique, and also had the smallest range between the best and worst performing feature selection techniques. Additionally, we found that for the average and best feature selection technique performance, as the feature subset size increases, the classification performance increases. Finally, we found that different feature selection techniques dominated performance for different feature subset sizes, and likewise the worst performers also depended on the chosen feature subset size. Statistical analysis was conducted to further validate our results. Overall, based on our results we would recommend the use of Random Forest along with a feature selection technique (the choice not being statistically significant) that reduces the feature set to around 1000 features, in order to both maximize classification performance and remove one step (choosing an appropriate feature ranking technique) from the process.
Keywords :
DNA; cancer; cellular biophysics; data mining; feature selection; genetics; medical computing; molecular biophysics; patient treatment; pattern classification; random processes; statistical analysis; DNA microarray datasets; cancer treatment; classification model performance maximization; data mining; feature selection approach; feature subset size; gene profile; medicine; minimal side effects; patient response datasets; patient response prediction; pharmacology; random forest classifier; statistical analysis; DNA; Data mining; Data models; Logistics; Predictive models; Support vector machines; Vegetation; Classifiers; DNA Microarray; Patient Response; Random Forest;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on
Conference_Location :
Herndon, VA
ISSN :
1082-3409
Print_ISBN :
978-1-4799-2971-9
Type :
conf
DOI :
10.1109/ICTAI.2013.74
Filename :
6735285