Title :
SVM Learning from Imbalanced Data by GA Sampling for Protein Domain Prediction
Author :
Zou, Shuxue ; Huang, Yanxin ; Wang, Yan ; Wang, Jianxin ; Zhou, Chunguang
Author_Institution :
Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun
Abstract :
Although support vector machines (SVMs) have been studied extensively and have shown remarkable success in many applications, their performance drops significantly on imbalanced datasets. Some researchers have pointed out that this decrease is difficult to avoid when one tries to improve the effectiveness of SVMs on imbalanced datasets by modifying the algorithm alone. Sampling, as a data preprocessing step, is therefore a popular strategy for handling the class imbalance problem, since it rebalances the dataset directly. In this paper, we propose a novel sampling method based on genetic algorithms (GA) to rebalance the imbalanced training dataset for SVM. To evaluate the final classifiers more impartially, the AUC (area under the ROC curve) is employed as the fitness function in the GA. The experimental results show that the GA-based sampling strategy outperforms the random sampling method, and that our method is superior to an individual SVM for imbalanced protein domain boundary prediction. The accuracy of the prediction is about 70%, with an AUC value of 0.905.
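The idea of GA-driven sampling with AUC as the fitness function can be illustrated with a minimal sketch. The abstract does not specify the chromosome encoding, the genetic operators, or whether the method under-samples or over-samples, so everything below is an assumption made for illustration: a chromosome is taken to be a bit mask over the majority-class training examples (under-sampling), selection is truncation with elitism, crossover is one-point, and a simple nearest-centroid scorer stands in for the SVM so that the example needs only the Python standard library. The names `auc`, `train_centroid_scorer`, and `ga_undersample` are hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def train_centroid_scorer(X, y):
    """Stand-in for the SVM (assumption): score a point by how much
    closer it is to the positive centroid than to the negative one."""
    dim = len(X[0])

    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    c_pos, c_neg = centroid(1), centroid(0)

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return d_neg - d_pos

    return score


def ga_undersample(X_min, X_maj, X_val, y_val, pop_size=20,
                   generations=15, mutation_rate=0.02, seed=0):
    """Search for a subset of majority examples (bit mask) that
    maximizes validation AUC. Requires pop_size >= 4, len(X_maj) >= 2."""
    rng = random.Random(seed)
    n = len(X_maj)

    def fitness(mask):
        kept = [x for x, bit in zip(X_maj, mask) if bit]
        if not kept:  # no negatives left: the scorer cannot be trained
            return 0.0
        X = X_min + kept
        y = [1] * len(X_min) + [0] * len(kept)
        score = train_centroid_scorer(X, y)
        return auc([score(x) for x in X_val], y_val)

    # Initial masks keep roughly as many majority as minority examples.
    keep_prob = min(1.0, len(X_min) / n)
    population = [[1 if rng.random() < keep_prob else 0 for _ in range(n)]
                  for _ in range(pop_size)]
    best_mask, best_fit = None, -1.0
    for _ in range(generations):
        scored = sorted(((fitness(m), m) for m in population),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best_mask = scored[0][0], scored[0][1][:]
        parents = [m for _, m in scored[:pop_size // 2]]  # truncation selection
        children = [scored[0][1][:]]                      # elitism
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)                     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]                    # bit-flip mutation
            children.append(child)
        population = children
    return best_mask, best_fit
```

In this sketch, minority examples carry label 1 and majority examples label 0. Calling `ga_undersample(X_min, X_maj, X_val, y_val)` returns the best mask found and its validation AUC; the majority examples whose mask bit is 1, together with all minority examples, would then form the rebalanced training set for the final classifier. Using AUC rather than accuracy as the fitness keeps the search from rewarding masks that merely favor the majority class.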
Keywords :
bioinformatics; genetic algorithms; learning (artificial intelligence); pattern classification; proteins; sampling methods; support vector machines; SVM learning; fitness function; genetic algorithm; imbalanced data; pattern classification; protein domain prediction; sampling method; support vector machine; Accuracy; Data processing; Educational institutions; Genetic algorithms; Kernel; Machine learning; Protein engineering; Sampling methods; Support vector machine classification; Support vector machines; GA; Imbalanced data; Protein domain prediction; SVM; Sampling;
Conference_Title :
Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for
Conference_Location :
Hunan
Print_ISBN :
978-0-7695-3398-8
Electronic_ISBN :
978-0-7695-3398-8
DOI :
10.1109/ICYCS.2008.72