DocumentCode :
1784897
Title :
Ensemble-based semi-supervised learning approaches for imbalanced splice site datasets
Author :
Stanescu, Ana ; Caragea, Doina
Author_Institution :
Dept. of Comput. & Inf. Sci., Kansas State Univ., Manhattan, KS, USA
fYear :
2014
fDate :
2-5 Nov. 2014
Firstpage :
432
Lastpage :
437
Abstract :
Producing accurate classifiers depends on the quality and quantity of labeled data. The lack of labeled data, due to its expensive generation, critically affects the application of machine learning algorithms to biological problems. However, unlabeled data can be acquired faster and in larger quantities thanks to current biochemical technologies, known as Next Generation Sequencing. In such cases, when the number of labeled instances is overwhelmed by the number of unlabeled instances, semi-supervised learning represents a cost-effective alternative that can improve supervised classifiers by utilizing unlabeled data. In practice, data often exhibits imbalanced class distributions, which is an obstacle for both supervised and semi-supervised learning. The problem of supervised learning from imbalanced datasets has been extensively studied, and various solutions have been proposed to produce classifiers with optimal performance on highly skewed class distributions. In the case of semi-supervised learning, fewer efforts have been aimed at the imbalanced data problem. In this paper, we study several ensemble-based semi-supervised learning approaches for predicting splice sites, a problem for which the imbalance ratio is very high. We run experiments on five imbalanced datasets with the goal of identifying which variants are the most effective.
Keywords :
biology computing; data handling; learning (artificial intelligence); pattern classification; biochemical technologies; biological problems; cost-effective alternative; ensemble-based semisupervised learning; highly skewed class distributions; imbalance data problem; imbalance ratio; imbalanced class distributions; imbalanced splice site datasets; machine learning algorithms; next generation sequencing; optimal performance; supervised classifiers; unlabeled data; unlabeled instances; DNA; Organisms; Proteins; Semisupervised learning; Supervised learning; Support vector machines; Training; ensemble; imbalanced datasets; self-training; semi-supervised learning;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on
Conference_Location :
Belfast
Type :
conf
DOI :
10.1109/BIBM.2014.6999196
Filename :
6999196