Ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

Author

Stanescu, Ana; Caragea, Doina

Author_Institution

Dept. of Comput. & Inf. Sci., Kansas State Univ., Manhattan, KS, USA

fYear

2014

fDate

2-5 Nov. 2014

Firstpage

432

Lastpage

437

Abstract

Producing accurate classifiers depends on the quality and quantity of labeled data. Labeled data is expensive to generate, and its scarcity critically affects the application of machine learning algorithms to biological problems. Unlabeled data, however, can be acquired faster and in larger quantities thanks to current biochemical technologies known as Next Generation Sequencing. In such cases, when unlabeled instances far outnumber labeled instances, semi-supervised learning represents a cost-effective alternative that can improve supervised classifiers by utilizing unlabeled data. In practice, data often exhibits imbalanced class distributions, which are an obstacle for both supervised and semi-supervised learning. The problem of supervised learning from imbalanced datasets has been extensively studied, and various solutions have been proposed to produce classifiers with optimal performance on highly skewed class distributions. For semi-supervised learning, far fewer efforts have addressed the imbalanced data problem. In this paper, we study several ensemble-based semi-supervised learning approaches for predicting splice sites, a problem for which the imbalance ratio is very high. We run experiments on five imbalanced datasets with the goal of identifying which variants are the most effective.
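The abstract does not spell out the variants studied, so the following is only a minimal illustrative sketch of one way to combine an ensemble with self-training on an imbalanced dataset; it is not the authors' method. Everything in it is an assumption made for illustration: the nearest-centroid base learner, the balanced undersampling of the majority class for each ensemble member, the rule of adding equally many pseudo-labeled positives and negatives per round, and all function and parameter names.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def score(model, x):
    """Positive-class score in [0, 1]; higher means closer to the positive centroid."""
    c_pos, c_neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
    if d_pos + d_neg == 0:
        return 0.5
    return d_neg / (d_pos + d_neg)


def train_ensemble(pos, neg, n_members, rng):
    """Each member sees all positives plus an equally sized random subset of negatives."""
    k = min(len(pos), len(neg))
    return [(centroid(pos), centroid(rng.sample(neg, k))) for _ in range(n_members)]


def ensemble_score(models, x):
    """Average the member scores."""
    return sum(score(m, x) for m in models) / len(models)


def self_train(pos, neg, unlabeled, n_members=5, rounds=3, per_round=2, seed=0):
    """Repeatedly pseudo-label the most confident unlabeled points and retrain."""
    rng = random.Random(seed)
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(rounds):
        if len(pool) < 2 * per_round:
            break  # not enough unlabeled points left for a balanced addition
        models = train_ensemble(pos, neg, n_members, rng)
        order = sorted(range(len(pool)), key=lambda i: ensemble_score(models, pool[i]))
        neg_idx, pos_idx = order[:per_round], order[-per_round:]
        neg += [pool[i] for i in neg_idx]   # lowest scores become pseudo-negatives
        pos += [pool[i] for i in pos_idx]   # highest scores become pseudo-positives
        chosen = set(neg_idx) | set(pos_idx)
        pool = [x for i, x in enumerate(pool) if i not in chosen]
    return train_ensemble(pos, neg, n_members, rng), pos, neg


# Toy data (made up): 2 labeled positives, 8 labeled negatives, 6 unlabeled points.
toy_pos = [(1.0, 1.0), (0.9, 1.1)]
toy_neg = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (0.0, 0.2),
           (0.2, 0.0), (-0.2, -0.1), (0.1, 0.1), (-0.1, -0.2)]
toy_unlabeled = [(1.1, 0.9), (0.95, 0.95), (0.05, 0.05),
                 (-0.05, 0.1), (0.1, 0.0), (0.0, -0.1)]

models, pos_out, neg_out = self_train(toy_pos, toy_neg, toy_unlabeled,
                                      rounds=1, per_round=2)
print(len(pos_out), len(neg_out), round(ensemble_score(models, (1.0, 1.0)), 3))
```

In this sketch the undersampling step keeps each ensemble member from being dominated by the majority class, while the balanced pseudo-labeling rule is one simple way to avoid amplifying the imbalance during self-training. Forcing a fixed number of pseudo-positives per round can mislabel points when few true positives remain in the pool, so a confidence threshold would be a natural alternative.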

Keywords

biology computing; data handling; learning (artificial intelligence); pattern classification; biochemical technologies; biological problems; cost-effective alternative; ensemble-based semisupervised learning; highly skewed class distributions; imbalance data problem; imbalance ratio; imbalanced class distributions; imbalanced splice site datasets; machine learning algorithms; next generation sequencing; optimal performance; supervised classifiers; unlabeled data; unlabeled instances; DNA; Organisms; Proteins; Semisupervised learning; Supervised learning; Support vector machines; Training; ensemble; imbalanced datasets; self-training; semi-supervised learning

fLanguage

English

Publisher

IEEE

Conference_Titel

Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on

Conference_Location

Belfast

Type

conf

DOI

10.1109/BIBM.2014.6999196

Filename

6999196