• DocumentCode
    1784897
  • Title
    Ensemble-based semi-supervised learning approaches for imbalanced splice site datasets
  • Author
    Stanescu, Ana ; Caragea, Doina
  • Author_Institution
    Dept. of Comput. & Inf. Sci., Kansas State Univ., Manhattan, KS, USA
  • fYear
    2014
  • fDate
    2-5 Nov. 2014
  • Firstpage
    432
  • Lastpage
    437
  • Abstract
    Producing accurate classifiers depends on the quality and quantity of labeled data. The lack of labeled data, which is expensive to generate, critically affects the application of machine learning algorithms to biological problems. However, unlabeled data can be acquired faster and in larger quantities thanks to current biochemical technologies known as Next Generation Sequencing. In such cases, when unlabeled instances far outnumber labeled instances, semi-supervised learning represents a cost-effective alternative that can improve supervised classifiers by utilizing unlabeled data. In practice, data often exhibits imbalanced class distributions, which are an obstacle for both supervised and semi-supervised learning. The problem of supervised learning from imbalanced datasets has been extensively studied, and various solutions have been proposed to produce classifiers with optimal performance on highly skewed class distributions. For semi-supervised learning, far fewer efforts have addressed the imbalanced data problem. In this paper, we study several ensemble-based semi-supervised learning approaches for predicting splice sites, a problem for which the imbalance ratio is very high. We run experiments on five imbalanced datasets with the goal of identifying which variants are the most effective.
  • Keywords
    biology computing; data handling; learning (artificial intelligence); pattern classification; biochemical technologies; biological problems; cost-effective alternative; ensemble-based semisupervised learning; highly skewed class distributions; imbalance data problem; imbalance ratio; imbalanced class distributions; imbalanced splice site datasets; machine learning algorithms; next generation sequencing; optimal performance; supervised classifiers; unlabeled data; unlabeled instances; DNA; Organisms; Proteins; Semisupervised learning; Supervised learning; Support vector machines; Training; ensemble; imbalanced datasets; self-training; semi-supervised learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on
  • Conference_Location
    Belfast
  • Type
    conf
  • DOI
    10.1109/BIBM.2014.6999196
  • Filename
    6999196