  • DocumentCode
    31771
  • Title
    Sample Subset Optimization Techniques for Imbalanced and Ensemble Learning Problems in Bioinformatics Applications
  • Author
    Yang, Pengyi ; Yoo, Paul D. ; Fernando, Jude ; Zhou, Bing Bing ; Zhang, Zili ; Zomaya, Albert Y.
  • Author_Institution
    School of Information Technologies, University of Sydney, Sydney, NSW, Australia
  • Volume
    44
  • Issue
    3
  • fYear
    2014
  • fDate
    March 2014
  • Firstpage
    445
  • Lastpage
    455
  • Abstract
    Data sampling is a widely used technique in a broad range of machine learning problems. Traditional sampling approaches generally rely on random resampling from a given dataset. However, these approaches do not take into consideration additional information, such as sample quality and usefulness. We recently proposed a data sampling technique, called sample subset optimization (SSO). The SSO technique relies on a cross-validation procedure for identifying and selecting the most useful samples as subsets. In this paper, we describe the application of SSO techniques to imbalanced and ensemble learning problems, respectively. For imbalanced learning, the SSO technique is employed as an under-sampling technique for identifying a subset of highly discriminative samples in the majority class. In ensemble learning, the SSO technique is utilized as a generic ensemble technique where multiple optimized subsets of samples from each class are selected for building an ensemble classifier. We demonstrate the utilities and advantages of the proposed techniques on a variety of bioinformatics applications where class imbalance, small sample size, and noisy data are prevalent.
  • Keywords
    bioinformatics; learning (artificial intelligence); optimisation; pattern classification; SSO technique; bioinformatics applications; cross-validation procedure; data sampling; ensemble classifier; ensemble learning problems; imbalanced learning problems; random resampling; sample quality; sample subset optimization techniques; under-sampling technique; usefulness; Bioinformatics; Optimization; Protein engineering; Proteins; Sociology; Statistics; Training; Bioinformatics applications; ensemble learning; imbalanced learning; sample subset optimization (SSO); under-sampling;
  • fLanguage
    English
  • Journal_Title
    Cybernetics, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    2168-2267
  • Type
    jour
  • DOI
    10.1109/TCYB.2013.2257480
  • Filename
    6615954