  • DocumentCode
    57915
  • Title
    Imbalanced Protein Data Classification Using Ensemble FTM-SVM
  • Author
    Hong-Liang Dai
  • Author_Institution
    Sch. of Math. & Stat., Guangdong Univ. of Finance & Econ., Guangzhou, China
  • Volume
    14
  • Issue
    4
  • fYear
    2015
  • fDate
    Jun-15
  • Firstpage
    350
  • Lastpage
    359
  • Abstract
    Classification of protein sequences into functional and structural families using machine learning methods is an active research topic in machine learning and bioinformatics. The underlying protein classification problem is a very large multiclass problem, which can generally be reduced to a set of binary classification problems: the proteins in one class are treated as positive examples, and those outside the class are treated as negative examples. This reduction gives rise to a class imbalance problem, because the number of proteins in one class is usually much smaller than the number of proteins outside it. To handle this challenge, we propose a novel framework for protein classification. We first use free scores (FS) to extract features from the proteins; then, inverse random under sampling (IRUS) is used to create a large number of distinct training sets; next, we use a new ensemble approach to combine these distinct training sets with a new fuzzy total margin support vector machine (FTM-SVM) that we have constructed. We call the resulting ensemble classifier the ensemble fuzzy total margin support vector machine (EnFTM-SVM). We then give a full description of our method, including the details of its derivation. Finally, experimental results on fourteen benchmark protein data sets indicate that the proposed method outperforms many state-of-the-art protein classification methods.
  • Keywords
    bioinformatics; feature extraction; proteins; proteomics; support vector machines; binary classification problems; ensemble FTM-SVM; ensemble fuzzy total margin support vector machine; inverse random under sampling; machine learning methods; protein sequence classification; Hidden Markov models; Noise; Protein sequence; Training; Class imbalance; classification; ensemble; protein; support vector machine (SVM)
  • fLanguage
    English
  • Journal_Title
    NanoBioscience, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1536-1241
  • Type
    jour
  • DOI
    10.1109/TNB.2015.2431292
  • Filename
    7104161
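
The abstract in this record describes two data-preparation steps: reducing the multiclass problem to binary problems, and using inverse random under sampling (IRUS) to build many distinct training sets. The following is a minimal sketch of those two steps only, not the paper's implementation. It assumes that IRUS keeps every minority (positive) example and draws fewer majority (negative) examples than there are positives for each set. All function names and parameters (`one_vs_rest_labels`, `irus_training_sets`, `majority_per_set`, `majority_vote`) are hypothetical, and simple majority voting is a placeholder for the ensemble combination rule, which the abstract does not specify. The FS feature extraction and the FTM-SVM base classifier are not reproduced here.

```python
import random


def one_vs_rest_labels(families, target):
    """Relabel a multiclass problem as binary: +1 for the target family, -1 otherwise."""
    return [1 if f == target else -1 for f in families]


def irus_training_sets(labels, n_sets, majority_per_set, seed=0):
    """Build n_sets index sets, each with all positives and a few sampled negatives.

    Assumption: the majority sample is kept smaller than the minority class,
    so the class ratio inside each set is inverted.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == -1]
    if not 0 < majority_per_set < len(minority):
        raise ValueError("majority_per_set must be positive and below the minority count")
    sets = []
    for _ in range(n_sets):
        sampled = rng.sample(majority, majority_per_set)  # sampling without replacement
        sets.append(sorted(minority + sampled))
    return sets


def majority_vote(predictions):
    """Placeholder combiner: vote over per-model +1/-1 predictions for one example."""
    return 1 if sum(predictions) >= 0 else -1
```

In use, one base classifier would be trained on each index set returned by `irus_training_sets`, and their per-example outputs would be passed to a combiner such as `majority_vote`.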