• DocumentCode
    167293
  • Title
    Multiclass unbalanced protein data classification using sequence features
  • Author
    Vani, K. Suvarna ; Sravani, T.D.
  • Author_Institution
    Dept. of Comput. Sci. & Eng., V.R. Siddhartha Eng. Coll., Vijayawada, India
  • fYear
    2014
  • fDate
    21-24 May 2014
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    Protein fold classification is one of the challenging problems in bioinformatics. This work addresses protein fold classification using sequence features, which is a multi-class problem with unbalanced classes. A simple and computationally inexpensive feature extraction algorithm is proposed to extract novel features from the primary sequences. It is found that the Support Vector Machine (SVM), which can be effectively extended from a binary to a multi-class classifier, does not perform well on this problem. Hence, to improve performance, the SMOTE oversampling technique of Chawla et al. [17] is applied to rebalance the data set, and different classifiers, such as the J48 [15] decision tree classifier, are then used to classify folds from the sequence features. The classification is performed across the four major protein structural classes as well as among the different folds within the classes. The results obtained are promising and validate the simple methodology of boosting to obtain improved performance on the fold classification problem using features derived from the sequences alone. The approach is to extract features from the protein sequences and pass the extracted feature set to the improved oversampling method, which reduces the imbalance present in that feature set. To handle the multiple classes, we use different boosting algorithms, such as AdaBoost and LogitBoost, which handle multi-class data sets effectively.
  • Keywords
    bioinformatics; decision trees; feature extraction; learning (artificial intelligence); molecular biophysics; molecular configurations; pattern classification; proteins; support vector machines; Adaboost; J48; Logitboost; SMOTE technique; SVM; bioinformatics; boosting algorithm; computationally inexpensive algorithm; dataset rebalance; decision tree classifier; feature extraction algorithm; fold classification problem; major protein structural classes; multiclass classifier; multiclass problem; multiclass unbalanced protein data classification; oversampling method; protein fold classification; sequence features; support vector machine; unbalanced classes; Accuracy; Amino acids; Boosting; Clustering algorithms; Feature extraction; Proteins; Vectors; AdaBoost; Feature Extraction; LogitBoost; Oversampling; Protein fold classification; SMOTE; Unbalanced data;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on
  • Conference_Location
    Honolulu, HI
  • Type
    conf
  • DOI
    10.1109/CIBCB.2014.6845517
  • Filename
    6845517
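
The abstract describes rebalancing the extracted feature set with SMOTE before classification. The sketch below illustrates the general idea of SMOTE-style oversampling: synthetic minority points are created by interpolating between a minority sample and one of its nearest same-class neighbours. It is a simplified illustration, not the authors' implementation. The function names `smote` and `rebalance`, the default `k=5`, and the policy of oversampling every class up to the size of the largest class are assumptions made for this example.

```python
import random
from collections import Counter


def smote(samples, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between same-class neighbours.

    Hypothetical helper: a simplified SMOTE-style routine, not the paper's code.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples of a class to interpolate")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples of this class by squared Euclidean distance to base.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def rebalance(X, y, k=5, seed=0):
    """Oversample each class up to the size of the largest class (assumed policy)."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        if count < target:
            members = [x for x, lab in zip(X, y) if lab == label]
            new_points = smote(members, target - count, k=k, seed=seed)
            X_out.extend(new_points)
            y_out.extend([label] * len(new_points))
    return X_out, y_out
```

The rebalanced `X_out, y_out` could then be passed to any multi-class classifier, for example a decision tree or a boosted ensemble. Synthetic points always lie on the line segment between two existing samples of the same class, so the method adds no information beyond the observed feature ranges of that class.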