• DocumentCode
    2596120
  • Title

    A Time Series Approach for Identification of Exons and Introns

  • Author

    Gupta, Ravi ; Mittal, Ankush ; Singh, Kuldip ; Bajpai, Prateek ; Prakash, Suraj

  • Author_Institution
    Indian Inst. of Technol. Roorkee, Uttarakhand
  • fYear
    2007
  • fDate
    17-20 Dec. 2007
  • Firstpage
    91
  • Lastpage
    93
  • Abstract
    The classification of an organism's gene sequence into coding and non-coding regions is a challenging task in DNA sequence analysis. Classification algorithms operate on the basic assumption that every protein-coding region has distinct sequence features or properties that distinguish it from the surrounding regions, such as non-coding and intergenic regions. In this study, we present a novel and generic approach for the analysis of DNA sequences. A wavelet-based time series approach is proposed for extracting statistical information from DNA sequences. The extracted information contains the variance of the amino/keto, purine/pyrimidine, and weak/strong hydrogen bond distributions in a DNA sequence. This variance information is then used to construct a feature vector, and a pattern recognition framework is applied to classify exons and introns. An optimized support vector machine (SVM) classifier based on the novel features is constructed for accurate classification of DNA sequences. Experiments were performed on a dataset of Homo sapiens exons and introns, and a 10-fold cross-validation accuracy of 87.5% was achieved. Further tests were conducted on an unseen dataset of Homo sapiens exons and introns, and an accuracy of 88.95% was reported.
  • Keywords
    DNA; biology computing; feature extraction; genetics; molecular biophysics; optimisation; pattern classification; proteins; sequences; statistical analysis; support vector machines; time series; wavelet transforms; DNA sequence analysis; exons-introns identification; feature vector; optimized support vector machine classifier; organism gene sequence classification; pattern recognition; protein coding region; statistical information extraction; wavelet based time series approach; Bonding; Classification algorithms; DNA; Data mining; Hydrogen; Organisms; Proteins; Sequences; Support vector machine classification; Support vector machines;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Technology, (ICIT 2007). 10th International Conference on
  • Conference_Location
    Orissa
  • Print_ISBN
    0-7695-3068-0
  • Type

    conf

  • DOI
    10.1109/ICIT.2007.54
  • Filename
    4418274
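
The feature extraction outlined in the abstract can be illustrated with a minimal sketch. The abstract does not name the wavelet family, the number of decomposition levels, or the numeric encoding, so the Haar wavelet, the three levels, the +1/-1 mapping, and every function name below are assumptions made for illustration and are not the authors' implementation. The sketch maps a DNA string to three binary series (amino/keto, purine/pyrimidine, weak/strong), computes the variance of the wavelet detail coefficients at each level, and concatenates the results into one feature vector.

```python
from statistics import pvariance

# Nucleotide groupings named in the abstract. The listed bases map to +1,
# all other characters map to -1 (assumed encoding; input is expected to
# contain only A, C, G, T).
GROUPINGS = {
    "amino_keto": set("AC"),         # amino (A, C) vs keto (G, T)
    "purine_pyrimidine": set("AG"),  # purine (A, G) vs pyrimidine (C, T)
    "weak_strong": set("AT"),        # weak (A, T) vs strong (C, G) bonding
}


def to_series(seq, positive):
    """Map a DNA string to a +1/-1 numeric series."""
    return [1.0 if base in positive else -1.0 for base in seq.upper()]


def haar_detail_variances(series, levels):
    """Variance of Haar detail coefficients per level (Haar is an assumption)."""
    approx = list(series)
    variances = []
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:          # drop a trailing sample to form pairs
            approx = approx[:-1]
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / 2 ** 0.5 for a, b in pairs]
        approx = [(a + b) / 2 ** 0.5 for a, b in pairs]
        variances.append(pvariance(detail))
    # Pad so that short sequences still yield a fixed-length result.
    variances += [0.0] * (levels - len(variances))
    return variances


def feature_vector(seq, levels=3):
    """Concatenate per-level variances for the three groupings."""
    features = []
    for positive in GROUPINGS.values():
        features.extend(haar_detail_variances(to_series(seq, positive), levels))
    return features


if __name__ == "__main__":
    print(feature_vector("ATGGCGTACGCTTAGGCTAACGTTAGC"))
```

With three groupings and three levels, each sequence yields a nine-element vector. Vectors of this kind could then be passed to an SVM classifier of the reader's choice; the kernel and parameter optimization mentioned in the abstract are not specified there and are left out of this sketch.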