DocumentCode
2596120
Title
A Time Series Approach for Identification of Exons and Introns
Author
Gupta, Ravi ; Mittal, Ankush ; Singh, Kuldip ; Bajpai, Prateek ; Prakash, Suraj
Author_Institution
Indian Inst. of Technol. Roorkee, Uttarakhand
fYear
2007
fDate
17-20 Dec. 2007
Firstpage
91
Lastpage
93
Abstract
Classifying an organism's gene sequence into coding and non-coding regions is a challenging task in DNA sequence analysis. Classification algorithms rest on the basic assumption that every protein-coding region has distinct sequence features or properties that distinguish it from the surrounding regions, such as non-coding and intergenic regions. In this study, we present a novel and generic approach for the analysis of DNA sequences. A wavelet-based time series approach is proposed for extracting statistical information from DNA sequences. The extracted information captures the variance of the amino/keto, purine/pyrimidine, and weak/strong hydrogen bond distributions in a DNA sequence. This variance information is then used to construct a feature vector, and a pattern recognition framework is applied to classify exons and introns. An optimized support vector machine (SVM) classifier based on the novel features is constructed for accurate classification of DNA sequences. Experiments were performed on a dataset of Homo sapiens exons and introns, and a 10-fold cross-validation accuracy of 87.5% was achieved. Further tests were conducted on an unseen dataset of Homo sapiens exons and introns, and an accuracy of 88.95% was reported.
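The feature-extraction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the +1/-1 numeric coding, the choice of the Haar wavelet, the number of decomposition levels, and all function names (`encode`, `haar_details`, `wavelet_variance_features`) are assumptions made for illustration. The sketch maps a DNA string to three numeric series (purine/pyrimidine, amino/keto, weak/strong), decomposes each with a Haar wavelet, and uses the variance of the detail coefficients at each level as a feature.

```python
from statistics import pvariance

# Three standard binary groupings of nucleotides.
# The +1/-1 coding is an illustrative assumption, not taken from the paper.
GROUPINGS = {
    "purine_pyrimidine": set("AG"),  # purines A, G -> +1; pyrimidines C, T -> -1
    "amino_keto": set("AC"),         # amino A, C -> +1; keto G, T -> -1
    "weak_strong": set("AT"),        # weak A, T -> +1; strong C, G -> -1
}


def encode(seq, positive):
    """Map a DNA string to a +1/-1 numeric series for one grouping."""
    return [1.0 if base in positive else -1.0 for base in seq.upper()]


def haar_details(series, levels):
    """Return a list of Haar detail-coefficient lists, one per level."""
    approx = list(series)
    details = []
    for _ in range(levels):
        n = len(approx) - len(approx) % 2  # drop a trailing unpaired sample
        pairs = [(approx[i], approx[i + 1]) for i in range(0, n, 2)]
        details.append([(a - b) / 2 ** 0.5 for a, b in pairs])
        approx = [(a + b) / 2 ** 0.5 for a, b in pairs]
    return details


def wavelet_variance_features(seq, levels=3):
    """Build a feature vector of per-level detail variances for each grouping."""
    if len(seq) < 2 ** levels:
        raise ValueError("sequence is too short for the requested levels")
    features = []
    for positive in GROUPINGS.values():
        for coeffs in haar_details(encode(seq, positive), levels):
            features.append(pvariance(coeffs))
    return features


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call.
    print(wavelet_variance_features("ACGTTGCAACGTAGCTAGCT"))
```

With three groupings and three levels, the result is a nine-element feature vector that could then be passed to a classifier such as an SVM; the classifier and its optimization are outside the scope of this sketch.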
Keywords
DNA; biology computing; feature extraction; genetics; molecular biophysics; optimisation; pattern classification; proteins; sequences; statistical analysis; support vector machines; time series; wavelet transforms; DNA sequence analysis; exons-introns identification; feature vector; optimized support vector machine classifier; organism gene sequence classification; pattern recognition; protein coding region; statistical information extraction; wavelet based time series approach; Bonding; Classification algorithms; DNA; Data mining; Hydrogen; Organisms; Proteins; Sequences; Support vector machine classification; Support vector machines;
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Technology, (ICIT 2007). 10th International Conference on
Conference_Location
Orissa
Print_ISBN
0-7695-3068-0
Type
conf
DOI
10.1109/ICIT.2007.54
Filename
4418274