Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Asia Univ., Taichung, Taiwan
Abstract :
In this study, we proposed a novel vector space model approach to virus classification, as an alternative to traditional approaches, using two types of genome sequences: DNA and CDS. For DNA sequences, the k-mer approach was adopted for pattern extraction, and the entropy of the pattern frequency distribution among classes was used for pattern weighting. For CDS sequences, pattern extraction was based on identifying distinctive protein functions, which were formed by CDS clustering, and a weighting method similar to the tf * idf approach was proposed for these protein functions. The experimental data were downloaded from NCBI and comprised 1,877 viruses selected from 35 classes (virus families). With an SVM classifier, the highest classification accuracies were 94.7% and 91.3% for DNA and CDS sequences, respectively. This study not only proposed a novel approach to virus classification but also provided a new methodology for comparative genomic analysis.
Keywords :
DNA; biology computing; cellular biophysics; genomics; microorganisms; molecular biophysics; physiological models; proteins; support vector machines; CDS clustering; DNA sequence; SVM classifier; classification accuracy; comparative genomic analysis; genome sequences; k-mer approach; pattern extraction; pattern frequency distribution; pattern weighting; protein functions; vector space model; virus classification; Accuracy; Bioinformatics; Encoding; Genomics; Vectors; Viruses (medical); Comparative genomics; genome sequence
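Illustrative sketch :
The DNA branch described in the abstract (k-mer pattern extraction followed by entropy-based weighting across classes) might look roughly like the following. This is a minimal sketch, not the authors' implementation: the abstract does not give the weighting formula, so the weight `1 - H / log(C)` (with `H` the Shannon entropy of a k-mer's normalized frequency across the `C` classes) is an assumption, and the function names `kmer_counts`, `entropy_weights`, and `weighted_vector` are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def entropy_weights(class_to_seqs, k):
    """Weight each k-mer by how unevenly it is distributed among classes.

    ASSUMPTION: weight = 1 - H / log(C). The source only says that the
    entropy of the pattern frequency distribution among classes is used.
    """
    n_classes = len(class_to_seqs)

    # Relative frequency of each k-mer within each class.
    class_freqs = {}
    for label, seqs in class_to_seqs.items():
        total = Counter()
        for s in seqs:
            total.update(kmer_counts(s, k))
        n = sum(total.values())
        class_freqs[label] = {km: c / n for km, c in total.items()} if n else {}

    vocab = set().union(*class_freqs.values())
    weights = {}
    for km in vocab:
        freqs = [class_freqs[label].get(km, 0.0) for label in class_to_seqs]
        z = sum(freqs)
        probs = [f / z for f in freqs if f > 0]
        h = -sum(p * math.log(p) for p in probs)
        # Low entropy (k-mer concentrated in few classes) gives a high weight.
        weights[km] = 1.0 - h / math.log(n_classes) if n_classes > 1 else 1.0
    return weights


def weighted_vector(seq, k, weights, vocab):
    """Map a sequence to a weighted k-mer count vector over a fixed vocabulary."""
    counts = kmer_counts(seq, k)
    return [counts.get(km, 0) * weights.get(km, 0.0) for km in vocab]
```

Vectors built this way could then be passed to any SVM implementation. A k-mer that occurs equally often in every class receives a weight near 0 under this assumed formula, while a k-mer found in only one class receives a weight of 1.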