DocumentCode :
441997
Title :
A pattern-based SVM for protein remote homology detection
Author :
Dong, Qi-Wen ; Lin, Lei ; Wang, Xiao-long ; Li, Ming-Hui
Author_Institution :
Sch. of Comput. Sci. & Technol., Harbin Inst. of Technol., China
Volume :
6
fYear :
2005
fDate :
18-21 Aug. 2005
Firstpage :
3363
Abstract :
One key element in understanding the molecular machinery of the cell is to understand the structure and function of each protein encoded in the genome. A very successful way to infer the structure or function of a previously unannotated protein is through sequence homology with one or more proteins whose structure or function is already known. This paper presents a novel method for protein remote homology detection that applies text categorization techniques from natural language processing to protein classification. Patterns are discovered by the TEIRESIAS algorithm and can be viewed as the "words" of a "protein sequence language". The patterns are then filtered by an efficient feature selection method, the chi-square algorithm. Each protein sequence is mapped to a high-dimensional vector whose components are the numbers of occurrences of the selected patterns. This representation, combined with a discriminative classification algorithm known as the support vector machine (SVM), provides a powerful means for protein remote homology detection. The method, called SVM-pattern, is tested on the SCOP database and compared with other state-of-the-art methods. SVM-pattern performs better than the BLAST method and comparably to other SVM-based methods such as SVM-k-spectrum and SVM-pairwise.
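The feature-construction steps described in the abstract (pattern counting, chi-square filtering, vector mapping) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the patterns are hard-coded rather than discovered by TEIRESIAS, the toy sequences and all function names are hypothetical, the chi-square statistic is the standard 2x2 form used in text categorization and is assumed here to be computed on pattern presence versus class membership, and the SVM training step is omitted.

```python
import re


def count_occurrences(pattern, sequence):
    """Count matches of a pattern in a sequence, including overlapping ones.

    Assumption: patterns use '.' as a single-residue wildcard, so they can be
    evaluated directly as regular expressions.
    """
    # Wrapping the pattern in a lookahead lets finditer report overlapping matches.
    return sum(1 for _ in re.finditer("(?=" + pattern + ")", sequence))


def chi_square(pattern, sequences, labels):
    """2x2 chi-square statistic between pattern presence and class membership."""
    a = b = c = d = 0
    for seq, label in zip(sequences, labels):
        present = count_occurrences(pattern, seq) > 0
        if present and label:
            a += 1  # pattern present, positive class
        elif present:
            b += 1  # pattern present, negative class
        elif label:
            c += 1  # pattern absent, positive class
        else:
            d += 1  # pattern absent, negative class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_patterns(patterns, sequences, labels, k):
    """Keep the k patterns with the highest chi-square score."""
    ranked = sorted(patterns,
                    key=lambda p: chi_square(p, sequences, labels),
                    reverse=True)
    return ranked[:k]


def to_vector(sequence, selected):
    """Map a sequence to a vector of occurrence counts of the selected patterns."""
    return [count_occurrences(p, sequence) for p in selected]


# Toy data (hypothetical): two family members (label 1) and two non-members (label 0).
sequences = ["ACDAKD", "GACDA", "TTGH", "GHTT"]
labels = [1, 1, 0, 0]
candidates = ["A.D", "G", "K"]

selected = select_patterns(candidates, sequences, labels, k=1)
vectors = [to_vector(s, selected) for s in sequences]
```

The resulting vectors would then be passed, together with the labels, to an SVM trainer of the reader's choice.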
Keywords :
biology computing; data mining; pattern classification; proteins; support vector machines; text analysis; SVM; chi-square algorithm; discriminative classification algorithm; feature selection; genome; molecular machinery; natural language processing; pattern discovery; protein classification; protein sequence language; remote homology detection; support vector machine; text categorization; Bioinformatics; Classification algorithms; Genomics; Machinery; Natural language processing; Protein sequence; Support vector machine classification; Support vector machines; Testing; Text categorization; Protein; pattern; remote homology; text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
Conference_Location :
Guangzhou, China
Print_ISBN :
0-7803-9091-1
Type :
conf
DOI :
10.1109/ICMLC.2005.1527523
Filename :
1527523