Title :
Protein remote homology detection based on latent topic vector model
Author :
Yeh, Jian-Hua ; Chen, Chun-Hsing
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Aletheia Univ., Taipei, Taiwan
Abstract :
Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods that incorporate the Support Vector Machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vector representations or kernel functions. In this paper, we focus on feature extraction and efficient representation of protein vectors for SVM protein classification. The experiment uses the protein database from Structural Classification of Proteins (SCOP) version 1.53 together with a latent topic extraction technique, the Latent Dirichlet Allocation (LDA) model, which is an efficient feature extraction technique from natural language processing. The basic building blocks of our model are word documents generated from protein sequences by N-gram segmentation and filtered by the TF-IDF method. The LDA phase is then applied to these documents for latent topic extraction, while the SVM acts as a classifier over the latent topics. In our experiment, the LDA-SVM model outperforms the LSA-SVM model from previous research.
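The document-building step described in the abstract (N-gram segmentation followed by TF-IDF filtering) can be sketched as below. This is a minimal illustration, not the authors' implementation: the overlapping 3-gram window, the per-document `keep` cutoff, the function names `ngram_words` and `tfidf_filter`, and the toy sequences are all assumptions made for the example. The LDA and SVM stages are omitted.

```python
import math
from collections import Counter


def ngram_words(sequence, n=3):
    """Split a protein sequence into overlapping n-gram 'words' (n=3 is an assumption)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def tfidf_filter(documents, keep=5):
    """Keep, per document, the `keep` distinct words with the highest TF-IDF score.

    The cutoff rule is hypothetical; the abstract does not state how filtering is done.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    filtered = []
    for doc in documents:
        counts = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        top = set(sorted(scores, key=lambda w: (-scores[w], w))[:keep])
        filtered.append([word for word in doc if word in top])
    return filtered


# Toy sequences, invented for illustration only (not SCOP data).
sequences = ["MKVLAAGIVGL", "MKVLSAGIVAL", "GSHMTTPSLKE"]
documents = [ngram_words(s) for s in sequences]
filtered = tfidf_filter(documents)
```

Each filtered word list would then serve as one "document" for topic extraction, and the resulting per-sequence topic vectors would be the input features for the classifier.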
Keywords :
biology computing; boundary-value problems; document handling; feature extraction; natural language processing; pattern classification; support vector machines; Latent Dirichlet Allocation model; N-gram segmentation; SVM protein classification; TF-IDF method; computational biology; feature extraction; feature vector representations; kernel functions; latent topic extraction; latent topic vector model; protein remote homology detection; protein sequence; proteins structural classification; support vector machine; Biological system modeling; Computational biology; Feature extraction; Kernel; Natural language processing; Proteins; Sequences; Spatial databases; Support vector machine classification; Support vector machines; Latent Dirichlet Allocation; Support Vector Machine; latent topic; protein sequence; remote homology;
Conference_Title :
Networking and Information Technology (ICNIT), 2010 International Conference on
Conference_Location :
Manila
Print_ISBN :
978-1-4244-7579-7
Electronic_ISBN :
978-1-4244-7578-0
DOI :
10.1109/ICNIT.2010.5508474