Title :
A Lazy Data Mining Approach for Protein Classification
Author :
Merschmann, Luiz ; Plastino, Alexandre
Author_Institution :
Dept. of Comput. Sci., Univ. Fed. Fluminense, Niteroi
fDate :
3/1/2007
Abstract :
In this work, we propose a new data mining technique for the protein classification problem, where the goal is to predict the functional family of novel protein sequences based on their motif composition. The technique, called highest subset probability (HiSP), is based on Bayes' theorem and aims to improve on the results obtained with other known approaches. To evaluate our proposal, we use datasets extracted from Prosite, a curated protein family database. The computational results show that the proposed method outperforms other known methods on all tested datasets and looks very promising for problems with characteristics similar to the one addressed here. In addition, our experiments suggest that HiSP performs well on highly imbalanced datasets.
Keywords :
Bayes methods; biology computing; data mining; molecular biophysics; probability; proteins; Bayes theorem; Prosite; curated protein family database; highest subset probability; lazy data mining; motif composition; protein classification; protein sequences; Amino acids; Computer science; Databases; Decision trees; Information resources; Learning automata; Learning systems; Protein sequence; Testing; lazy learning; Algorithms; Amino Acid Sequence; Database Management Systems; Databases, Protein; Information Storage and Retrieval; Molecular Sequence Data; Sequence Alignment; Sequence Analysis, Protein; Sequence Homology, Amino Acid
Journal_Title :
NanoBioscience, IEEE Transactions on
DOI :
10.1109/TNB.2007.891910