DocumentCode :
2710969
Title :
RBNBC: Repeat Based Naive Bayes Classifier for Biological Sequences
Author :
Rani, Pratibha ; Pudi, Vikram
Author_Institution :
Center for Data Eng., IIIT Hyderabad, Hyderabad
fYear :
2008
fDate :
15-19 Dec. 2008
Firstpage :
989
Lastpage :
994
Abstract :
In this paper, we present RBNBC, a repeat based Naive Bayes classifier of bio-sequences that uses maximal frequent subsequences as features. RBNBC's design is based on generic ideas that can apply to other domains where the data is organized as collections of sequences. Specifically, RBNBC uses a novel formulation of Naive Bayes that incorporates repeated occurrences of subsequences within each sequence. Our extensive experiments on two collections of protein families show that it performs as well as existing state-of-the-art probabilistic classifiers for bio-sequences. This is surprising as it is a pure data mining based generic classifier that does not require domain-specific background knowledge. We note that domain-specific ideas could further increase its performance.
Keywords :
Bayes methods; biology computing; data mining; pattern classification; biological sequences; data mining; domain-specific background knowledge; generic classifier; repeat based Naive Bayes classifier; state-of-the-art probabilistic classifiers; Bayesian methods; Data engineering; Data mining; Entropy; Feature extraction; Frequency estimation; Optimization methods; Proteins; Spatial databases; Support vector machines; Biological Sequence; Classification; Data Mining; Naive Bayes;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
Conference_Location :
Pisa
ISSN :
1550-4786
Print_ISBN :
978-0-7695-3502-9
Type :
conf
DOI :
10.1109/ICDM.2008.66
Filename :
4781213