Title :
Abstraction Augmented Markov Models
Author :
Caragea, Cornelia ; Silvescu, Adrian ; Caragea, Doina ; Honavar, Vasant
Author_Institution :
Comput. Sci., Iowa State Univ., Ames, IA, USA
Abstract :
High-accuracy sequence classification often requires the use of higher-order Markov models (MMs). However, the number of MM parameters increases exponentially with the range of direct dependencies between sequence elements, thereby increasing the risk of overfitting when the data set is limited in size. We present abstraction augmented Markov models (AAMMs) that effectively reduce the number of numeric parameters of kth-order MMs by successively grouping strings of length k (i.e., k-grams) into abstraction hierarchies. We evaluate AAMMs on three protein subcellular localization prediction tasks. The results of our experiments show that abstraction makes it possible to construct predictive models that use a significantly smaller number of features (by one to three orders of magnitude) than MMs. AAMMs are competitive with and, in some cases, significantly outperform MMs. Moreover, the results show that AAMMs often perform significantly better than variable-order Markov models, such as decomposed context tree weighting, prediction by partial match, and probabilistic suffix trees.
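As a rough illustration of the exponential growth mentioned in the abstract, the following minimal sketch counts the free parameters of a kth-order MM over a 20-letter amino-acid alphabet and compares that with a model whose k-gram contexts have been grouped into a fixed number of abstractions. The function names, the choice of 100 abstractions, and the simple counting formula are assumptions made for illustration only; they are not taken from the paper and ignore the cost of storing the k-gram-to-abstraction mapping.

```python
def markov_param_count(alphabet_size: int, k: int) -> int:
    """Free parameters of a k-th order Markov model (hypothetical helper).

    Assumes one conditional distribution per k-gram context, each with
    alphabet_size - 1 free values.
    """
    return alphabet_size ** k * (alphabet_size - 1)


def abstracted_param_count(alphabet_size: int, num_abstractions: int) -> int:
    """Free parameters when contexts are grouped into abstractions.

    Assumes one conditional distribution per abstraction (an illustrative
    simplification, not the paper's exact parameterization).
    """
    return num_abstractions * (alphabet_size - 1)


if __name__ == "__main__":
    alphabet_size = 20  # standard amino-acid alphabet
    for k in range(1, 4):
        print(k, markov_param_count(alphabet_size, k))
    # 100 abstractions is an arbitrary value chosen for illustration.
    print("abstracted:", abstracted_param_count(alphabet_size, 100))
```

Under these assumptions, a third-order model has 152,000 free parameters, while grouping its 8,000 contexts into 100 abstractions leaves 1,900, a reduction of roughly two orders of magnitude.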
Keywords :
Markov processes; bioinformatics; cellular biophysics; feature extraction; pattern classification; prediction theory; proteins; MM parameter; abstraction augmented Markov model; abstraction hierarchy; decomposed context tree weighting; numeric parameter; predictive model; probabilistic suffix tree; protein subcellular localization prediction task; sequence classification; Markov models; abstraction
Conference_Title :
Data Mining (ICDM), 2010 IEEE 10th International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-9131-5
Electronic_ISBN :
1550-4786
DOI :
10.1109/ICDM.2010.158