DocumentCode :
2415170
Title :
Scalable, updatable predictive models for sequence data
Author :
Koul, Neeraj ; Bui, Ngot ; Honavar, Vasant
Author_Institution :
Dept. of Comput. Sci., Iowa State Univ., Ames, IA, USA
fYear :
2010
fDate :
18-21 Dec. 2010
Firstpage :
681
Lastpage :
685
Abstract :
The emergence of data-rich domains has led to exponential growth in the size and number of data repositories, offering exciting opportunities to learn from the data using machine learning algorithms. In particular, sequence data is being made available at a rapid rate. In many applications, the learning algorithm may not have direct access to the entire dataset for reasons such as massive data size or bandwidth limitations. Such settings call for techniques that can learn predictive models (e.g., classifiers) from large datasets without direct access to the data. We describe an approach to learning from massive sequence datasets using statistical queries. Specifically, we show how Markov Models and Probabilistic Suffix Trees (PSTs) can be constructed from sequence databases that answer only a class of count queries. We analyze the query complexity (a measure of the number of queries needed) of constructing classifiers in such settings and outline some techniques to minimize it. We also show how some of the models can be updated in response to the addition or deletion of subsets of sequences from the underlying sequence database.
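The idea of building a Markov model from count queries alone can be illustrated with a minimal sketch. This is not the authors' implementation: the `CountOracle` class, its `count` method, and the helper functions are hypothetical names introduced here, and the in-memory list of strings merely stands in for a remote sequence database. The sketch assumes the database can report how often a given string occurs, estimates a k-th order model from the (k+1)-gram counts, and uses the fact that such counts are additive to handle added or deleted sequences.

```python
from itertools import product


class CountOracle:
    """Hypothetical stand-in for a sequence database that answers only count queries."""

    def __init__(self, sequences):
        self._sequences = list(sequences)
        self.queries = 0  # number of count queries issued so far

    def count(self, word):
        """Return the number of (overlapping) occurrences of `word` in the database."""
        self.queries += 1
        n = len(word)
        return sum(
            1
            for seq in self._sequences
            for i in range(len(seq) - n + 1)
            if seq[i:i + n] == word
        )


def gather_counts(oracle, alphabet, k):
    """Query the count of every (k+1)-gram; these counts act as sufficient statistics."""
    return {
        "".join(gram): oracle.count("".join(gram))
        for gram in product(alphabet, repeat=k + 1)
    }


def markov_model(counts, alphabet, k):
    """Turn (k+1)-gram counts into conditional probabilities P(symbol | context)."""
    model = {}
    for ctx in product(alphabet, repeat=k):
        context = "".join(ctx)
        total = sum(counts[context + a] for a in alphabet)
        if total:  # skip contexts that never occur with a successor
            model[context] = {a: counts[context + a] / total for a in alphabet}
    return model


def merge_counts(counts, delta, sign=1):
    """Update counts after sequences are added (sign=1) or deleted (sign=-1)."""
    return {gram: counts[gram] + sign * delta[gram] for gram in counts}


alphabet = "AB"
oracle = CountOracle(["ABAB", "ABB"])
counts = gather_counts(oracle, alphabet, k=1)
model = markov_model(counts, alphabet, k=1)
print(oracle.queries)   # one query per 2-gram over a 2-symbol alphabet
print(model["B"])       # estimated distribution of the symbol following "B"
```

In this naive form the number of queries is |alphabet|^(k+1), which is the kind of cost a query-complexity analysis would try to reduce. Because only counts are stored, an update needs the counts of just the added or removed sequences, which `merge_counts` folds into the existing statistics.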
Keywords :
Markov processes; bioinformatics; genetics; learning (artificial intelligence); pattern classification; query processing; trees (mathematics); Markov models; data repositories; data rich domains; machine learning algorithms; massive data size; massive sequence datasets; predictive models; probabilistic suffix trees; query complexity; scalable updatable predictive models; sequence data; sequence databases; statistical queries; Complexity theory; Computational modeling; Data models; Hidden Markov models; Mathematical model; Markov Model; PSTs; sufficient statistics
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
978-1-4244-8306-8
Electronic_ISBN :
978-1-4244-8307-5
Type :
conf
DOI :
10.1109/BIBM.2010.5706652
Filename :
5706652