DocumentCode :
22856
Title :
Evolved Features for DNA Sequence Classification and Their Fitness Landscapes
Author :
Ashlock, Wendy ; Datta, Soupayan
Author_Institution :
Department of Computer Science and Engineering, York University, Toronto, Canada
Volume :
17
Issue :
2
fYear :
2013
fDate :
Apr-13
Firstpage :
185
Lastpage :
197
Abstract :
A key problem in genomics is the classification and annotation of sequences in a genome. A major challenge is identifying good sequence features. Evolutionary algorithms have the potential to search a large space of features and automatically generate useful ones. This paper proposes a two-stage method that generates features using multiple replicates of a genetic algorithm operating on an augmented finite state machine, called a side effect machine (SEM), and then selects a small diverse feature set using several methods, including a novel method called dissimilarity clustering. We apply our method to three problems related to transposable elements and compare the results to those using k-mer features. We are able to produce a small set of interesting and comprehensible features that create random forest classifiers more accurate and less prone to overfitting than those created using k-mer features. We analyze the SEM fitness landscapes and discuss the use of different fitness functions.
Keywords :
Bioinformatics; DNA; Genetic algorithms; Genomics; Microwave integrated circuits; Training; Automatic feature generation; DNA sequence classification; clustering; fitness landscape; side effect machines (SEMs)
fLanguage :
English
Journal_Title :
Evolutionary Computation, IEEE Transactions on
Publisher :
IEEE
ISSN :
1089-778X
Type :
jour
DOI :
10.1109/TEVC.2012.2207120
Filename :
6232454