DocumentCode
22856
Title
Evolved Features for DNA Sequence Classification and Their Fitness Landscapes
Author
Ashlock, Wendy ; Datta, Soupayan
Author_Institution
Department of Computer Science and Engineering, York University, Toronto, Canada
Volume
17
Issue
2
fYear
2013
fDate
Apr-13
Firstpage
185
Lastpage
197
Abstract
A key problem in genomics is the classification and annotation of sequences in a genome. A major challenge is identifying good sequence features. Evolutionary algorithms have the potential to search a large space of features and automatically generate useful ones. This paper proposes a two-stage method that generates features using multiple replicates of a genetic algorithm operating on an augmented finite state machine, called a side effect machine (SEM), and then selects a small diverse feature set using several methods, including a novel method called dissimilarity clustering. We apply our method to three problems related to transposable elements and compare the results to those using k-mer features. We are able to produce a small set of interesting and comprehensible features that create random forest classifiers more accurate and less prone to overfitting than those created using k-mer features. We analyze the SEM fitness landscapes and discuss the use of different fitness functions.
Keywords
Bioinformatics; DNA; Genetic algorithms; Genomics; Microwave integrated circuits; Training; Automatic feature generation; DNA sequence classification; clustering; fitness landscape; side effect machines (SEMs)
fLanguage
English
Journal_Title
Evolutionary Computation, IEEE Transactions on
Publisher
IEEE
ISSN
1089-778X
Type
jour
DOI
10.1109/TEVC.2012.2207120
Filename
6232454