• DocumentCode
    22856
  • Title

    Evolved Features for DNA Sequence Classification and Their Fitness Landscapes

  • Author

    Ashlock, Wendy ; Datta, Soupayan

  • Author_Institution
    Department of Computer Science and Engineering, York University, Toronto, Canada
  • Volume
    17
  • Issue
    2
  • fYear
    2013
  • fDate
    April 2013
  • Firstpage
    185
  • Lastpage
    197
  • Abstract
    A key problem in genomics is the classification and annotation of sequences in a genome. A major challenge is identifying good sequence features. Evolutionary algorithms have the potential to search a large space of features and automatically generate useful ones. This paper proposes a two-stage method that generates features using multiple replicates of a genetic algorithm operating on an augmented finite state machine, called a side effect machine (SEM), and then selects a small diverse feature set using several methods, including a novel method called dissimilarity clustering. We apply our method to three problems related to transposable elements and compare the results to those using k-mer features. We are able to produce a small set of interesting and comprehensible features that create random forest classifiers more accurate and less prone to overfitting than those created using k-mer features. We analyze the SEM fitness landscapes and discuss the use of different fitness functions.
  • Keywords
    Bioinformatics; DNA; Genetic algorithms; Genomics; Microwave integrated circuits; Training; Automatic feature generation; DNA sequence classification; clustering; fitness landscape; side effect machines (SEMs)
  • fLanguage
    English
  • Journal_Title
    Evolutionary Computation, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1089-778X
  • Type

    jour

  • DOI
    10.1109/TEVC.2012.2207120
  • Filename
    6232454