• DocumentCode
    2379404
  • Title
    In search of true reads: A classification approach to next generation sequencing data selection

  • Author
    Wijaya, Edward ; Pessiot, Jean-François ; Frith, Martin C. ; Fujibuchi, Wataru ; Asai, Kiyoshi ; Horton, Paul

  • fYear
    2010
  • fDate
    18 Dec. 2010
  • Firstpage
    561
  • Lastpage
    566
  • Abstract
    Next generation sequencing (NGS) technology has increasingly become the backbone of transcriptomics analysis, but sequencer errors bias the read counts. In this paper we establish a framework for predicting true sequences from NGS data. We formulate this task as a classification problem. We define several features, such as the log likelihood ratio of estimated true counts, the error probability, and the observed count of the reads. Using a Support Vector Machine (SVM) classifier, we show that on simulated reads these features achieve 96.35% classification accuracy in discriminating true sequences. Using this framework, we provide a way for users to select sequences with a desired precision and recall for their analysis. The feature generation software and the simulated data set can be obtained from http://seq.cbrc.jp/NGSFeatGen.
  • Keywords
    DNA; bioinformatics; data analysis; error analysis; feature extraction; maximum likelihood estimation; molecular biophysics; pattern classification; support vector machines; DNA sequencing; SVM classifier; data classification; error probability; estimated true counts; feature generation software; log likelihood ratio; next generation sequencing technology; read counts; sequence prediction; support vector machine; transcriptomics analysis; Illumina; Solexa; classification; expectation maximization; next generation sequencing; transcriptomics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine Workshops (BIBMW), 2010 IEEE International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    978-1-4244-8303-7
  • Electronic_ISBN
    978-1-4244-8304-4
  • Type
    conf
  • DOI
    10.1109/BIBMW.2010.5703862
  • Filename
    5703862