DocumentCode
2379404
Title
In search of true reads: A classification approach to next generation sequencing data selection
Author
Wijaya, Edward ; Pessiot, Jean-François ; Frith, Martin C. ; Fujibuchi, Wataru ; Asai, Kiyoshi ; Horton, Paul
fYear
2010
fDate
18 Dec. 2010
Firstpage
561
Lastpage
566
Abstract
Next generation sequencing (NGS) technology has increasingly become the backbone of transcriptomics analysis, but sequencer errors bias the read counts. In this paper we establish a framework for predicting true sequences from NGS data. We formulate this task as a classification problem. We define several features, such as the log-likelihood ratio of estimated true counts, the error probability, and the observed count of the reads. Using a Support Vector Machine (SVM) classifier, we show that on simulated reads these features achieve 96.35% classification accuracy in discriminating true sequences. Using this framework, we provide a way for users to select sequences with a desired precision and recall for their analysis. The feature generation software and the simulated data set can be obtained from http://seq.cbrc.jp/NGSFeatGen.
Keywords
DNA; bioinformatics; data analysis; error analysis; feature extraction; maximum likelihood estimation; molecular biophysics; pattern classification; support vector machines; DNA sequencing; SVM classifier; data classification; error probability; estimated true counts; feature generation software; log likelihood ratio; next generation sequencing technology; read counts; sequence prediction; support vector machine; transcriptomics analysis; Illumina; Solexa; classification; expectation maximization; next generation sequencing; transcriptomics
fLanguage
English
Publisher
IEEE
Conference_Titel
Bioinformatics and Biomedicine Workshops (BIBMW), 2010 IEEE International Conference on
Conference_Location
Hong Kong
Print_ISBN
978-1-4244-8303-7
Electronic_ISBN
978-1-4244-8304-4
Type
conf
DOI
10.1109/BIBMW.2010.5703862
Filename
5703862