In search of perfect reads

Author

Pal, Soumitra ; Aluru, Srinivas

Author_Institution

Dept. of Comput. Sci. & Eng., Indian Inst. of Technol. Bombay, Mumbai, India

fYear

2014

fDate

2-4 June 2014

Firstpage

1

Lastpage

2

Abstract

Continued advances in next-generation short-read sequencing technologies are increasing throughput and read lengths while driving down error rates, for example to within 1% for Illumina HiSeq reads. Moreover, the errors are not uniformly distributed across reads, and a large percentage of reads are in fact error-free. The ability to predict such perfect reads can have a significant impact on the run-time complexity of applications. In this paper, we present a simple and fast method based on k-spectrum analysis to identify error-free reads. Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with more than 90% precision. Though filtering out the reads identified as erroneous by our method reduces the coverage by about 7% on average, the coverage pattern across the genome remains similar. The filtration process can be customized at several levels of stringency depending on the needs of the downstream application.
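
The abstract does not spell out how the k-spectrum analysis works, so the following is only a minimal sketch of one common way such a filter can be built, not the authors' method. It assumes that a read is predicted perfect when every k-mer it contains occurs at least a chosen number of times across the dataset. The function names, the parameter `min_count`, and the toy reads are all hypothetical; `min_count` stands in for the adjustable stringency mentioned above.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_k_spectrum(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def predict_perfect(reads, k, min_count):
    """Return one boolean per read: True if the read is predicted error-free.

    Assumption (not from the paper): a read is predicted perfect when all of
    its k-mers occur at least `min_count` times in the dataset. Reads shorter
    than k contain no k-mers and are conservatively marked as not perfect.
    """
    spectrum = build_k_spectrum(reads, k)
    return [
        len(read) >= k and all(spectrum[km] >= min_count for km in kmers(read, k))
        for read in reads
    ]


def precision_and_recall(predicted, truth):
    """Precision and recall of the 'perfect' prediction against known labels."""
    true_positives = sum(p and t for p, t in zip(predicted, truth))
    predicted_positives = sum(predicted)
    actual_positives = sum(truth)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Toy data: three identical reads and one read with a single substitution.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
print(predict_perfect(reads, k=3, min_count=2))  # [True, True, True, False]
```

In this sketch, raising `min_count` makes the filter stricter: fewer reads are kept, which would tend to raise precision at the cost of the fraction of perfect reads retained. The "retains almost 99.9%" figure in the abstract corresponds to recall as computed by `precision_and_recall`, which requires ground-truth labels such as those available for simulated reads.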

Keywords

error analysis; filtration; genomics; error-free read identification; fast k-spectrum analysis based method; filtration process; next generation short-read sequencing technologies; Accuracy; Bioinformatics; Error correction; Next generation networking; Prediction algorithms; Sequential analysis; Next generation sequencing

fLanguage

English

Publisher

IEEE

Conference_Title

Computational Advances in Bio and Medical Sciences (ICCABS), 2014 IEEE 4th International Conference on

Conference_Location

Miami, FL

Print_ISBN

978-1-4799-5786-6

Type

conf

DOI

10.1109/ICCABS.2014.6863919

Filename

6863919