DocumentCode :
659493
Title :
CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework
Author :
Chien-Chih Chen ; Yu-Jung Chang ; Wei-Chun Chung ; Der-Tsai Lee ; Jan-Ming Ho
Author_Institution :
Inst. of Inf. Sci. & Res. Center for Inf. Technol. Innovation, Acad. Sinica, Taipei, Taiwan
fYear :
2013
fDate :
6-9 Oct. 2013
Firstpage :
717
Lastpage :
722
Abstract :
Next-generation sequencing (NGS) technologies produce huge amounts of data. These data are unavoidably accompanied by sequencing errors, which constitute one of the major problems for downstream analyses. As the literature shows, error correction is a critical step for the success of NGS applications such as de novo genome assembly and DNA resequencing. However, it demands substantial computing time and memory. Designing an algorithm that improves data quality by efficiently using on-demand computing resources in the cloud is a challenge for biologists and computer scientists. In this study, we present CloudRS, an error-correction algorithm for NGS data. CloudRS aims to emulate the error-correction approach of ALLPATHS-LG on the Hadoop/MapReduce framework. It corrects sequencing errors conservatively to avoid introducing false corrections, e.g., when dealing with reads from repetitive regions. We also describe several probabilistic measures introduced into CloudRS to make the algorithm more efficient without sacrificing its effectiveness. Running times measured on up to 80 instances, each with 8 computing units, show satisfactory speedup. Comparisons with other error-correction programs show that CloudRS achieves a lower false positive rate on most evaluation benchmarks and higher sensitivity on the S. cerevisiae genome. We demonstrate that CloudRS significantly improves the quality of the resulting contigs on NGS de novo assembly benchmarks.
Keywords :
biology computing; cloud computing; error correction; ALLPATHS-LG; CloudRS; Hadoop/MapReduce framework; NGS technologies; biologists; computer scientists; data quality; error correction algorithm; high-throughput sequencing data; next-generation sequencing; on-demand computing resources; scalable framework; Algorithm design and analysis; Assembly; Benchmark testing; Bioinformatics; Genomics; Sequential analysis; genome assembly; mapreduce
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
Type :
conf
DOI :
10.1109/BigData.2013.6691642
Filename :
6691642