Title :
Efficient algorithms for error correction and compression of NGS data
Author :
Saha, Simanto ; Rajasekaran, Sanguthevar
Author_Institution :
Dept. of CSE, Univ. of Connecticut, Storrs, CT, USA
Abstract :
Summary form only given. In this talk we present our algorithms for two important problems in processing next generation sequencing (NGS) data: compression and error correction. We live in an era of data explosion. As an example, NCBI houses petabytes of genomic data, and biologists around the world are generating 15 petabases of sequence per year. The size of metagenomic data from multiple samples could be petabytes. Biologists want to store these datasets for several reasons, but standard compression algorithms fail to do a good job on them. Several approaches for compressing genomic data have been proposed in the literature. These approaches differ based on the particular type of data being compressed. Some example types are genomic data (with and without a reference), reads data (FASTA files), and FASTQ files. We have come up with algorithms for genomic data (with a reference) and FASTA files (without a reference) that perform better than some of the best known algorithms for these two versions. In NGS technology, the chances of low read coverage in some regions of the sequences are very high. The reads are short and very large in number. Due to erroneous base calling, there could be errors in the reads. As a consequence, sequence assemblers often fail to sequence an entire DNA molecule and instead output a set of overlapping segments that together represent a consensus region of the DNA. The error rate of the reads can be reduced by trimming and by correcting the erroneous bases of the reads. Doing so helps to achieve high quality data, and the computational complexity of many biological applications is greatly reduced if the reads are first corrected. We have developed a novel error correcting algorithm called EC and compared it with three other well-known algorithms using both real and simulated reads. Our extensive and rigorous experiments reveal that EC is indeed an effective and efficient error correction tool.
Keywords :
DNA; biology computing; computational complexity; data compression; error analysis; genetics; genomics; DNA molecule; EC; FASTA files; FASTQ files; NCBI houses; NGS data compression; NGS technology; biological applications; biologists; consensus region; data explosion; data quality; erroneous base calling; error correcting algorithm; error correction tool; error rate; genomic data compression; metagenomic data size; next generation sequencing data; overlapping segments; petabytes; sequence assemblers; standard compression algorithms; Algorithm design and analysis; Bioinformatics; Educational institutions; Error correction; Genomics
Conference_Title :
Computational Advances in Bio and Medical Sciences (ICCABS), 2014 IEEE 4th International Conference on
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4799-5786-6
DOI :
10.1109/ICCABS.2014.6863941