Title :
Correcting Base-Assignment Errors in Repeat Regions of Shotgun Assembly
Author :
Zhi, Degui ; Keich, Uri ; Pevzner, Pavel ; Heber, Steffen ; Tang, Haixu
Author_Institution :
Bioinformatics Program, University of California, San Diego, La Jolla, CA
Abstract :
Accurate base-assignment in repeat regions of a whole genome shotgun assembly is an unsolved problem. Since reads in repeat regions cannot be easily attributed to a unique location in the genome, current assemblers may place these reads arbitrarily. As a result, the base-assignment error rate in repeats is likely to be much higher than that in the rest of the genome. We developed an iterative algorithm, EULER-AIR, that is able to correct base-assignment errors in finished genome sequences in public databases. The Wolbachia genome is among the best finished genomes. Using this genome project as an example, we demonstrated that EULER-AIR can 1) discover and correct base-assignment errors, 2) provide accurate read assignments, 3) utilize finishing reads for accurate base-assignment, and 4) provide guidance for designing finishing experiments. In the genome of Wolbachia, EULER-AIR found 16 positions with ambiguous base-assignment and two positions with erroneous bases. Besides Wolbachia, many other genome sequencing projects have significantly fewer finishing reads and, hence, are likely to contain more base-assignment errors in repeats. We demonstrate that EULER-AIR is a software tool that can be used to find and correct base-assignment errors in a genome assembly project.
Keywords :
biology computing; error correction; genetics; iterative methods; molecular biophysics; Wolbachia genome; accurate read assignments; base-assignment error correction; finished genome sequences; genome sequencing; iterative algorithm EULER-AIR; public databases; repeat regions; whole genome shotgun assembly; Assembly; Bioinformatics; DNA; Databases; Error analysis; Error correction; Finishing; Genomics; Sequences; Software tools; Fragment assembly; expectation maximization; finishing; Algorithms; Campylobacter jejuni; Cluster Analysis; Computational Biology; Genome, Bacterial; Lactococcus lactis; Models, Statistical; Repetitive Sequences, Nucleic Acid; Sequence Alignment; Sequence Analysis, DNA; Software; Staphylococcus epidermidis; Wolbachia
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
DOI :
10.1109/TCBB.2007.1005