DocumentCode :
2737411
Title :
Poster: Haplotype discovery from high-throughput sequencing data
Author :
Mangul, Serghei ; Zelikovsky, Alex
Author_Institution :
Comput. Sci. Dept., Georgia State Univ., Atlanta, GA, USA
fYear :
2011
fDate :
3-5 Feb. 2011
Firstpage :
255
Lastpage :
255
Abstract :
The problem of haplotype discovery from high-throughput sequencing data arises in several settings where sequencing machines are used to recover all transcripts or bacteria in a given sample (transcriptome and bacteriome reconstruction), as well as in the assembly of viral quasispecies and the estimation of their frequencies. The standard model includes an instance of a sequencing machine run, consisting of a set of reads with observed frequencies, and a (prior) panel consisting of (1) a set of candidate strings that are believed to emit the observed reads and (2) a weighted match between reads and strings, where each weight is calculated from the mapping of the reads to the strings. The spectrum reconstruction problem asks for the most likely string frequencies explaining the emission of the observed reads, assuming that reads are emitted uniformly at random by strings according to the panel. The possible gaps in the standard model include (a) erroneous reads (caused by genotyping errors), (b) an incorrect list of candidate strings (absence of candidates caused by gaps in current databases and presence of chimeric candidates), (c) an inaccurate read-to-string match and, finally, (d) non-uniform emission of reads by strings. Since genotyping quality is improving, we focus on the incompleteness of the panel, i.e., of the existing list of candidate strings. The method presented here, referred to as HAPDIS, assumes that reads are emitted uniformly but allows for unknown strings. The model quality is measured by the deviation between expected and observed read frequencies. Based on the model quality, it is possible to decide whether the panel is likely to be incomplete. Subsequently, the total frequency of strings missing from the panel is estimated by adding a virtual string whose weighted matches with reads are determined by the deficit or excess of observed over expected read frequencies. Based on the weighted matches between the virtual string and the reads, we show that it is possible to extract the spectrum of reads (reads and their frequencies) that are likely to be emitted by the missing strings.
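The workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the HAPDIS implementation, which the abstract does not specify. It assumes a plain expectation-maximization update for the string frequencies, an absolute-difference deviation measure, and a virtual string whose weights use only the deficit (not the excess) of observed over expected read frequencies. All function names are hypothetical.

```python
def normalize_rows(weights):
    """Turn each string's match weights into emission probabilities over reads."""
    rows = []
    for row in weights:
        total = sum(row)
        if total <= 0:
            raise ValueError("each string must match at least one read")
        rows.append([w / total for w in row])
    return rows


def em_frequencies(weights, observed, iterations=200):
    """Estimate string frequencies by EM (assumed update rule, not from the source)."""
    emit = normalize_rows(weights)
    n_strings, n_reads = len(emit), len(observed)
    freqs = [1.0 / n_strings] * n_strings
    for _ in range(iterations):
        credit = [0.0] * n_strings
        for j in range(n_reads):
            denom = sum(freqs[i] * emit[i][j] for i in range(n_strings))
            if denom == 0.0:
                continue  # read not explained by any string in the panel
            for i in range(n_strings):
                # E-step: share the observed count of read j among the strings
                credit[i] += observed[j] * freqs[i] * emit[i][j] / denom
        total = sum(credit)
        if total == 0.0:
            raise ValueError("no read is explained by the panel")
        freqs = [c / total for c in credit]  # M-step: renormalize
    return freqs


def expected_reads(weights, freqs):
    """Expected read frequencies under the panel and the string frequencies."""
    emit = normalize_rows(weights)
    n_reads = len(emit[0])
    return [sum(f * row[j] for f, row in zip(freqs, emit)) for j in range(n_reads)]


def deviation(observed, expected):
    """Sum of absolute differences between observed and expected read frequencies."""
    total = sum(observed)
    return sum(abs(o / total - e) for o, e in zip(observed, expected))


def virtual_string_weights(observed, expected):
    """Assumed rule: weight of read j is its deficit, max(observed - expected, 0)."""
    total = sum(observed)
    return [max(o / total - e, 0.0) for o, e in zip(observed, expected)]
```

In this sketch, a large deviation after the first EM run suggests an incomplete panel. Appending the virtual row to the weight matrix and re-running `em_frequencies` yields, in the last entry of the result, an estimate of the total frequency of the missing strings, and the virtual row itself indicates which reads those strings are likely to emit.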
Keywords :
biology computing; cellular biophysics; genetics; microorganisms; molecular biophysics; molecular configurations; HAPDIS; bacteria; bacteriome reconstruction; candidate strings; erroneous reads; genotyping errors; haplotype discovery; high-throughput sequencing data; nonuniform read emitting; sequencing machine; spectrum reconstruction; transcriptome; viral frequencies; viral quasispecies; Assembly; Bioinformatics; Data models; Databases; Frequency estimation; Splicing; 454 pyrosequencing; RNA-Seq; alternative splicing; expectation maximization; maximum likelihood;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Advances in Bio and Medical Sciences (ICCABS), 2011 IEEE 1st International Conference on
Conference_Location :
Orlando, FL
Print_ISBN :
978-1-61284-851-8
Type :
conf
DOI :
10.1109/ICCABS.2011.5729908
Filename :
5729908