Regional Information Center for Science and Technology - Poster: Haplotype discovery from high-throughput sequencing data

DocumentCode :

2737411

Title :

Poster: Haplotype discovery from high-throughput sequencing data

Author :

Mangul, Serghei ; Zelikovsky, Alex

Author_Institution :

Comput. Sci. Dept., Georgia State Univ., Atlanta, GA, USA

fYear :

2011

fDate :

3-5 Feb. 2011

Firstpage :

255

Lastpage :

255

Abstract :

The problem of haplotype discovery from high-throughput sequencing data arises in several settings in which sequencing machines are used to recover all transcripts or bacteria in a given sample (transcriptome and bacteriome reconstruction), as well as in the assembly of viral quasispecies and the estimation of their frequencies. The standard model includes an instance of a sequencing machine run, consisting of a set of reads with observed frequencies, and a (prior) panel consisting of (1) a set of candidate strings that are believed to emit the observed reads and (2) a weighted match between reads and strings, where each weight is calculated from the mapping of the reads to the strings. The spectrum reconstruction problem asks for the most likely string frequencies explaining the emission of the observed reads, assuming that reads are emitted uniformly at random by strings according to the panel. The possible gaps in the standard model include (a) erroneous reads (caused by genotyping errors), (b) an incorrect list of candidate strings (absence of candidates caused by gaps in current databases and presence of chimeric candidates), (c) an inaccurate read-to-string match and, finally, (d) non-uniform emission of reads by strings. Since genotyping quality is improving, we focus on the incompleteness of the panel, i.e., of the existing list of candidate strings. The method presented here, referred to as HAPDIS, assumes that reads are emitted uniformly but allows for unknown strings. The model quality is measured by the deviation between expected and observed read frequencies. Based on the model quality, it is possible to decide whether the panel is likely to be incomplete. Subsequently, the total frequency of strings missing from the panel is estimated by adding a virtual string whose weighted matches with reads are determined by the deficit or excess of observed over expected read frequencies. Based on the weighted matches between the virtual string and the reads, we show that it is possible to extract the spectrum of reads (reads and their frequencies) that are likely to be emitted by the missing strings.
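
The standard model described in the abstract can be illustrated with a small sketch. The code below is not the HAPDIS implementation, whose details are not given in this record. It is a generic expectation-maximization estimate of string frequencies, followed by the deviation between observed and expected read frequencies. The function names, the row-wise normalization of weights into emission probabilities, and the mean absolute deviation are all assumptions made for illustration.

```python
def em_frequencies(weights, read_freqs, iterations=200):
    """Estimate string frequencies by EM (illustrative, not HAPDIS itself).

    weights[j][i]  -- match weight between candidate string j and read i
    read_freqs[i]  -- observed frequency of read i (assumed to sum to 1)
    Returns (f, h): string frequencies and per-string emission probabilities.
    """
    n_strings = len(weights)
    n_reads = len(read_freqs)
    # Assumption: each string emits its matching reads in proportion to weight.
    h = []
    for row in weights:
        total = sum(row) or 1.0  # guard against a string matching no read
        h.append([w / total for w in row])
    f = [1.0 / n_strings] * n_strings  # start from uniform frequencies
    for _ in range(iterations):
        new_f = [0.0] * n_strings
        for i in range(n_reads):
            denom = sum(f[j] * h[j][i] for j in range(n_strings))
            if denom == 0.0:
                continue  # read not explained by any string in the panel
            for j in range(n_strings):
                # E-step: share of read i attributed to string j
                new_f[j] += read_freqs[i] * f[j] * h[j][i] / denom
        total = sum(new_f)
        f = [x / total for x in new_f]  # M-step: renormalize
    return f, h


def expected_read_freqs(f, h):
    """Read frequencies predicted by the panel under string frequencies f."""
    return [sum(f[j] * h[j][i] for j in range(len(f)))
            for i in range(len(h[0]))]


def deviation(observed, expected):
    """Mean absolute deviation, one possible measure of model quality."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


# Hypothetical panel: two strings, four reads; read 3 matches no string.
weights = [[1, 1, 0, 0],
           [0, 1, 1, 0]]
observed = [0.2, 0.4, 0.2, 0.2]
f, h = em_frequencies(weights, observed)
expected = expected_read_freqs(f, h)
residual = [o - e for o, e in zip(observed, expected)]
print(f, deviation(observed, expected), residual)
```

In this toy panel the fourth read has no matching candidate string, so its observed frequency shows up as an excess over the expected frequency. The abstract describes using that kind of deficit or excess to weight a virtual string, but how HAPDIS does this is not specified here.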

Keywords :

biology computing; cellular biophysics; genetics; microorganisms; molecular biophysics; molecular configurations; HAPDIS; bacteria; bacteriome reconstruction; candidate strings; erroneous reads; genotyping errors; haplotype discovery; high-throughput sequencing data; nonuniform read emitting; sequencing machine; spectrum reconstruction; transcriptome; viral frequencies; viral quasispecies; Assembly; Bioinformatics; Data models; Databases; Frequency estimation; Splicing; 454 pyrosequencing; RNA-Seq; alternative splicing; expectation maximization; maximum likelihood;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computational Advances in Bio and Medical Sciences (ICCABS), 2011 IEEE 1st International Conference on

Conference_Location :

Orlando, FL

Print_ISBN :

978-1-61284-851-8

Type :

conf

DOI :

10.1109/ICCABS.2011.5729908

Filename :

5729908

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2737411