DocumentCode :
951956
Title :
2SNP: Scalable Phasing Method for Trios and Unrelated Individuals
Author :
Brinza, Dumitru ; Zelikovsky, Alexander
Author_Institution :
Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA
Volume :
5
Issue :
2
fYear :
2008
Firstpage :
313
Lastpage :
318
Abstract :
Emerging microarray technologies allow affordable typing of very long genome sequences. A key challenge in analyzing such a huge amount of data is the scalable and accurate computational inference of haplotypes (that is, splitting each genotype into a pair of corresponding haplotypes). In this paper, we first phase genotypes consisting of only two SNPs using genotype frequencies adjusted to the random mating model and then extend the phasing of two-SNP genotypes to the phasing of complete genotypes using maximum spanning trees. The runtime of the proposed 2SNP algorithm is O(nm(n + log m)), where n and m are the numbers of genotypes and SNPs, respectively, and it can handle genotypes spanning entire chromosomes in a matter of hours. On data sets across 23 chromosomal regions from HapMap [11], 2SNP is several orders of magnitude faster than GERBIL and PHASE while matching them in quality, as measured by the number of correctly phased genotypes, single-site errors, and switching errors. For example, the 2SNP software phases an entire chromosome (10^5 SNPs from HapMap) for 30 individuals in 2 hours with an average switching error of 7.7 percent. We have also enhanced the 2SNP algorithm to phase family trio data and compared it with four other well-known phasing methods on simulated data from [15]. 2SNP is much faster than all of them while losing in quality only to PHASE. The 2SNP software is publicly available at http://alla.cs.gsu.edu/~software/2SNP.
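The maximum spanning tree step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each pair of SNPs has already been given a numeric weight expressing how reliable its two-SNP phasing is; the weight values, the function name maximum_spanning_tree, and the use of Kruskal's algorithm are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def maximum_spanning_tree(num_snps: int, weights: Dict[Edge, float]) -> List[Edge]:
    """Kruskal's algorithm with edges visited from heaviest to lightest.

    `weights` maps a SNP pair (i, j) to a hypothetical reliability score
    for the two-SNP phasing of that pair (an assumption of this sketch).
    """
    parent = list(range(num_snps))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree: List[Edge] = []
    ordered = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), _weight in ordered:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # keep the edge only if it joins two components
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


# Four SNPs with made-up pairwise scores.
example_weights = {(0, 1): 5.0, (1, 2): 1.0, (0, 2): 3.0, (2, 3): 2.0}
example_tree = maximum_spanning_tree(4, example_weights)
```

In this toy input the weakest pair (1, 2) is left out, so the phase between SNPs 1 and 2 would be implied by the path through SNP 0 rather than by their own, less reliable, two-SNP phasing.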
Keywords :
DNA; biochemistry; biology computing; cellular biophysics; genetics; molecular biophysics; trees (mathematics); 2SNP algorithm; 2SNP software phases; GERBIL; HapMap; PHASE; chromosomal regions; computational haplotypes inferring; correctly phased genotypes; genome sequences; genotype splitting; genotypes frequencies; maximum spanning trees; microarray technologies; phase family trios data; random mating model; scalable phasing method; single nucleotide polymorphism; switching errors; unrelated individuals; SNP; algorithm; genotype; haplotype; phasing; Algorithms; Computational Biology; Databases, Nucleic Acid; Female; Genotype; Haplotypes; Humans; Male; Models, Genetic; Oligonucleotide Array Sequence Analysis; Polymorphism, Single Nucleotide; Software;
fLanguage :
English
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
1545-5963
Type :
jour
DOI :
10.1109/TCBB.2007.1068
Filename :
4359863