DocumentCode
951956
Title
2SNP: Scalable Phasing Method for Trios and Unrelated Individuals
Author
Brinza, Dumitru ; Zelikovsky, Alexander
Author_Institution
Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA
Volume
5
Issue
2
fYear
2008
Firstpage
313
Lastpage
318
Abstract
Emerging microarray technologies allow affordable typing of very long genome sequences. A key challenge in analyzing such a huge amount of data is the scalable and accurate computational inference of haplotypes (that is, splitting each genotype into a pair of corresponding haplotypes). In this paper, we first phase genotypes consisting of only two SNPs using genotype frequencies adjusted to the random mating model and then extend the phasing of two-SNP genotypes to the phasing of complete genotypes using maximum spanning trees. The runtime of the proposed 2SNP algorithm is O(nm(n + log m)), where n and m are the numbers of genotypes and SNPs, respectively, and it can handle genotypes spanning entire chromosomes in a matter of hours. On data sets across 23 chromosomal regions from HapMap [11], 2SNP is several orders of magnitude faster than GERBIL and PHASE while matching them in quality, as measured by the number of correctly phased genotypes, single-site errors, and switching errors. For example, the 2SNP software phases an entire chromosome (10^5 SNPs from HapMap) for 30 individuals in 2 hours with an average switching error of 7.7 percent. We have also enhanced the 2SNP algorithm to phase family trio data and compared it with four other well-known phasing methods on simulated data from [15]. 2SNP is much faster than all of them while losing in quality only to PHASE. The 2SNP software is publicly available at http://alla.cs.gsu.edu/~software/2SNP.
Keywords
DNA; biochemistry; biology computing; cellular biophysics; genetics; molecular biophysics; trees (mathematics); 2SNP algorithm; 2SNP software phases; GERBIL; HapMap; PHASE; chromosomal regions; computational haplotypes inferring; correctly phased genotypes; genome sequences; genotype splitting; genotypes frequencies; maximum spanning trees; microarray technologies; phase family trios data; random mating model; scalable phasing method; single nucleotide polymorphism; switching errors; unrelated individuals; SNP; algorithm; genotype; haplotype; phasing; Algorithms; Computational Biology; Databases, Nucleic Acid; Female; Genotype; Haplotypes; Humans; Male; Models, Genetic; Oligonucleotide Array Sequence Analysis; Polymorphism, Single Nucleotide; Software;
fLanguage
English
Journal_Title
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher
IEEE
ISSN
1545-5963
Type
jour
DOI
10.1109/TCBB.2007.1068
Filename
4359863