Title :
Superiority of Spaced Seeds for Homology Search
Author_Institution :
National University of Singapore, Singapore
Abstract :
In homology search, good spaced seeds achieve higher sensitivity than consecutive seeds of the same cost (weight). However, two questions remain open: what mechanism confers this power on spaced seeds, and how optimal spaced seeds can be characterized. This paper investigates both questions by formally analyzing the average number of nonoverlapping hits and the hit probability of a spaced seed in the Bernoulli sequence model. We prove that, when the length of a nonuniformly spaced seed is bounded above by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed of the same weight in both 1) the average number of nonoverlapping hits and 2) the asymptotic hit probability. This answers the first question in the Bernoulli sequence model. The theoretical study in this paper also gives a new solution to finding long optimal seeds.
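As a minimal sketch of the hit probability discussed above, the following Python computes, by dynamic programming, the probability that a seed hits a random region at least once in the Bernoulli sequence model (each position matches independently with probability p). The function name `hit_probability`, the `1`/`*` seed notation, and the parameter names are illustrative assumptions and do not come from the paper, whose results are analytical.

```python
def hit_probability(seed: str, n: int, p: float) -> float:
    """Probability that `seed` hits a Bernoulli(p) region of length n at least once.

    `seed` uses '1' for a required match and '*' for a don't-care position
    (an assumed notation). It must start and end with '1'.
    """
    if not seed or seed[0] != "1" or seed[-1] != "1":
        raise ValueError("seed must start and end with '1'")
    span = len(seed)
    if n < span:
        return 0.0  # the seed does not fit in the region

    # Bit (span - 1 - i) of a window corresponds to seed position i,
    # so the newest region position is the least significant bit.
    seed_mask = sum(1 << (span - 1 - i) for i, c in enumerate(seed) if c == "1")
    keep = (1 << (span - 1)) - 1  # keeps the most recent span - 1 positions

    # no_hit maps "last span - 1 positions" to the probability of reaching
    # that state without any hit so far. The initial all-zero padding cannot
    # create a hit because the first seed position is required.
    no_hit = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for state, prob in no_hit.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if window & seed_mask == seed_mask:
                    continue  # the seed hits here; drop this probability mass
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


if __name__ == "__main__":
    # Compare a consecutive seed with a nonuniformly spaced seed of equal weight.
    for s in ("111", "11*1"):
        print(s, hit_probability(s, n=64, p=0.7))
```

The number of states grows as 2^(span - 1), so this sketch is practical only for short seeds. Because the hit-probability result stated above is asymptotic, a comparison at small region lengths may not show the spaced seed ahead.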
Keywords :
biology computing; Bernoulli sequence model; bioinformatics; homology search; long optimal seed; seed weight; spaced seeds; Costs; DNA; Databases; Filtration; Pattern matching; Probability; Protein sequence; Statistics; renewal theory; run statistics; sequence alignment; Algorithms; Pattern Recognition, Automated; Sequence Analysis; Sequence Homology
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
DOI :
10.1109/tcbb.2007.1013