DocumentCode :
1537876
Title :
The shortest common superstring problem: average case analysis for both exact and approximate matching
Author :
Yang, En-Hui ; Zhang, Zhen
Author_Institution :
Dept. of Electr. & Comput. Eng., Waterloo Univ., Ont., Canada
Volume :
45
Issue :
6
fYear :
1999
fDate :
9/1/1999 12:00:00 AM
Firstpage :
1867
Lastpage :
1886
Abstract :
The shortest common superstring problem and its extension to approximate matching are considered in the probability model where each string in a given set has the same length and letters of strings are drawn independently from a finite set. In the exact matching case, several algorithms proposed in the literature are shown to be asymptotically optimal in the sense that the ratio of the savings resulting from the superstring constructed by each of these algorithms, that is the difference between the total length of the strings in the given set and the length of the superstring, to the optimal savings from the shortest superstring approaches in probability to 1 as the number of strings in the given set increases. In the approximate matching case, a modified version of the shortest common approximate matching superstring problem is analyzed; it is demonstrated that the optimal savings in this case is given approximately by nlogn/Il(Q,Q,2D), where n is the number of strings in the given set, Q is the probability distribution governing the selection of letters of strings, Il(Q,Q,2D) is the lower mutual information between Q and Q with respect to 2D, and D⩾0 is the distortion allowed in approximate matching. In addition, an approximation algorithm is proposed and proved asymptotically optimal
Keywords :
data compression; information theory; optimisation; probability; set theory; superstrings; DNA sequencing; approximate matching; approximation algorithm; asymptotically optimal algorithms; average case analysis; data compression; distortion; exact matching; finite set; letters; lower mutual information; optimal savings; probability; probability distribution; probability model; shortest common approximate matching superstring; shortest common superstring problem; string length; superstring length; Approximation algorithms; Assembly; Computer aided software engineering; DNA; Data compression; Databases; Greedy algorithms; Information analysis; Mutual information; Probability distribution;
fLanguage :
English
Journal_Title :
Information Theory, IEEE Transactions on
Publisher :
ieee
ISSN :
0018-9448
Type :
jour
DOI :
10.1109/18.782108
Filename :
782108
Link To Document :
بازگشت