مرکز منطقه ای اطلاع رساني علوم و فناوري - The shortest common superstring problem: average case analysis for both exact and approximate matching

DocumentCode :

1537876

Title :

The shortest common superstring problem: average case analysis for both exact and approximate matching

Author :

Yang, En-Hui ; Zhang, Zhen

Author_Institution :

Dept. of Electr. & Comput. Eng., Waterloo Univ., Ont., Canada

Volume :

Issue :

fYear :

1999

fDate :

9/1/1999 12:00:00 AM

Firstpage :

1867

Lastpage :

1886

Abstract :

The shortest common superstring problem and its extension to approximate matching are considered in the probability model where each string in a given set has the same length and letters of strings are drawn independently from a finite set. In the exact matching case, several algorithms proposed in the literature are shown to be asymptotically optimal in the sense that the ratio of the savings resulting from the superstring constructed by each of these algorithms, that is the difference between the total length of the strings in the given set and the length of the superstring, to the optimal savings from the shortest superstring approaches in probability to 1 as the number of strings in the given set increases. In the approximate matching case, a modified version of the shortest common approximate matching superstring problem is analyzed; it is demonstrated that the optimal savings in this case is given approximately by nlogn/I_l(Q,Q,2D), where n is the number of strings in the given set, Q is the probability distribution governing the selection of letters of strings, I_l(Q,Q,2D) is the lower mutual information between Q and Q with respect to 2D, and D⩾0 is the distortion allowed in approximate matching. In addition, an approximation algorithm is proposed and proved asymptotically optimal

Keywords :

data compression; information theory; optimisation; probability; set theory; superstrings; DNA sequencing; approximate matching; approximation algorithm; asymptotically optimal algorithms; average case analysis; data compression; distortion; exact matching; finite set; letters; lower mutual information; optimal savings; probability; probability distribution; probability model; shortest common approximate matching superstring; shortest common superstring problem; string length; superstring length; Approximation algorithms; Assembly; Computer aided software engineering; DNA; Data compression; Databases; Greedy algorithms; Information analysis; Mutual information; Probability distribution;

fLanguage :

English

Journal_Title :

Information Theory, IEEE Transactions on

Publisher :

ieee

ISSN :

0018-9448

Type :

jour

DOI :

10.1109/18.782108

Filename :

782108

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1537876