DocumentCode
1004253
Title
VARUN: Discovering Extensible Motifs under Saturation Constraints
Author
Apostolico, Alberto ; Comin, Matteo ; Parida, Laxmi
Author_Institution
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
Volume
7
Issue
4
fYear
2010
Firstpage
752
Lastpage
762
Abstract
The discovery of motifs in biosequences is frequently torn between the rigidity of the model on one hand and the abundance of candidates on the other. In particular, the number of candidate motifs that include wild cards or "don't cares" grows exponentially with the number of such characters, and this only gets worse if a don't care is allowed to stretch up to some prescribed maximum length. In this paper, a notion of extensible motif in a sequence is introduced and studied, which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. It is shown that a combination of appropriate saturation conditions and the monotonicity of probabilistic scores over regions of constant frequency affords significant parsimony in the generation and testing of candidate overrepresented motifs. A suite of software programs called Varun is described, implementing the discovery of extensible motifs of the type considered. The merits of the method are then documented by results obtained in a variety of experiments primarily targeting protein sequence families. Equally important, the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.
Keywords
biology computing; molecular biophysics; proteins; VARUN; Varun1; biosequences; extensible motifs; protein sequence families; saturation constraints; software programs; syntactic specification; Bioinformatics; Biological information theory; Biological system modeling; Data mining; Error correction codes; Frequency; Genomics; Protein sequence; Testing; Computational genomics; motif; pattern discovery; protein family; Amino Acid Motifs; Proteins; Sequence Analysis, Protein; Software
fLanguage
English
Journal_Title
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher
IEEE
ISSN
1545-5963
Type
jour
DOI
10.1109/TCBB.2008.123
Filename
4685892