DocumentCode :
3334563
Title :
Mining Frequent Patterns with Wildcards from Biological Sequences
Author :
He, Yu ; Wu, Xindong ; Zhu, Xingquan ; Arslan, Abdullah N.
Author_Institution :
Univ. of Vermont, Burlington
fYear :
2007
fDate :
13-15 Aug. 2007
Firstpage :
329
Lastpage :
334
Abstract :
Frequent pattern mining from sequences is a crucial step for many domain experts, such as molecular biologists, to discover rules or patterns hidden in their data. In order to find specific patterns, many existing tools require users to specify gap constraints beforehand. In practice, it is often nontrivial for a user to provide such gap constraints. In addition, a change to the gap values may give completely different results and require a separate, time-consuming re-mining procedure. Consequently, it is desirable to develop an algorithm that automatically and efficiently finds patterns without user-specified gap constraints. In this paper, a frequent pattern mining problem without user-specified gap constraints is presented and studied. Given a sequence and a support threshold value, all subsequences whose support is not less than the given threshold value will be discovered. These frequent subsequences then form patterns later on. Two heuristic methods (one-way and two-way scan) are proposed to mine frequent subsequences and estimate the maximum support for both artificial and real-world data. Given a specific pattern, the simulation results demonstrate that the one-way scan heuristic performs better, estimating the maximum support with more than ninety percent accuracy.
Keywords :
DNA; biology computing; data mining; pattern recognition; DNA biological sequences; discover rules; frequent pattern mining; frequent subsequence mining; gap constraints; molecular biology; pattern finding; wildcard; Amino acids; Bioinformatics; Biological information theory; Biology; Computer science; DNA; Data engineering; Genomics; Helium; Protein sequence;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
1-4244-1500-4
Electronic_ISBN :
1-4244-1500-4
Type :
conf
DOI :
10.1109/IRI.2007.4296642
Filename :
4296642