DocumentCode :
1514465
Title :
Discovery of Delta Closed Patterns and Noninduced Patterns from Sequences
Author :
Wong, Andrew K.C. ; Zhuang, Dennis ; Li, Gary C.L. ; Lee, En-Shiun Annie
Author_Institution :
Dept. of Syst. Design Eng., Univ. of Waterloo, Waterloo, ON, Canada
Volume :
24
Issue :
8
fYear :
2012
Firstpage :
1408
Lastpage :
1421
Abstract :
Discovering patterns in sequence data has a significant impact on many areas of science and society, especially genomics and proteomics. Here we consider multiple strings as the input sequence data and substrings as patterns. In practice, a large set of patterns can usually be discovered, yet many of them are redundant, which degrades the output quality. This paper improves the output quality by removing two types of redundant patterns. First, the notion of a delta tolerance closed itemset is employed to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns that appear statistically significant only because their significance is induced by strongly significant subpatterns. Mining these nonredundant patterns (delta closed patterns and noninduced patterns) is computationally intensive. To discover them efficiently in very large sequence data, two algorithms have been developed through an innovative use of the suffix tree. Three sets of experiments were conducted to evaluate their performance, and the algorithms yield excellent results when applied to genomics. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns that reveal interesting information in the sequences.
Keywords :
data mining; sequences; delta closed pattern discovery; delta tolerance closed itemset; genomics; noninduced patterns; output quality; proteomics; redundant patterns; sequence data; statistically induced patterns; suffix tree; algorithm design and analysis; frequency measurement; hidden Markov models; itemsets; Markov processes; sequence pattern discovery; delta closed patterns
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2011.100
Filename :
5765954