DocumentCode
2709179
Title
Space Efficient String Mining under Frequency Constraints
Author
Fischer, Johannes ; Mäkinen, Veli ; Välimäki, Niko
Author_Institution
Center for Bioinf. (ZBIT), Univ. Tübingen, Tübingen, Germany
fYear
2008
fDate
15-19 Dec. 2008
Firstpage
193
Lastpage
202
Abstract
Let D1 and D2 be two databases (i.e., multisets) of d strings over an alphabet Sigma, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as item-sets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Sigma| << n (in particular for constant |Sigma|), as the databases themselves occupy only n log |Sigma| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Sigma| + d log n) bits, while the time requirement increases from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
Keywords
data mining; string matching; biologically relevant datasets; d strings; frequency constraints; mining discriminative patterns; real-life applications; space efficient string mining; Clustering algorithms; Cost function; Data analysis; Data mining; Frequency; Lagrangian functions; Linear discriminant analysis; Support vector machine classification; Support vector machines; Unsupervised learning; constraint based string mining
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
Conference_Location
Pisa
ISSN
1550-4786
Print_ISBN
978-0-7695-3502-9
Type
conf
DOI
10.1109/ICDM.2008.32
Filename
4781114