• DocumentCode
    2709179
  • Title

    Space Efficient String Mining under Frequency Constraints

  • Author

    Fischer, Johannes ; Mäkinen, Veli ; Välimäki, Niko

  • Author_Institution
    Center for Bioinf. (ZBIT), Univ. Tübingen, Tübingen, Germany
  • fYear
    2008
  • fDate
    15-19 Dec. 2008
  • Firstpage
    193
  • Lastpage
    202
  • Abstract
    Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Sigma, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as item-sets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Sigma| << n (in particular for constant |Sigma|), as the databases themselves occupy only n log |Sigma| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Sigma| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
  • Keywords
    data mining; string matching; biologically relevant datasets; d strings; frequency constraints; mining discriminative patterns; real-life applications; space efficient string mining; Clustering algorithms; Cost function; Data analysis; Data mining; Frequency; Lagrangian functions; Linear discriminant analysis; Support vector machine classification; Support vector machines; Unsupervised learning; constraint based string mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
  • Conference_Location
    Pisa
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3502-9
  • Type

    conf

  • DOI
    10.1109/ICDM.2008.32
  • Filename
    4781114