• DocumentCode
    2720605
  • Title

    Towards a parameter-free and parallel itemset mining algorithm in linearithmic time

  • Author

    Buehrer, Gregory ; de Oliveira, Roberto L. ; Fuhry, David ; Parthasarathy, Srinivasan

  • Author_Institution
    Microsoft, Redmond, WA, USA
  • fYear
    2015
  • fDate
    13-17 April 2015
  • Firstpage
    1071
  • Lastpage
    1082
  • Abstract
    Extracting interesting patterns from large data stores efficiently is a challenging problem in many domains. In the data mining literature, pattern frequency has often been touted as a proxy for interestingness and has been leveraged as a pruning criteria to realize scalable solutions. However, while there exist many frequent pattern algorithms in the literature, all scale exponentially in the worst case, restricting their utility on very large data sets. Furthermore, as we theoretically argue in this article, the problem is very hard to approximate within a reasonable factor, with a polynomial time algorithm. As a counter point to this theoretical result, we present a practical algorithm called Localized Approximate Miner (LAM) that scales linearithmically with the input data. Instead of fully exploring the top of the search lattice to a user-defined point, as traditional mining algorithms do, we instead explore different parts of the complete lattice, efficiently. The key to this efficient exploration is the reliance on min-wise independent permutations to collect the data into highly similar subsets of a partition. It is straightforward to implement and scales to very large data sets. We illustrate its utility on a range of data sets, and demonstrate that the algorithm finds more patterns of higher utility in much less time than several state-of-the-art algorithms. Moreover, we realize a natural multi-level parallelization of LAM that further reduces runtimes by up to 193-fold when leveraging 256 CMP cores spanning 32 machines.
  • Keywords
    computational complexity; data mining; parallel processing; CMP cores; LAM; data collection; data mining; data sets; frequent pattern algorithms; large data storage; linearithmic time; localized approximate miner; min-wise independent permutation; natural multilevel parallelization; parameter-free parallel itemset mining algorithm; pattern frequency; polynomial time algorithm; search lattice; user-defined point; Integrated circuits; Itemsets;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Engineering (ICDE), 2015 IEEE 31st International Conference on
  • Conference_Location
    Seoul
  • Type

    conf

  • DOI
    10.1109/ICDE.2015.7113357
  • Filename
    7113357