• DocumentCode
    3127457
  • Title

    Frequent Pairs in Data Streams: Exploiting Parallelism and Skew

  • Author

    Campagna, Andrea ; Kutzkov, Konstantin ; Pagh, Rasmus

  • Author_Institution
    IT Univ. of Copenhagen, Copenhagen, Denmark
  • fYear
    2011
  • fDate
    11-11 Dec. 2011
  • Firstpage
    145
  • Lastpage
    150
  • Abstract
    We introduce the Pair Streaming Engine (PairSE) that detects frequent pairs in a data stream of transactions. Our algorithm finds the most frequent pairs with high probability, and gives tight bounds on their frequency. It is particularly space efficient for skewed distribution of pair supports, confirmed for several real-world datasets. Additionally, the algorithm parallelizes easily, which opens up for real-time processing of large transactions. Unlike previous algorithms we make no assumptions on the order of arrival of transactions and pairs. Our algorithm builds upon approaches for frequent items mining in data streams. We show how to efficiently scale these approaches to handle large transactions. We report experimental results showcasing precision and recall of our method. In particular, we find that often our method achieves excellent precision, returning identical upper and lower bounds on the supports of the most frequent pairs.
  • Keywords
    data handling; probability; PairSE; data stream mining; pair streaming engine; probability analysis; real-time processing; real-world datasets; skewed distribution; Accidents; Accuracy; Data mining; Data structures; Frequency estimation; Itemsets; Radiation detectors; algorithm; association rule; data stream; parallel; sharednothing;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on
  • Conference_Location
    Vancouver, BC
  • Print_ISBN
    978-1-4673-0005-6
  • Type

    conf

  • DOI
    10.1109/ICDMW.2011.87
  • Filename
    6137373