  • DocumentCode
    2720587
  • Title
    Groupwise analytics via adaptive MapReduce
  • Author
    Peng, Liping; Zeng, Kai; Balmin, Andrey; Ercegovac, Vuk; Haas, Peter J.; Sismanis, Yannis
  • Author_Institution
    Sch. of Comput. Sci., Univ. of Massachusetts, Amherst, MA, USA
  • fYear
    2015
  • fDate
    13-17 April 2015
  • Firstpage
    1059
  • Lastpage
    1070
  • Abstract
    Shared-nothing systems such as Hadoop vastly simplify parallel programming when processing disk-resident data whose size exceeds aggregate cluster memory. Such systems incur a significant performance penalty, however, on the important class of “groupwise set-valued analytics” (GSVA) queries in which the data is dynamically partitioned into groups and then a set-valued synopsis is computed for some or all of the groups. Key examples of synopses include top-k sets, bottom-k sets, and uniform random samples. Applications of GSVA queries include micro-marketing, root-cause analysis for problem diagnosis, and fraud detection. A naive approach to executing GSVA queries first reshuffles all of the data so that all records in a group are at the same node and then computes the synopsis for the group. This approach can be extremely inefficient when, as is typical, only a very small fraction of the records in each group actually contribute to the final groupwise synopsis, so that most of the shuffling effort is wasted. We show how to significantly speed up GSVA queries by slightly modifying the shared-nothing environment to allow tasks to occasionally access a small, common data structure; we focus on the Hadoop setting and use the “Adaptive MapReduce” infrastructure of Vernica et al. to implement the data structure. Our approach retains most of the advantages of a system such as Hadoop while significantly improving GSVA query performance, and also allows for incremental updating of query results. Experiments show speedups of up to 5x. Importantly, our new technique can potentially be applied to other shared-nothing systems with disk-resident data.
  • Keywords
    data handling; data structures; parallel programming; query processing; set theory; storage management; GSVA queries; Hadoop; adaptive MapReduce; aggregate cluster memory; bottom-k sets; disk-resident data; disk-resident data processing; fraud detection; groupwise set-valued analytics; micromarketing; problem diagnosis; root-cause analysis; set-valued synopsis; shared-nothing systems; top-k sets; uniform random samples; Aggregates; Approximation methods; Data structures; Electronic mail; Generators; Optimization; Standards
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2015 IEEE 31st International Conference on
  • Conference_Location
    Seoul
  • Type
    conf
  • DOI
    10.1109/ICDE.2015.7113356
  • Filename
    7113356