Groupwise analytics via adaptive MapReduce

Author

Liping Peng ; Kai Zeng ; Balmin, Andrey ; Ercegovac, Vuk ; Haas, Peter J. ; Sismanis, Yannis

Author_Institution

Sch. of Comput. Sci., Univ. of Massachusetts, Amherst, MA, USA

fYear

2015

fDate

13-17 April 2015

Firstpage

1059

Lastpage

1070

Abstract

Shared-nothing systems such as Hadoop vastly simplify parallel programming when processing disk-resident data whose size exceeds aggregate cluster memory. Such systems incur a significant performance penalty, however, on the important class of “groupwise set-valued analytics” (GSVA) queries in which the data is dynamically partitioned into groups and then a set-valued synopsis is computed for some or all of the groups. Key examples of synopses include top-k sets, bottom-k sets, and uniform random samples. Applications of GSVA queries include micro-marketing, root-cause analysis for problem diagnosis, and fraud detection. A naive approach to executing GSVA queries first reshuffles all of the data so that all records in a group are at the same node and then computes the synopsis for the group. This approach can be extremely inefficient when, as is typical, only a very small fraction of the records in each group actually contribute to the final groupwise synopsis, so that most of the shuffling effort is wasted. We show how to significantly speed up GSVA queries by slightly modifying the shared-nothing environment to allow tasks to occasionally access a small, common data structure; we focus on the Hadoop setting and use the “Adaptive MapReduce” infrastructure of Vernica et al. to implement the data structure. Our approach retains most of the advantages of a system such as Hadoop while significantly improving GSVA query performance, and also allows for incremental updating of query results. Experiments show speedups of up to 5x. Importantly, our new technique can potentially be applied to other shared-nothing systems with disk-resident data.

Keywords

data handling; data structures; parallel programming; query processing; set theory; storage management; GSVA queries; Hadoop; adaptive MapReduce; aggregate cluster memory; bottom-k sets; disk-resident data; disk-resident data processing; fraud detection; groupwise set-valued analytics; micromarketing; parallel programming; problem diagnosis; root-cause analysis; set-valued synopsis; shared-nothing systems; top-k sets; uniform random samples; Aggregates; Approximation methods; Data structures; Electronic mail; Generators; Optimization; Standards;

fLanguage

English

Publisher

ieee

Conference_Titel

Data Engineering (ICDE), 2015 IEEE 31st International Conference on

Conference_Location

Seoul

Type

conf

DOI

10.1109/ICDE.2015.7113356

Filename

7113356