Title :
A Case Study of Optimizing Big Data Analytical Stacks Using Structured Data Shuffling
Author :
Dixin Tang;Taoying Liu;Rubao Lee;Hong Liu;Wei Li
Abstract :
Current major big data analytical stacks often consist of a general-purpose, multi-stage computation framework (e.g., Hadoop) with an SQL query system (e.g., Hive) on top of it. A key factor in query performance is the efficiency of data shuffling between two execution stages (e.g., Map/Reduce). In current data shuffling, much useful information about the shuffled data and the query over that data is simply discarded. In this paper, we make a strong case for cross-layer optimization of the Hive/Hadoop stack: we have designed and implemented a novel data shuffling mechanism in Hadoop, called Structured Data Shuffling (S-Shuffle), which carefully leverages the rich information in data and queries to optimize overall query processing. Our experimental results with the industry-standard TPC-H benchmark show that, by using S-Shuffle, the performance of SQL query processing on Hadoop can be improved by up to 2.4x.
Keywords :
"Merging","Sorting","Optimization","Data mining","Yttrium","Big data","Compression algorithms"
Conference_Titel :
2015 IEEE International Conference on Cluster Computing (CLUSTER)
DOI :
10.1109/CLUSTER.2015.19
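Note :
The paper's actual S-Shuffle mechanism is not reproduced in this record. Purely as an illustration of the idea named in the abstract (making Hadoop's shuffle aware of the structured data and the query on it), the minimal sketch below shows a custom Hadoop partitioner that routes map output by a single query-relevant key column instead of hashing the entire raw record. The class name, the '|' field delimiter, and the assumption that the partitioning column is the first field are all hypothetical choices for this sketch, not details taken from the paper.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch only: partition map output on one structured key column
// (e.g., the join or group-by column of the query) rather than on the whole
// opaque record, so rows that must meet in the same reducer are co-located.
public class ColumnAwarePartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Assumption: the map output key is a '|'-delimited row prefix and
        // its first field is the column the query partitions on.
        String column = key.toString().split("\\|", 2)[0];
        // Mask the sign bit so the partition index is always non-negative.
        return (column.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A job would enable such a partitioner with job.setPartitionerClass(ColumnAwarePartitioner.class); the point of the sketch is only that shuffle behavior can be driven by knowledge of the data layout and the query, which is the kind of cross-layer information the abstract says S-Shuffle exploits.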