DocumentCode
3172008
Title
Towards Machine Learning-Based Auto-tuning of MapReduce
Author
Yigitbasi, Nezih ; Willke, Theodore L. ; Liao, Guangdeng ; Epema, Dick
Author_Institution
Intel Labs, Hillsboro, OR, USA
fYear
2013
fDate
14-16 Aug. 2013
Firstpage
11
Lastpage
20
Abstract
MapReduce, which is the de facto programming model for large-scale distributed data processing, and its most popular implementation, Hadoop, have enjoyed widespread adoption in industry during the past few years. Unfortunately, from a performance point of view, getting the most out of Hadoop is still a big challenge due to the large number of configuration parameters. Currently, these parameters are tuned manually by trial and error, which is ineffective due to the large parameter space and the complex interactions among the parameters. Even worse, the parameters have to be re-tuned for different MapReduce applications and clusters. To make the parameter tuning process more effective, in this paper we explore machine learning-based performance models that we use to auto-tune the configuration parameters. To this end, we first evaluate several machine learning models with diverse MapReduce applications and cluster configurations, and we show that the support vector regression (SVR) model has good accuracy and is also computationally efficient. We further assess our auto-tuning approach, which uses the SVR performance model, against the Starfish auto-tuner, which uses a cost-based performance model. Our findings reveal that our auto-tuning approach can provide comparable or, in some cases, better performance improvements than Starfish with a smaller number of parameters. Finally, we propose and discuss a complete and practical end-to-end auto-tuning flow that combines our machine learning-based performance models with smart search algorithms for the effective training of the models and the effective exploration of the parameter space.
Keywords
distributed programming; learning (artificial intelligence); public domain software; regression analysis; search problems; support vector machines; Hadoop; MapReduce; SVR performance model; cluster configurations; configuration parameters; cost-based performance model; de facto programming model; end-to-end autotuning flow; large-scale distributed data processing; machine learning-based autotuning approach; machine learning-based performance models; parameter tuning process; smart search algorithms; starfish autotuner; support vector regression model; Accuracy; Benchmark testing; Computational modeling; Data models; Training; Training data; Tuning; big data; distributed systems; hadoop; performance modeling;
fLanguage
English
Publisher
IEEE
Conference_Titel
Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2013 IEEE 21st International Symposium on
Conference_Location
San Francisco, CA
ISSN
1526-7539
Type
conf
DOI
10.1109/MASCOTS.2013.9
Filename
6730744