DocumentCode :
3172008
Title :
Towards Machine Learning-Based Auto-tuning of MapReduce
Author :
Yigitbasi, Nezih ; Willke, Theodore L. ; Guangdeng Liao ; Epema, Dick
Author_Institution :
Intel Labs, Hillsboro, OR, USA
fYear :
2013
fDate :
14-16 Aug. 2013
Firstpage :
11
Lastpage :
20
Abstract :
MapReduce, which is the de facto programming model for large-scale distributed data processing, and its most popular implementation, Hadoop, have enjoyed widespread adoption in industry during the past few years. Unfortunately, from a performance point of view, getting the most out of Hadoop is still a big challenge due to the large number of configuration parameters. Currently, these parameters are tuned manually by trial and error, which is ineffective due to the large parameter space and the complex interactions among the parameters. Even worse, the parameters have to be re-tuned for different MapReduce applications and clusters. To make the parameter tuning process more effective, in this paper we explore machine learning-based performance models that we use to auto-tune the configuration parameters. To this end, we first evaluate several machine learning models with diverse MapReduce applications and cluster configurations, and we show that the support vector regression (SVR) model has good accuracy and is also computationally efficient. We further assess our auto-tuning approach, which uses the SVR performance model, against the Starfish auto-tuner, which uses a cost-based performance model. Our findings reveal that our auto-tuning approach can provide comparable or in some cases better performance improvements than Starfish with a smaller number of parameters. Finally, we propose and discuss a complete and practical end-to-end auto-tuning flow that combines our machine learning-based performance models with smart search algorithms for the effective training of the models and the effective exploration of the parameter space.
Keywords :
distributed programming; learning (artificial intelligence); public domain software; regression analysis; search problems; support vector machines; Hadoop; MapReduce; SVR performance model; cluster configurations; configuration parameters; cost-based performance model; de facto programming model; end-to-end autotuning flow; large-scale distributed data processing; machine learning-based autotuning approach; machine learning-based performance models; parameter tuning process; smart search algorithms; Starfish autotuner; support vector regression model; Accuracy; Benchmark testing; Computational modeling; Data models; Training; Training data; Tuning; big data; distributed systems; hadoop; performance modeling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2013 IEEE 21st International Symposium on
Conference_Location :
San Francisco, CA
ISSN :
1526-7539
Type :
conf
DOI :
10.1109/MASCOTS.2013.9
Filename :
6730744