DocumentCode
3696966
Title
Performance Prediction for Apache Spark Platform
Author
Kewen Wang;Mohammad Maifi Hasan Khan
Author_Institution
Dept. of Comput. Sci. &
fYear
2015
Firstpage
166
Lastpage
173
Abstract
Apache Spark is an open source distributed data processing platform that uses distributed memory abstraction to process large volume of data efficiently. However, performance of a particular job on Apache Spark platform can vary significantly depending on the input data type and size, design and implementation of the algorithm, and computing capability, making it extremely difficult to predict the performance metric of a job such as execution time, memory footprint, and I/O cost. To address this challenge, in this paper, we present a simulation driven prediction model that can predict job performance with high accuracy for Apache Spark platform. Specifically, as Apache spark jobs are often consist of multiple sequential stages, the presented prediction model simulates the execution of the actual job by using only a fraction of the input data, and collect execution traces (e.g., I/O overhead, memory consumption, execution time) to predict job performance for each execution stage individually. We evaluated our prediction framework using four real-life applications on a 13 node cluster, and experimental results show that the model can achieve high prediction accuracy.
Keywords
"Sparks","Predictive models","Data models","Computational modeling","Accuracy","Memory management","Measurement"
Publisher
ieee
Conference_Titel
High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conferen on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on
Type
conf
DOI
10.1109/HPCC-CSS-ICESS.2015.246
Filename
7336160
Link To Document