Performance Prediction for Apache Spark Platform

Author

Kewen Wang;Mohammad Maifi Hasan Khan

Author_Institution

Dept. of Comput. Sci. &

fYear

2015

Firstpage

166

Lastpage

173

Abstract

Apache Spark is an open source distributed data processing platform that uses a distributed memory abstraction to process large volumes of data efficiently. However, the performance of a particular job on the Apache Spark platform can vary significantly depending on the input data type and size, the design and implementation of the algorithm, and the computing capability, making it extremely difficult to predict performance metrics of a job such as execution time, memory footprint, and I/O cost. To address this challenge, in this paper we present a simulation-driven prediction model that can predict job performance with high accuracy for the Apache Spark platform. Specifically, as Apache Spark jobs often consist of multiple sequential stages, the presented prediction model simulates the execution of the actual job using only a fraction of the input data, and collects execution traces (e.g., I/O overhead, memory consumption, execution time) to predict job performance for each execution stage individually. We evaluated our prediction framework using four real-life applications on a 13-node cluster, and experimental results show that the model can achieve high prediction accuracy.

Keywords

"Sparks","Predictive models","Data models","Computational modeling","Accuracy","Memory management","Measurement"

Publisher

IEEE

Conference_Titel

High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on

Type

conf

DOI

10.1109/HPCC-CSS-ICESS.2015.246

Filename

7336160