Author :
Rosa, Andrea ; Chen, Lydia Y. ; Binder, Walter
Author_Institution :
Fac. of Inf., Univ. della Svizzera italiana, Lugano, Switzerland
Abstract :
In large-scale data centers, software and hardware failures are frequent, resulting in failed job executions that may cause significant resource waste and performance deterioration. To proactively minimize the resource inefficiency due to job failures, it is important to identify them in advance using key job attributes. However, prevailing research on datacenter workload characterization has so far overlooked job failures, including their patterns, root causes, and impact. In this paper, we aim to develop prediction models and mitigation policies for unsuccessful jobs, so as to reduce the resource waste in big data centers. In particular, we base our analysis on Google cluster traces, consisting of a large number of big-data jobs with a high task fan-out. We first identify the time-varying patterns of failed jobs and the contributing system features. Based on our characterization study, we develop an online predictive model for job failures by applying various statistical learning techniques, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR). Furthermore, we propose a delay-based mitigation policy which, after a certain grace period, proactively terminates the execution of jobs that are predicted to fail. The objective of postponing job terminations is to strike a good tradeoff between resource waste and the false termination of jobs that would have succeeded. Our evaluation results show that the proposed method reduces resource waste significantly, by 41.9% on average, while keeping false job terminations low, at only 1%.
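The mechanism in the abstract can be illustrated with a minimal sketch: a logistic-regression-style failure score combined with a delay-based termination policy. The feature weights, grace period, and threshold below are illustrative placeholders, not values or a model taken from the paper.

```python
import math

def failure_probability(features, weights, bias):
    """Logistic-regression score p(fail) = sigmoid(w . x + b).
    In practice the weights would be fitted to labeled job traces;
    here they are hypothetical placeholders."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def delay_based_policy(jobs, grace_period, threshold):
    """Return IDs of jobs to terminate: only jobs that have already
    run for at least `grace_period` seconds AND whose predicted
    failure probability exceeds `threshold`. Postponing termination
    by the grace period trades residual resource waste against
    falsely killing jobs that would have succeeded."""
    return [
        job["id"]
        for job in jobs
        if job["elapsed"] >= grace_period and job["p_fail"] > threshold
    ]

# Example: three running jobs with hypothetical elapsed times and scores.
jobs = [
    {"id": "job-a", "elapsed": 120, "p_fail": 0.90},  # past grace, likely to fail
    {"id": "job-b", "elapsed": 30,  "p_fail": 0.95},  # likely to fail, but still in grace period
    {"id": "job-c", "elapsed": 200, "p_fail": 0.10},  # predicted to succeed
]
print(delay_based_policy(jobs, grace_period=60, threshold=0.5))  # ['job-a']
```

Note that job-b is spared despite its high failure score: the grace period deliberately delays action, which is exactly the tradeoff the policy is designed to tune.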
Keywords :
Big Data; computer centres; learning (artificial intelligence); pattern clustering; regression analysis; scheduling; software fault tolerance; Big Data clusters; Big-Data jobs; Google cluster traces; LDA; QDA; datacenter workload characterization; delay-based mitigation policy; failed jobs time-varying patterns; hardware failures; job execution failures; job terminations; job failure mitigation; job failure prediction; key job attributes; large-scale data centers; linear discriminant analysis; logistic regression; mitigation policies; online predictive model; performance deterioration; prediction models; quadratic discriminant analysis; resource inefficiency; resource waste; software failures; statistical learning techniques; Google; Measurement; Predictive models; Random access memory; Throughput; Time-varying systems; Training;