DocumentCode :
3549471
Title :
Probabilistic QoS guarantees for supercomputing systems
Author :
Oliner, A.J. ; Rudolph, L. ; Sahoo, R.K. ; Moreira, J.E. ; Gupta, M.
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Massachusetts Inst. of Technol., Cambridge, MA, USA
fYear :
2005
fDate :
28 June-1 July 2005
Firstpage :
634
Lastpage :
643
Abstract :
Supercomputing systems must be able to reliably and efficiently complete their assigned workloads, even in the presence of failures. This paper proposes a design in which the system and its users negotiate a mutually desirable risk strategy. To accomplish this, the system makes probabilistic guarantees on quality of service (QoS) of the form "Job j can be completed by deadline d with probability p". To make such guarantees, the system uses event prediction (forecasting) in conjunction with fault-aware job scheduling and cooperative checkpointing strategies. Using job logs and failure traces from actual high performance computing systems, we employ trace-based simulations to assess the effects of the prediction accuracy (a) and user risk strategy (U) on a variety of performance metrics. Compared to a system that does not use event prediction, a system with high forecasting accuracy achieved QoS and utilization improvements of as much as 6%, along with an 89% reduction in the amount of lost work. Our results therefore show that a system that makes probabilistic QoS guarantees using a market-based scheduling approach can increase both system performance and reliability.
Keywords :
checkpointing; fault tolerant computing; probability; quality of service; scheduling; cooperative checkpointing strategies; event prediction; fault-aware job scheduling; market-based scheduling approach; probabilistic QoS; quality of service; risk strategy; supercomputing systems; system performance; system reliability; trace-based simulation; Accuracy; Checkpointing; Computational modeling; Economic forecasting; High performance computing; Measurement; Predictive models; Processor scheduling; Quality of service; System performance;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on
Print_ISBN :
0-7695-2282-3
Type :
conf
DOI :
10.1109/DSN.2005.80
Filename :
1467837