DocumentCode :
267063
Title :
Reliability Guided Resource Allocation for Large-Scale Systems
Author :
Umamaheshwaran, Shruti ; Hacker, Thomas J.
Author_Institution :
Comput. & Inf. Technol., Purdue Univ., West Lafayette, IN, USA
fYear :
2014
fDate :
15-18 Dec. 2014
Firstpage :
334
Lastpage :
341
Abstract :
In high performance computing systems running on native hardware or cloud computing resources, parallel applications can reserve a large number of resources for long time periods. The failure of a resource triggers the failure of the applications using that resource. Our investigation of large-scale systems in the field has revealed a difference in the operational reliability of nodes. By making the scheduler aware of this difference, along with the predicted reliability needs of applications, we match reliable resources with the most demanding applications to reduce the probability of application failure. In this paper, we describe a new approach we developed to enhance reliability and reduce failure costs. Our approach partitions resources based on expected reliability and sizes each partition to bound the probability of blocking requests. It can be used to size systems for peak loads with a bounded probability of blocking requests, and would be useful for operators seeking to improve the reliability and efficiency of systems.
Keywords :
cloud computing; large-scale systems; parallel processing; probability; reliability; resource allocation; system recovery; application failure probability; cloud computing resources; high performance computing systems; operational reliability; parallel applications; reliability guided resource allocation; request blocking; resource failures; Checkpointing; Computer hacking; Equations; Fault tolerance; Fault tolerant systems; Resource management
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on
Conference_Location :
Singapore
Type :
conf
DOI :
10.1109/CloudCom.2014.63
Filename :
7037686