DocumentCode
2145538
Title
A work-stealing scheduling framework supporting fault tolerance
Author
Wang, Yizhuo ; Ji, Weixing ; Shi, Feng ; Zuo, Qi
Author_Institution
School of Computer Science and Technology, Beijing Institute of Technology, China
fYear
2013
fDate
18-22 March 2013
Firstpage
695
Lastpage
700
Abstract
Fault tolerance and load balancing are critical for executing long-running parallel applications on multicore clusters. This paper addresses both by presenting a novel work-stealing task scheduling framework that supports hardware fault tolerance. In this framework, both transient and permanent faults are detected and recovered from at task granularity. We incorporate task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme to establish the framework. The framework provides low-overhead fault tolerance and optimal load balancing by fully exploiting task parallelism.
Keywords
Checkpointing; Computer crashes; Fault tolerance; Fault tolerant systems; Multicore processing; Parallel processing; Transient analysis; cluster; fault tolerance; multicore; work-stealing
fLanguage
English
Publisher
IEEE
Conference_Titel
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013
Conference_Location
Grenoble, France
ISSN
1530-1591
Print_ISBN
978-1-4673-5071-6
Type
conf
DOI
10.7873/DATE.2013.150
Filename
6513596