• DocumentCode
    2145538
  • Title
    A work-stealing scheduling framework supporting fault tolerance
  • Author
    Wang, Yizhuo ; Ji, Weixing ; Shi, Feng ; Zuo, Qi
  • Author_Institution
    School of Computer Science and Technology, Beijing Institute of Technology, China
  • fYear
    2013
  • fDate
    18-22 March 2013
  • Firstpage
    695
  • Lastpage
    700
  • Abstract
    Fault tolerance and load balancing are critical to executing long-running parallel applications on multicore clusters. This paper addresses both by presenting a novel work-stealing task scheduling framework that supports hardware fault tolerance. In this framework, both transient and permanent faults are detected and recovered from at task granularity. We incorporate task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme to establish the framework. This framework provides low-overhead fault tolerance and optimal load balancing by fully exploiting task parallelism.
  • Keywords
    Checkpointing; Computer crashes; Fault tolerance; Fault tolerant systems; Multicore processing; Parallel processing; Transient analysis; cluster; fault tolerance; multicore; work-stealing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013
  • Conference_Location
    Grenoble, France
  • ISSN
    1530-1591
  • Print_ISBN
    978-1-4673-5071-6
  • Type
    conf
  • DOI
    10.7873/DATE.2013.150
  • Filename
    6513596