• DocumentCode
    715519
  • Title
    JENERGY: A Fault Tolerant Stateless Architecture for High Performance Computing
  • Author
    Taifi, Moussa ; Shi, Justin Y. ; Celik, Yasin
  • Author_Institution
    Comput. Sci. Dept., Temple Univ., Philadelphia, PA, USA
  • fYear
    2015
  • fDate
    March 30 2015-April 3 2015
  • Firstpage
    187
  • Lastpage
    194
  • Abstract
    Large-scale high performance computing (HPC) applications require thousands of nodes to run parallel scientific computations. At this scale, hardware and software failures, network congestion, and disconnections are frequent faults experienced by compute nodes. These faults introduce high levels of volatility, which reduces the Mean Time Between Failures (MTBF) of the whole system to hours or minutes. At failure rates of this kind, traditional point-to-point transmission semantics can be ill-suited and cumbersome to re-engineer to support distributed partial failures. In this paper, we propose an application-dependent network design that focuses on the sustainability of HPC applications using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space using Cassandra and Zookeeper for tunable spatial and temporal redundancies without negative impact on application performance. We detail the various failure scenarios that our system can handle seamlessly and describe the advantages of Stateless Parallel Processing for HPC applications. We report preliminary results on performance, reliability, and overall application scalability. We found that our system can provide high levels of sustained performance while providing a reliable computing architecture that can withstand a range of failure types without manual checkpoint-restart, in a portable and non-intrusive manner.
  • Keywords
    fault tolerant computing; packet switching; parallel processing; Cassandra; JENERGY; MTBF; Zookeeper; decoupled computations; distributed tuple space; failure rates; fault tolerant stateless architecture; hardware failure; high-performance computing; large-scale HPC; mean time between failures; network congestion; network disconnections; packet-switching-inspired statistical multiplexing; parallel scientific applications; semantic data tuples; software failure; stateless parallel processing; tunable spatial redundancies; tunable temporal redundancies; volatility levels; Computer architecture; Fault tolerant systems; High performance computing; Redundancy; Switches; Fault tolerance; Performance of systems; Sustainable extreme scale HPC architecture; scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Service-Oriented System Engineering (SOSE), 2015 IEEE Symposium on
  • Conference_Location
    San Francisco Bay, CA
  • Type
    conf
  • DOI
    10.1109/SOSE.2015.18
  • Filename
    7133528