• DocumentCode
    3420133
  • Title
    Supporting fault-tolerance in heterogeneous distributed applications
  • Author
    Maheshwari, Piyush; Ouyang, Jinsong
  • Author_Institution
    Sch. of Comput. Sci. & Eng., New South Wales Univ., Sydney, NSW, Australia
  • fYear
    1997
  • fDate
    35521
  • Firstpage
    195
  • Lastpage
    207
  • Abstract
    Heterogeneous computing opens up new challenges and opportunities in fields such as parallel and distributed processing, design of algorithms for applications, scheduling of parallel tasks, interconnection network technology and support for reliable distributed heterogeneous computing. A trend in supporting fault-tolerance in distributed computing systems is to incorporate fault-tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. We present an approach for developing efficient, reliable distributed applications for heterogeneous computing systems. We propose a library prototype, called H-Libra, to support fault-tolerance in heterogeneous systems with low run-time cost. Fault-tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level network communication protocol. By employing novel mechanisms, only minimal communication overhead is incurred in taking a consistent distributed checkpoint and catching messages in transit during a checkpoint. By providing fault-tolerance transparency and a simple, easy-to-use high-level message-passing interface, H-Libra simplifies the development of reliable heterogeneous distributed applications.
  • Keywords
    distributed processing; message passing; open systems; protocols; scheduling; software fault tolerance; software libraries; software performance evaluation; system recovery; H-Libra; algorithm design; distributed consistent checkpointing; heterogeneous distributed applications; high-level message-passing interface; interconnection network; library prototype; low cost; parallel processing; parallel task scheduling; programming; reliable distributed heterogeneous computing; rollback-recovery; run time performance; software fault-tolerance; user-level network communication protocol; Application software; Computer networks; Concurrent computing; Costs; Distributed computing; Distributed processing; Fault tolerance; Fault tolerant systems; Process design; Telecommunication network reliability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Heterogeneous Computing Workshop, 1997 (HCW '97), Proceedings, Sixth
  • Conference_Location
    Geneva
  • Print_ISBN
    0-8186-7879-8
  • Type
    conf
  • DOI
    10.1109/HCW.1997.581421
  • Filename
    581421