• DocumentCode
    2860387
  • Title

    An Application Level Approach for Proactive Process Migration in MPI Applications

  • Author

    Cores, Iván ; Rodríguez, Gabriel ; Gonzalez, P. ; Martín, María J.

  • Author_Institution
    Comput. Archit. Group, Univ. of A Coruña, A Coruña, Spain
  • fYear
    2011
  • fDate
    20-22 Oct. 2011
  • Firstpage
    400
  • Lastpage
    405
  • Abstract
    The running times of large-scale computational science and engineering parallel applications are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures so that a machine failure does not discard all completed computation. Checkpointing and rollback recovery is a widely used technique for implementing fault-tolerant applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. This reduces the efficiency of the solution, introducing an overhead that can be avoided by migrating only the affected process. Although research has been carried out in this field, the solutions proposed in the literature are commonly tied to specific implementations of the parallel communication APIs or to specific runtime environments. The approach presented in this work extends an application-level checkpointing framework to proactively migrate MPI processes away from processors when impending failures are notified, without having to restart the entire application. The main features of the proposed solution are transparency for the user, achieved through the use of a compiler tool and a runtime library, and portability, since it is not locked into a particular MPI implementation.
  • Keywords
    application program interfaces; checkpointing; failure analysis; fault tolerance; message passing; parallel processing; program compilers; MPI application; application level approach; checkpointing recovery; compiler tool; engineering parallel application; fault-tolerant application; hardware failure; large-scale computational science; machine failure; mean-time-between-failure; parallel application; parallel communication API; proactive process migration; rollback recovery; runtime library; single process migration; Checkpointing; Fault tolerance; Fault tolerant systems; Process control; Program processors; Proposals; Protocols; Checkpointing and Restart; Fault Tolerance; MPI; Process Migration;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011 12th International Conference on
  • Conference_Location
    Gwangju
  • Print_ISBN
    978-1-4577-1807-6
  • Type

    conf

  • DOI
    10.1109/PDCAT.2011.16
  • Filename
    6118549