• DocumentCode
    3026964
  • Title
    Monitoring and debugging parallel software with BCS-MPI on large-scale clusters
  • Author
    Fernández, Juan ; Petrini, Fabrizio ; Frachtenberg, Eitan
  • Author_Institution
    Departamento de Ingenieria y Tecnologia de Computadores, Murcia Univ., Spain
  • fYear
    2005
  • fDate
    4-8 April 2005
  • Abstract
    Buffered coscheduled (BCS) MPI is a novel implementation of MPI based on global synchronization of all system activities. BCS-MPI imposes a model where all processes and their communication are tightly scheduled at a very fine granularity. Thus, BCS-MPI provides a system that is much more controllable and deterministic. BCS-MPI leverages this regular behavior to provide a simple yet powerful monitoring and debugging subsystem that streamlines the analysis of parallel software. This subsystem, called the monitoring and debugging system (MDS), provides exhaustive process and communication scheduling statistics. This paper covers in detail the design and implementation of the MDS subsystem, and demonstrates how the MDS can be used to monitor and debug not only parallel MPI applications but also the BCS-MPI runtime system itself. Additionally, we show that this functionality need not incur a significant performance loss.
  • Keywords
    application program interfaces; message passing; parallel programming; program debugging; scheduling; synchronisation; system monitoring; buffered coscheduled MPI; large-scale clusters; message passing interface; parallel software; program monitoring; synchronization; Application software; Clustering algorithms; Communication system control; Concurrent computing; Control systems; Laboratories; Large-scale systems; Monitoring; Software debugging; System software
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International
  • Print_ISBN
    0-7695-2312-9
  • Type
    conf
  • DOI
    10.1109/IPDPS.2005.295
  • Filename
    1420277