• DocumentCode
    723693
  • Title

    Quiet Neighborhoods: Key to Protect Job Performance Predictability

  • Author

    Jokanovic, Ana ; Sancho, Jose Carlos ; Rodriguez, German ; Lucero, Alejandro ; Minkenberg, Cyriel ; Labarta, Jesus

  • Author_Institution
    Barcelona Supercomput. Center, Barcelona, Spain
  • fYear
    2015
  • fDate
    25-29 May 2015
  • Firstpage
    449
  • Lastpage
    459
  • Abstract
    Interference from nearby jobs has recently been identified as the dominant cause of the high performance variability of parallel applications running on High Performance Computing (HPC) systems. HPC systems are typically dynamic, with multiple jobs arriving and leaving unpredictably while simultaneously sharing the system interconnection network. In such an environment, contention for network resources causes random stalls in application execution, degrading both application and overall system performance. Eliminating job interactions within their neighbourhoods is key to guaranteeing the performance predictability of applications. In this paper we propose the concept of quiet neighbourhoods, which significantly reduce job interactions. Quiet neighbourhoods are created by the system resource manager in two phases. First, multiple virtual network blocks are defined on top of the physical network resources based on typical workload distributions. Second, newly arriving jobs are allocated to these virtual blocks based on their size.
  • Keywords
    parallel processing; resource allocation; HPC systems; high performance computing system; high performance variability; job interference; job performance predictability; network resources; parallel applications; quiet neighbourhoods concept; system interconnection network; system resource manager; virtual network blocks; workload distribution; Interference; Network topology; Resource management; Software; Supercomputers; Topology; Vegetation; Applications' interference; Infiniband; Network contention; Performance predictability; Performance reproducibility; Resource management; Virtual network topologies
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
  • Conference_Location
    Hyderabad
  • ISSN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2015.87
  • Filename
    7161533
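
The two phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `VirtualBlock`, `define_blocks`, and `allocate` are hypothetical, the workload distribution is assumed to be a simple mapping from a maximum job size to the fraction of nodes reserved for that size class, and falling through to a larger block when the matching block is full is a choice made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualBlock:
    """A group of nodes reserved for jobs of at most max_job_size nodes."""
    max_job_size: int
    free_nodes: list = field(default_factory=list)


def define_blocks(nodes, size_shares):
    """Phase 1: split the nodes into virtual blocks, one per job-size class.

    size_shares maps a class's maximum job size to the fraction of the
    machine reserved for it (an assumed stand-in for a workload distribution).
    """
    blocks, start = [], 0
    for max_size, share in sorted(size_shares.items()):
        count = int(len(nodes) * share)
        blocks.append(VirtualBlock(max_size, list(nodes[start:start + count])))
        start += count
    return blocks


def allocate(blocks, job_size):
    """Phase 2: place a job in the smallest size class that can hold it.

    Returns the granted nodes, or None if no block has enough free nodes.
    """
    for block in blocks:  # blocks are in ascending max_job_size order
        if job_size <= block.max_job_size and job_size <= len(block.free_nodes):
            granted = block.free_nodes[:job_size]
            block.free_nodes = block.free_nodes[job_size:]
            return granted
    return None


# Hypothetical 16-node machine: 25% for jobs up to 2 nodes,
# 25% for jobs up to 4 nodes, 50% for jobs up to 8 nodes.
blocks = define_blocks(list(range(16)), {2: 0.25, 4: 0.25, 8: 0.5})
small_job = allocate(blocks, 2)   # lands in the first block
medium_job = allocate(blocks, 4)  # lands in the second block
```

Grouping jobs of similar size into the same block keeps each job's traffic inside its own part of the network, which is the intuition behind the quiet-neighbourhood idea as the abstract states it.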