• DocumentCode
    3469840
  • Title
    With great reliability comes great responsibility: tradeoffs of run-time policy on high reliability systems
  • Author
    Kleban, Stephen D. ; Johnston, Jeanette R. ; Ang, James A. ; Clearwater, S.H.
  • Author_Institution
    Sandia Nat. Labs., Albuquerque, NM, USA
  • fYear
    2004
  • fDate
    19-22 April 2004
  • Firstpage
    547
  • Lastpage
    554
  • Abstract
    In this paper we describe a simulation study to improve performance on a large, highly utilized cluster at Sandia National Laboratories. The unique characteristic of the cluster is that there are very few constraints on job size. In particular, the run-time is limited only by system times, which occur about every two weeks. The major contribution of this paper is that we quantify the difference in makespan between running a single long job and running its equivalent as many shorter jobs. We find that running longer jobs is beneficial to the facility as a whole when the cycle-weighted makespans are considered, and that running shorter jobs has an overall beneficial effect on the makespan for the jobs taken unweighted and for most users.
  • Keywords
    computer network reliability; performance evaluation; scheduling; workstation clusters; Sandia National Laboratories; cycle-weighted makespans; high reliability systems; highly utilized cluster; job size; performance; run-time policy; simulation study; Computerized monitoring; Delay; Failure analysis; Laboratories; Occupational stress; Performance analysis; Productivity; Runtime; Scalability; Supercomputers
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2004. CCGrid 2004. IEEE International Symposium on
  • Print_ISBN
    0-7803-8430-X
  • Type
    conf
  • DOI
    10.1109/CCGrid.2004.1336653
  • Filename
    1336653