Abstract :
For years, the vast majority of users at the DoD HPCMP MSRCs and DCs were well served by batch job queuing systems, but a small minority, whose computing needs did not mesh well with the batch queuing paradigm, kept asking for interactive access to the HPC machines. To address this need, in May 2005 the 256-processor ARL MSRC Linux Networx cluster, Powell, was transitioned to interactive use. Since no large HPCMP machine had operated this way before, a new approach was needed to regulate interactive usage in an automated fashion. In response, the ARL MSRC developed a Web-based reservation tool that lets users make advance reservations for nodes on Powell. This type of reservation system is new and quite different from any other queuing system model currently in use within the HPCMP. With a batch queuing system alone, users submit jobs with no predictable way of knowing when the jobs will actually run. Because a reservation system provides a means to reserve a specific number of processors at a specified time, users can easily plan for demonstrations, exercises, or time-sensitive calculations. The reservation tool works in combination with the grid engine queuing system, so users have all the conveniences of a queuing system, provided they have made a reservation on the Web to acquire nodes. Using the queuing system has several benefits, such as starting the job, or jobs, when the reservation starts, generating the machine file for MPI jobs, and providing accounting services. Development of the reservation system has been a learning process for all involved. The objective of this paper is to provide a technical overview of this HPC reservation system and to discuss how the system may be used to meet the demands of various types of research.
The development cycle of the reservation system will be discussed, with a timeline showing when and why various features were added. The operational policies governing the system, and the rationale for those policies, will also be discussed. The data collected from the system, which has been in production for a full year, will be analyzed to show various metrics related to utilization. This data will help to gauge the success of the program thus far.
Keywords :
Internet; Linux; batch processing (computers); grid computing; interactive systems; military computing; reservation computer systems; Department of Defense; High Performance Computing Modernization Program; Linux Networx cluster; Major Shared Resource Center; Powell; Web-based HPC reservation system; batch job queuing systems; grid engine queuing system; Availability; Computer architecture; Computer interfaces; Databases; Engines; High performance computing; Laboratories; Production systems