DocumentCode
1933641
Title
Monitoring High Performance Networks in Large-scale Clusters
Author
Gadaud, Fabrice
Author_Institution
Commissariat à l'Énergie Atomique, Paris
Volume
2
fYear
2006
fDate
16-19 May 2006
Firstpage
32
Lastpage
32
Abstract
The number of large-scale clusters is rising. They are integrated into grids or become key components of larger structures. As more users and projects rely on HPC clusters, high availability and security become requirements for their fast-growing adoption and use. In this paper, we focus on high performance networks, on which all HPC clusters are built. We demonstrate that classical instrumentation is inefficient in HPC environments: it either does not scale or causes a significant loss of performance. Based on this observation, we highlight cluster properties: nodes have assigned roles and are coupled at various levels. Moreover, we study the main characteristics of resource usage for each type of node and propose an instrumentation that can be effectively deployed. The result is a set of fine-grained mechanisms adapted to the system architecture and its performance constraints. Relevant information is collected over time. Two properties are verified online and dynamically: coherency and containment. Each induces a type of verification, and both aim at reducing the recovery time after a failure and the security risk to the whole cluster. We illustrate our methodology on the QsNet network (K. Magontis et al., 2001) and provide a way to increase the safety of high performance networks and clusters.
Keywords
program verification; resource allocation; security of data; system recovery; telecommunication security; workstation clusters; HPC clusters; availability requirements; coherency verification; containment verification; failure risk; high performance networks; large-scale clusters; model checking; performance constraints; recovery time reduction; resource usage; security requirements; security risk; system architecture; Computer industry; Condition monitoring; Grid computing; High performance computing; Instruments; Intelligent networks; Kernel; Large-scale systems; Message passing; Safety
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE International Symposium on
Conference_Location
Singapore
Print_ISBN
0-7695-2585-7
Type
conf
DOI
10.1109/CCGRID.2006.1630927
Filename
1630927