Title :
Identifying Failures in Grids through Monitoring and Ranking
Author :
Zeinalipour-Yazti, Demetrios ; Neocleous, Kyriacos ; Georgiou, Chryssis ; Dikaiakos, Marios D.
Author_Institution :
Pure & Appl. Sci., Open Univ. of Cyprus, Nicosia
Abstract :
In this paper we present FailRank, a novel framework for integrating and ranking information sources that characterize failures in a Grid system. Once the failing sites have been ranked, they can be eliminated from the job scheduling resource pool, yielding a more predictable, dependable and adaptive infrastructure. We also present the tools we developed to evaluate the FailRank framework. In particular, we present the FailBase Repository, a 38 GB corpus of state information that characterizes the EGEE Grid for one month in 2007. Such a corpus paves the way for the community to systematically uncover previously unknown patterns and rules among the multitude of parameters that can contribute to failures in a Grid environment. Additionally, we present an experimental evaluation of the FailRank system over 30 days, which shows that our framework identifies failures in 93% of the cases. We believe that our work constitutes another important step towards realizing adaptive Grid computing systems.
Keywords :
grid computing; scheduling; system recovery; FailBase Repository; FailRank system; failures identification; grid computing systems; information sources; job scheduling resource pool; Application software; Computer applications; Computer networks; Computer science; Computerized monitoring; Condition monitoring; Feedback; Grid computing; Logic; Remote monitoring; Dependability; Grid Computing; Top-k Ranking
Conference_Title :
Network Computing and Applications, 2008. NCA '08. Seventh IEEE International Symposium on
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-3192-2
Electronic_ISBN :
978-0-7695-3192-2
DOI :
10.1109/NCA.2008.10