• DocumentCode
    2197149
  • Title
    Error scope on a computational grid: theory and practice
  • Author
    Thain, Douglas ; Livny, Miron
  • Author_Institution
    Dept. of Comput. Sci., Wisconsin Univ., Madison, WI, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    199
  • Lastpage
    208
  • Abstract
    Error propagation is a central problem in grid computing. We re-learned this while adding a Java feature to the Condor computational grid. Our initial experience with the system was negative, due to the large number of new ways in which the system could fail. To reason about this problem, we developed a theory of error propagation. Central to our theory is the concept of an error's scope, defined as the portion of a system that it invalidates. With this theory in hand, we recognized that the expanded system did not properly consider the scope of errors it discovered. We modified the system according to our theory, and succeeded in making it a more robust platform for distributed computing.
  • Keywords
    distributed processing; error analysis; object-oriented programming; software architecture; software fault tolerance; Condor distributed batch system; Java programs; computational grid; distributed architecture; distributed computing; error propagation; fault-tolerance; software engineering; Computer architecture; Computer errors; Distributed computing; Fault tolerance; Grid computing; Inspection; Java; Robustness; Secure storage; Virtual machining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Distributed Computing, 2002. HPDC-11 2002. Proceedings. 11th IEEE International Symposium on
  • ISSN
    1082-8907
  • Print_ISBN
    0-7695-1686-6
  • Type
    conf
  • DOI
    10.1109/HPDC.2002.1029919
  • Filename
    1029919