  • DocumentCode
    2976987
  • Title
    Computational resiliency for distributed applications
  • Author
    McGill, Kathleen; Taylor, Stephen
  • Author_Institution
    Thayer Sch. of Eng., Dartmouth Coll., Hanover, NH, USA
  • fYear
    2011
  • fDate
    7-10 Nov. 2011
  • Firstpage
    1472
  • Lastpage
    1479
  • Abstract
    In recent years, computer network attacks have decreased the overall reliability of computer systems and undermined confidence in mission-critical software. These robustness issues are magnified in distributed applications, which present multiple points of failure and attack. Resiliency is concerned with constructing applications that can operate through a wide variety of failures, errors, and malicious attacks. A number of approaches proposed in the literature are based on fault tolerance achieved through replication of resources. In general, these approaches provide graceful degradation of performance to the point of failure but do not guarantee progress in the presence of multiple cascading and recurrent failures. Our approach is to dynamically replicate message-passing processes, detect inconsistencies in their behavior, and restore the level of fault tolerance as a computation proceeds. This paper describes a novel operating system technology for resilient message-passing applications that is automated, scalable, and transparent. The technology provides mechanisms for process replication, process migration, and adaptive failure detection. To quantify the performance overhead of the technology, we benchmark a distributed application exemplar that represents a broader class of applications.
  • Keywords
    computer network security; fault tolerant computing; message passing; operating systems (computers); safety-critical software; cascading failures; computational resiliency; computer network attacks; computer system reliability; distributed applications; fault tolerance; malicious attacks; mission-critical software; operating system technology; recurrent failures; resilient message-passing applications; Delay; Fault tolerance; Fault tolerant systems; Kernel; Libraries; Protocols; Sockets; distributed systems; failure detection; mission-assurance; process migration; process replication; resiliency;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    MILITARY COMMUNICATIONS CONFERENCE, 2011 - MILCOM 2011
  • Conference_Location
    Baltimore, MD
  • ISSN
    2155-7578
  • Print_ISBN
    978-1-4673-0079-7
  • Type
    conf
  • DOI
    10.1109/MILCOM.2011.6127514
  • Filename
    6127514