• DocumentCode
    2283344
  • Title
    Application-Level Correctness and its Impact on Fault Tolerance
  • Author
    Li, Xuanhua ; Yeung, Donald
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Maryland Univ., College Park, MD
  • fYear
    2007
  • fDate
    10-14 Feb. 2007
  • Firstpage
    181
  • Lastpage
    192
  • Abstract
    Traditionally, fault tolerance researchers have required architectural state to be numerically perfect for program execution to be correct. However, in many programs, even if execution is not 100% numerically correct, the program can still appear to execute correctly from the user's perspective. Hence, whether a fault is unacceptable or benign may depend on the level of abstraction at which correctness is evaluated, with more faults being benign at higher levels of abstraction, i.e. at the user or application level, compared to lower levels of abstraction, i.e. at the architecture level. The extent to which programs are more fault resilient at higher levels of abstraction is application dependent. Programs that produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and we find they are common in multimedia workloads, as well as artificial intelligence (AI) workloads. Programs that compute exact numerical outputs offer less error resilience at the application level. However, we find all programs studied in this paper exhibit some enhanced fault resilience at the application level, including those that are traditionally considered exact computations - e.g., SPECInt CPU2000. This paper investigates definitions of program correctness that view correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, a program's execution is deemed correct as long as the result it produces is acceptable to the user. To quantify user satisfaction, we rely on application-level fidelity metrics that capture user-perceived program solution quality. We conduct a detailed fault susceptibility study that measures how much more fault resilient programs are when defining correctness at the application level compared to the architecture level.
    Our results show for 6 multimedia and AI benchmarks that 45.8% of architecturally incorrect faults are correct at the application level. For 3 SPECInt CPU2000 benchmarks, 17.6% of architecturally incorrect faults are correct at the application level. We also present a lightweight fault recovery mechanism that exploits the relaxed requirements on numerical integrity provided by application-level correctness to reduce checkpoint cost. Our lightweight fault recovery mechanism successfully recovers 66.3% of program crashes in our multimedia and AI workloads, while incurring minimum runtime overhead.
  • Keywords
    program verification; software architecture; software fault tolerance; software metrics; application-level correctness; application-level fidelity metrics; artificial intelligence workloads; enhanced fault resilience; error resilience; fault recovery mechanism; fault tolerance; multimedia workloads; program correctness; programs soft computation; user satisfaction; user-perceived program solution quality; Artificial intelligence; Computer architecture; Costs; Educational institutions; Engineering profession; Error correction; Fault tolerance; Government; Hardware; Resilience;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on
  • Conference_Location
    Scottsdale, AZ
  • Print_ISBN
    1-4244-0805-9
  • Electronic_ISBN
    1-4244-0805-9
  • Type
    conf
  • DOI
    10.1109/HPCA.2007.346196
  • Filename
    4147659