• DocumentCode
    1914865
  • Title

    Improving Fault Tolerance and Accuracy of a Distributed Reduction Algorithm

  • Author

    Niederbrucker, Gerhard ; Strakova, H. ; Gansterer, Wilfried N.

  • Author_Institution
    Austria Res. Group Theor. & Applic. of Algorithms, Univ. of Vienna, Vienna, Austria
  • fYear
    2012
  • fDate
    10-16 Nov. 2012
  • Firstpage
    643
  • Lastpage
    651
  • Abstract
    Most existing algorithms for parallel or distributed reduction operations are not able to handle temporary or permanent link and node failures. Only recently, methods were proposed which are in principal capable of tolerating link and node failures as well as soft errors like bit flips or message loss. A particularly interesting example is the pushflow algorithm. However, on closer inspection, it turns out that in this method the failure recovery often implies severe performance drawbacks. Existing mechanisms for failure handling may basically lead to a fall-back to an early stage of the computation and consequently slow down convergence or even prevent convergence if failures occur too frequently. Moreover, state-of-the-art fault tolerant distributed reduction algorithms may experience accuracy problems even in failure free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the new push-cancel-flow algorithm exhibits superior accuracy, performance and fault tolerance over all other existing distributed reduction methods. Moreover, we employ the novel PCF algorithm in the context of a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level directly translate to higher level matrix operations, such as the considered QR factorization.
  • Keywords
    mathematics computing; matrix decomposition; parallel algorithms; software fault tolerance; PCF algorithm; bit flip; distributed QR factorization process; distributed reduction algorithm; fault tolerance; link failure tolerance; matrix operation; message loss; node failure tolerance; parallel reduction algorithm; push-cancel-flow algorithm; pushflow algorithm; push-cancel-flow algorithm; push-flow algorithm; push-sum algorithm;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:
  • Conference_Location
    Salt Lake City, UT
  • Print_ISBN
    978-1-4673-6218-4
  • Type

    conf

  • DOI
    10.1109/SC.Companion.2012.89
  • Filename
    6495871