  • DocumentCode
    85808
  • Title

    Systematic Debugging Methods for Large-Scale HPC Computational Frameworks

  • Author

    Humphrey, Alan ; Meng, Qingyu ; Berzins, Martin ; Caminha B. de Oliveira, Diego ; Rakamaric, Zvonimir ; Gopalakrishnan, Ganesh

  • Author_Institution
    Univ. of Utah, Salt Lake City, UT, USA
  • Volume
    16
  • Issue
    3
  • fYear
    2014
  • fDate
    May-June 2014
  • Firstpage
    48
  • Lastpage
    56
  • Abstract
    Parallel computational frameworks for high-performance computing are central to the advancement of simulation-based studies in science and engineering. Unfortunately, finding and fixing bugs in these frameworks can be extremely time-consuming. Left unchecked, these bugs can drastically diminish the amount of new science that can be performed. This article presents a systematic study of the Uintah Computational Framework and approaches to debugging it more incisively. A key insight is to leverage the modular structure of Uintah, which lends itself to systematic debugging. In particular, the authors have developed a new approach based on coalesced stack trace graphs (CSTGs) that summarize system behavior in terms of key control flows manifested through function invocation chains. They illustrate several scenarios in which CSTGs could help localize bugs efficiently, and present a case study of how they found and fixed a real Uintah bug using CSTGs.
  • Keywords
    graph theory; parallel programming; program debugging; CSTG; Uintah bug; Uintah computational framework; bugs fixing; bugs localization; coalesced stack trace graphs; engineering; function invocation chains; high-performance computing; large-scale HPC computational frameworks; modular structure; parallel computational frameworks; science; simulation-based studies; system behavior; systematic debugging methods; Computational modeling; Computer bugs; Debugging; Runtime; Scientific computing; Software development; Systematics; computational modeling and frameworks; debugging aids; reliability
  • fLanguage
    English
  • Journal_Title
    Computing in Science & Engineering
  • Publisher
    IEEE
  • ISSN
    1521-9615
  • Type

    jour

  • DOI
    10.1109/MCSE.2014.11
  • Filename
    6729885
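
Illustrative note on the abstract: the idea of coalescing stack traces into a graph of function invocation chains can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. The function names `coalesce` and `edge_delta`, the trace format (lists of function names, outermost frame first), and the idea of comparing two graphs edge by edge are all assumptions made here for illustration.

```python
from collections import Counter


def coalesce(traces):
    """Merge stack traces into caller -> callee edge counts.

    Hypothetical helper: each trace is assumed to be a list of function
    names ordered from the outermost frame to the innermost one.
    """
    edges = Counter()
    for trace in traces:
        # Every adjacent pair of frames is one caller -> callee edge.
        for caller, callee in zip(trace, trace[1:]):
            edges[(caller, callee)] += 1
    return edges


def edge_delta(baseline, suspect):
    """Return edges whose counts differ between two coalesced graphs.

    Hypothetical helper: maps each differing edge to a pair
    (count in baseline, count in suspect).
    """
    keys = set(baseline) | set(suspect)
    return {k: (baseline[k], suspect[k]) for k in keys if baseline[k] != suspect[k]}


# Made-up traces standing in for a working run and a suspect run.
good = coalesce([["main", "run", "send"], ["main", "run", "recv"]])
bad = coalesce([["main", "run", "send"], ["main", "run", "send"]])
delta = edge_delta(good, bad)
```

In this toy input, the edge `run -> recv` appears only in the first graph, which is the kind of difference in invocation chains that might point toward where two runs diverge.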