• DocumentCode
    228746
  • Title

    NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing

  • Author

    Zhengzhang Chen; Seung Woo Son; William Hendrix; Ankit Agrawal; Wei-keng Liao; Alok Choudhary

  • Author_Institution
    Electr. Eng. & Comput. Sci. Dept., Northwestern Univ., Evanston, IL, USA
  • fYear
    2014
  • fDate
    16-21 Nov. 2014
  • Firstpage
    733
  • Lastpage
    744
  • Abstract
    Data checkpointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As HPC systems move towards exascale, the storage space and time costs of checkpointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data, in which high-precision values are used and common patterns are therefore rare. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next do not differ significantly from each other. Thus, capturing the distribution of relative changes in the data, instead of storing the data itself, allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order-of-magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, Northwestern University Machine learning Algorithm for Resiliency and Checkpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per-point basis while compressing the data by an order of magnitude.
  • Keywords
    checkpointing; data analysis; iterative methods; learning (artificial intelligence); parallel processing; software fault tolerance; HPC system; NUMARCK; Northwestern University machine learning algorithm for resiliency and checkpointing; data analysis; fault tolerance technique; high performance computing; simulation iteration; Approximation algorithms; Approximation methods; Checkpointing; Computational modeling; Data models; Error analysis; Machine learning algorithms
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • Conference_Location
    New Orleans, LA
  • Print_ISBN
    978-1-4799-5499-5
  • Type

    conf

  • DOI
    10.1109/SC.2014.65
  • Filename
    7013047
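
The idea described in the abstract (storing an index for each point's relative change between two iterations instead of the value itself, with a per-point error bound) can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed-width binning, the function names `encode` and `decode`, the `index_bits` parameter, and the choice to bound the error on the change ratio are all assumptions made for illustration. Points whose change cannot be represented by an index are stored exactly.

```python
from typing import Dict, List, Tuple


def encode(prev: List[float], cur: List[float], err: float,
           index_bits: int = 8) -> Tuple[List[int], Dict[int, float]]:
    """Encode `cur` as per-point bin indices of its relative change from `prev`.

    Assumption: bins have fixed width 2*err, so a bin centre is within
    roughly `err` of any change ratio that falls inside the bin.
    """
    width = 2.0 * err
    half = (1 << index_bits) // 2      # usable indices lie in [-half, half - 1]
    indices: List[int] = []
    exact: Dict[int, float] = {}       # points stored without compression
    for i, (p, c) in enumerate(zip(prev, cur)):
        if p == 0.0:
            # Relative change is undefined; keep the exact value.
            exact[i] = c
            indices.append(0)          # placeholder, ignored by decode
            continue
        ratio = (c - p) / p
        k = round(ratio / width)       # nearest bin centre is k * width
        if -half <= k < half:
            indices.append(k)
        else:
            # Change too large for the index space; keep the exact value.
            exact[i] = c
            indices.append(0)          # placeholder, ignored by decode
    return indices, exact


def decode(prev: List[float], indices: List[int],
           exact: Dict[int, float], err: float) -> List[float]:
    """Rebuild an approximation of the current iteration from `prev`."""
    width = 2.0 * err
    out: List[float] = []
    for i, (p, k) in enumerate(zip(prev, indices)):
        if i in exact:
            out.append(exact[i])
        else:
            out.append(p * (1.0 + k * width))
    return out


if __name__ == "__main__":
    prev = [100.0, 200.0, 0.0, 50.0]
    cur = [101.0, 199.0, 3.0, 500.0]
    indices, exact = encode(prev, cur, err=0.001)
    print(indices, exact)
    print(decode(prev, indices, exact, err=0.001))
```

In this sketch each compressible point costs `index_bits` bits instead of a full floating-point value, which is where the size reduction would come from when most changes are small; how the bins are actually chosen in NUMARCK is described in the paper itself, not here.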