DocumentCode
228746
Title
NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing
Author
Zhengzhang Chen ; Seung Woo Son ; Hendrix, William ; Agrawal, Ankit ; Wei-keng Liao ; Choudhary, Alok
Author_Institution
Electr. Eng. & Comput. Sci. Dept., Northwestern Univ., Evanston, IL, USA
fYear
2014
fDate
16-21 Nov. 2014
Firstpage
733
Lastpage
744
Abstract
Data checkpointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As HPC systems move towards exascale, the storage space and time costs of checkpointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data, in which high-precision values are used and common patterns are therefore rare. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next do not differ significantly from one another. Thus, capturing the distribution of relative changes in data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order of magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, Northwestern University Machine learning Algorithm for Resiliency and Checkpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per-point basis, while compressing the data by an order of magnitude.
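The idea described in the abstract (storing a compact index for each point's relative change between iterations, within a user-defined per-point error bound) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the index space is built, so the equal-width binning below is an assumption, and the names `compress`, `decompress`, `err_bound`, and `index_bits` are hypothetical. Points whose change cannot be represented (previous value of zero, or a change outside the index range) are assumed to be stored exactly.

```python
def compress(prev, cur, err_bound, index_bits=8):
    """Encode cur relative to prev as small bin indices plus exact exceptions.

    Assumption: equal-width bins of width 2 * err_bound, so each bin center
    lies within err_bound of every change ratio that falls in the bin.
    """
    width = 2.0 * err_bound
    max_index = (1 << index_bits) - 1
    half = max_index // 2  # offset so negative ratios map to non-negative indices
    indices, exact = [], {}
    for i, (p, c) in enumerate(zip(prev, cur)):
        if p == 0:
            # Relative change is undefined; keep the value as-is.
            exact[i] = c
            indices.append(0)
            continue
        ratio = (c - p) / p
        k = round(ratio / width) + half
        # Accept the index only if it fits and the bound verifiably holds.
        if 0 <= k <= max_index and abs((k - half) * width - ratio) <= err_bound:
            indices.append(k)
        else:
            exact[i] = c
            indices.append(0)
    return indices, exact


def decompress(prev, indices, exact, err_bound, index_bits=8):
    """Rebuild the current iteration from the previous one and the indices."""
    width = 2.0 * err_bound
    half = ((1 << index_bits) - 1) // 2
    out = []
    for i, (p, k) in enumerate(zip(prev, indices)):
        if i in exact:
            out.append(exact[i])
        else:
            out.append(p * (1.0 + (k - half) * width))
    return out


prev = [100.0, 200.0, 0.0, 50.0]
cur = [100.4, 198.8, 3.0, 500.0]
indices, exact = compress(prev, cur, err_bound=0.001)
restored = decompress(prev, indices, exact, err_bound=0.001)
```

In this sketch each representable point costs `index_bits` bits instead of a full floating-point value, and the reconstructed change ratio of every such point differs from the true ratio by at most `err_bound`. The last two points in the usage example fall back to exact storage.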
Keywords
checkpointing; data analysis; iterative methods; learning (artificial intelligence); parallel processing; software fault tolerance; HPC system; NUMARCK; Northwestern University machine learning algorithm for resiliency and checkpointing; data analysis; fault tolerance technique; high performance computing; simulation iteration; Approximation algorithms; Approximation methods; Checkpointing; Computational modeling; Data models; Error analysis; Machine learning algorithms;
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
Conference_Location
New Orleans, LA
Print_ISBN
978-1-4799-5499-5
Type
conf
DOI
10.1109/SC.2014.65
Filename
7013047