Title :
Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems
Author :
Liu, Yongpeng ; Zhu, Hong ; Liu, Yongyan ; Wang, Feng ; Fan, Baohua
Author_Institution :
Sch. of Comput. Sci., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
Checkpointing is an effective fault-tolerance technique for improving the reliability of large-scale parallel computing systems. However, checkpointing causes a large number of compute nodes to store a huge amount of data into the file system simultaneously. This not only requires a huge amount of storage space for the system state, but also puts tremendous pressure on the communication network and I/O subsystem, because a massive number of accesses is concentrated in a short period of time. Data compression can reduce the size of the checkpoint data that must be saved in the file system and sent through the communication network. However, compression incurs a large time overhead, especially in large-scale parallel systems, which is the main technical barrier to its practical use. In this paper, we propose a parallel compression checkpointing technique to reduce this time overhead in socket-level heterogeneous architectures. It integrates a number of parallel processing techniques, including transferring checkpoint data between the CPU, GPU, and file system through double-buffered pipelines, aggregating file write operations, and running a SIMD parallel compression algorithm on the GPU. The paper also reports an implementation of the technique on the Tianhe-1 supercomputer system and evaluation experiments on that system. The experimental data show that the technique is efficient and practically usable.
Keywords :
checkpointing; computer graphic equipment; coprocessors; data compression; mainframes; parallel processing; CPU; GPU; SIMD parallel compression algorithm; Tianhe-1 supercomputer system; central processing unit; data compression; fault tolerant technique; file system; graphics processing unit; parallel compression checkpointing; parallel computing system; single instruction multiple data; socket-level heterogeneous system; Checkpointing; Communication networks; Computer architecture; Data compression; Graphics processing unit; Pipeline processing; Checkpoint and restart; Data compression; GPU; Pipeline; SIMD parallelism; Socket-level heterogeneous architecture;
Conference_Title :
High Performance Computing and Communications (HPCC), 2011 IEEE 13th International Conference on
Conference_Location :
Banff, AB
Print_ISBN :
978-1-4577-1564-8
Electronic_ISBN :
978-0-7695-4538-7
DOI :
10.1109/HPCC.2011.68