Title :
Replication-Based Fault-Tolerance for Large-Scale Graph Processing
Author :
Peng Wang ; Kaiyuan Zhang ; Rong Chen ; Haibo Chen ; Haibing Guan
Author_Institution :
Inst. of Parallel & Distrib. Syst., Shanghai Jiao Tong Univ., Shanghai, China
Abstract :
Increasing algorithm complexity and dataset sizes necessitate the use of networked machines for many graph-parallel algorithms, and the growing number of machines in turn makes fault tolerance a must. Unfortunately, existing large-scale graph-parallel systems usually adopt a distributed checkpoint mechanism for fault tolerance, which incurs not only notable performance overhead but also lengthy recovery time. This paper observes that the vertex replicas created for distributed graph computation can be naturally extended for fast in-memory recovery of graph states. It proposes Imitator, a new fault tolerance mechanism that supports cheap maintenance of vertex states by replicating them to vertex replicas during normal message exchanges, and provides fast in-memory reconstruction of failed vertices from replicas on other machines. Imitator has been implemented by extending Hama, a popular open-source clone of Pregel. Evaluation shows that Imitator incurs negligible performance overhead (less than 5% in all cases) and can recover from failures of more than one million vertices in less than 3.4 seconds.
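The replication idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not Imitator's or Hama's code: the `Machine` class, the `placement` map, and the functions `sync_replicas` and `recover` are hypothetical names introduced here, and promoting a surviving replica to master is only one possible recovery choice, assumed for illustration. The sketch assumes every master vertex has at least one replica on another machine.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Hypothetical worker holding master vertices and replicas of remote ones."""
    name: str
    masters: dict = field(default_factory=dict)   # vertex id -> state
    replicas: dict = field(default_factory=dict)  # vertex id -> copied state


def sync_replicas(machines, placement):
    """Copy each master's state to its replicas.

    Stands in for piggybacking state on the normal message exchange
    at the end of a superstep (an assumption of this sketch).
    """
    for machine in machines.values():
        for vid, state in machine.masters.items():
            for replica_host in placement[vid]:
                machines[replica_host].replicas[vid] = state


def recover(machines, failed, placement):
    """Rebuild the failed machine's master vertices from surviving replicas."""
    lost = machines.pop(failed)
    new_homes = {}
    for vid in lost.masters:
        survivors = [host for host in placement[vid] if host in machines]
        if not survivors:
            raise RuntimeError(f"vertex {vid} has no surviving replica")
        host = survivors[0]
        # Promote the replica on the surviving host to be the new master.
        machines[host].masters[vid] = machines[host].replicas.pop(vid)
        placement[vid] = [h for h in survivors if h != host]
        new_homes[vid] = host
    return new_homes


# Toy cluster: three machines, three vertices, replica hosts per vertex.
machines = {
    "A": Machine("A", masters={1: 0.5, 2: 0.25}),
    "B": Machine("B", masters={3: 0.25}),
    "C": Machine("C"),
}
placement = {1: ["B"], 2: ["C"], 3: ["A", "C"]}

sync_replicas(machines, placement)
new_homes = recover(machines, "A", placement)
```

In this toy run, the two vertices mastered on the failed machine `A` are reconstructed in memory on `B` and `C` without reading any checkpoint, which is the property the abstract attributes to replication-based recovery.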
Keywords :
fault tolerant computing; parallel algorithms; system recovery; Hama; Imitator; algorithm complexity; distributed checkpoint mechanism; distributed graph computation; failure recovery; graph state in-memory recovery; graph-parallel algorithms; large-scale graph processing; large-scale graph-parallel systems; networked machines; open-source Pregel clone; replication-based fault-tolerance; vertex replicas; vertex state maintenance; vertex state replication; Checkpointing; Clustering algorithms; Computational modeling; Computer crashes; Fault tolerance; Fault tolerant systems; Synchronization; fault-tolerance; graph-parallel system;
Conference_Title :
Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
Conference_Location :
Atlanta, GA
DOI :
10.1109/DSN.2014.58