DocumentCode
659417
Title
Feliss: Flexible distributed computing framework with light-weight checkpointing
Author
Araki, Takeshi ; Narita, Kazuyo ; Tamano, Hiroshi
Author_Institution
Cloud Syst. Res. Labs., NEC Corp., Japan
fYear
2013
fDate
6-9 Oct. 2013
Firstpage
143
Lastpage
149
Abstract
Current distributed computing frameworks, such as MapReduce and Spark, restrict programmers to a limited set of operations defined by the framework. Because of this restriction, algorithms that do not fit the framework cannot be expressed efficiently. The restriction arises from the need for fault tolerance: these frameworks recover lost data by recomputing it from available data when a fault occurs, and for this mechanism to work correctly, only operations provided by the system can be used. Checkpointing is an alternative fault-tolerance method. Because it achieves fault tolerance by saving memory contents, it places no such limitation on operations; however, the cost of saving a memory image is high. To overcome this trade-off, we propose a light-weight checkpointing method called continuation-based checkpointing, which enables low-overhead fault tolerance without restricting operations. It saves only the information necessary for restarting, which significantly reduces the cost of checkpointing. Using this method, we implemented a distributed computing framework called Feliss, which includes an improved MapReduce without the above restriction and a subset of the Message Passing Interface (MPI). We evaluated Feliss with various applications and showed that an order-of-magnitude speedup can be attained for applications that cannot be expressed efficiently with current frameworks.
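The idea of saving only what is needed to restart, rather than a full memory image, can be illustrated with a minimal sketch. This is not the Feliss API, which the abstract does not describe; the function `run`, its `fail_at` parameter, and the JSON checkpoint layout are all hypothetical names chosen for illustration. The sketch assumes the "continuation" of a simple loop is just the next index and the running total, so large temporary data never has to be saved.

```python
import json

def run(data, checkpoint=None, fail_at=None):
    """Sum the squares of `data`, resuming from `checkpoint` if one is given.

    Hypothetical illustration only: the checkpoint holds the loop's
    continuation (next index and running total), not a memory image.
    """
    state = json.loads(checkpoint) if checkpoint else {"next": 0, "total": 0}
    saved = checkpoint
    for i in range(state["next"], len(data)):
        if fail_at is not None and i == fail_at:
            # Simulated fault: only the last saved checkpoint survives.
            return saved, None
        scratch = [data[i]] * 1000      # large temporary, never checkpointed
        state["total"] += scratch[0] ** 2
        state["next"] = i + 1
        saved = json.dumps(state)       # small checkpoint: two integers
    return saved, state["total"]

data = [1, 2, 3, 4]
ckpt, result = run(data, fail_at=2)         # fault before index 2 is processed
assert result is None
ckpt, result = run(data, checkpoint=ckpt)   # restart from the saved continuation
print(result)
```

The restarted run skips the work already covered by the checkpoint and continues at index 2. A real framework would also have to capture where in an arbitrary program execution should resume, which this loop-only sketch sidesteps.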
Keywords
fault tolerant computing; message passing; Feliss; MPI subset; MapReduce; Spark; continuation-based checkpointing; fault-tolerance method; flexible distributed computing framework; lightweight checkpointing method; memory contents; memory image; message passing interface subset; Checkpointing; Data structures; Fault tolerance; Fault tolerant systems; Libraries; Servers; Sparks;
fLanguage
English
Publisher
IEEE
Conference_Titel
Big Data, 2013 IEEE International Conference on
Conference_Location
Silicon Valley, CA
Type
conf
DOI
10.1109/BigData.2013.6691566
Filename
6691566