• DocumentCode
    3525199
  • Title

    Modeling and Designing Fault-Tolerance Mechanisms for MPI-Based MapReduce Data Computing Framework

  • Author

    Jian Lin ; Fan Liang ; Xiaoyi Lu ; Li Zha ; Zhiwei Xu

  • Author_Institution
    Inst. of Comput. Technol., Beijing, China
  • fYear
    2015
  • fDate
    March 30, 2015 - April 2, 2015
  • Firstpage
    176
  • Lastpage
    183
  • Abstract
    Fault-tolerance is a significant property for distributed and parallel computing systems. An emerging trend in Big Data computing is to combine MPI and MapReduce technologies in a single framework. The distinctive state model of this kind of framework makes it challenging to design an efficient and transparent fault-tolerance mechanism. In this paper, a state model analysis method is proposed for uniformly modeling independent MPI, MapReduce, and MPI-based MapReduce data computing frameworks. Based on this analysis, a library-level fault-tolerance mechanism with a global persistent state model is proposed, and a checkpoint approach based on data staging and routine sharing is designed within this mechanism. The proposed mechanism has been implemented in DataMPI, a communication library supporting MPI-based MapReduce data computing applications. The experiments show that it can transparently enable fault-tolerance for applications. Taking TeraSort as an example, it introduces only 6.8% time overhead and 11% space overhead. For a failure-resume execution, it has a 10%-32% performance advantage compared with naive checkpoint solutions based on local or parallel storage. The proposed mechanism also provides superior performance and resource utilization compared with Hadoop for both fault-free and failure-resume executions.
  • Keywords
    Big Data; application program interfaces; message passing; parallel processing; software fault tolerance; Big Data computing; DataMPI communication library; Hadoop; MPI-based MapReduce data computing framework; data-staging based checkpoint approach; distinctive state model; distributed computing systems; failure-resume execution; global persistent state model; library-level fault-tolerance mechanism; naive checkpoint solutions; parallel computing systems; parallel storages; routine-sharing based checkpoint approach; state model analysis method; Analytical models; Business; Computational modeling; Data models; Fault tolerance; Fault tolerant systems; Synchronization; MPI; MapReduce; checkpoint; data computing; fault-tolerance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on
  • Conference_Location
    Redwood City, CA
  • Type

    conf

  • DOI
    10.1109/BigDataService.2015.33
  • Filename
    7184879