  • DocumentCode
    6750
  • Title
    Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications
  • Author
    Xuanhua Shi ; Ming Chen ; Ligang He ; Xu Xie ; Lu Lu ; Hai Jin ; Yong Chen ; Song Wu
  • Author_Institution
    Cluster & Grid Comput. Lab. in the Sch. of Comput. Sci. & Technol., Huazhong Univ. of Sci. & Technol., Wuhan, China
  • Volume
    26
  • Issue
    8
  • fYear
    2015
  • fDate
    Aug. 1 2015
  • Firstpage
    2300
  • Lastpage
    2315
  • Abstract
    The MapReduce platform has been widely used for large-scale data processing and analysis in recent years. It works well when the hardware of a cluster is well configured. However, our survey has indicated that common hardware configurations in small- and medium-size enterprises may not be suitable for such tasks. This situation is even more challenging for memory-constrained systems, in which memory is the bottleneck resource relative to CPU power and thus cannot meet the needs of large-scale data processing. According to our survey, the traditional high performance computing (HPC) system is an example of such a memory-constrained system. In this paper, we develop Mammoth, a new MapReduce system that aims to improve MapReduce performance through global memory management. In Mammoth, we design a novel rule-based heuristic to prioritize memory allocation and revocation among execution units (mapper, shuffler, reducer, etc.), so as to maximize the holistic benefit of the Map/Reduce job when scheduling each memory unit. We have also developed a multi-threaded execution engine, which is based on Hadoop but runs in a single JVM on each node. In the execution engine, we implement the memory-scheduling algorithm that realizes global memory management, on top of which we further develop techniques such as sequential disk accessing, multi-cache, and shuffling from memory, and address the problem of full garbage collection in the JVM. We have conducted extensive experiments to compare Mammoth against the native Hadoop platform. The results show that Mammoth can reduce job execution time by more than 40 percent in typical cases, without requiring any modifications to the Hadoop programs. When a system is short of memory, Mammoth can improve performance by up to 5.19 times, as observed for I/O-intensive applications such as PageRank. We also compared Mammoth with Spark. Although Spark achieves better performance than Mammoth for interactive and iterative applications when memory is sufficient, our experimental results show that for batch processing applications, Mammoth adapts better to various memory environments: it outperforms Spark when memory is insufficient and obtains similar performance to Spark when memory is sufficient. Given the growing importance of supporting large-scale data processing and analysis and the proven success of the MapReduce platform, the Mammoth system has promising potential and impact.
  • Keywords
    data analysis; multi-threading; parallel processing; HPC system; Hadoop programs; I/O intensive applications; Mammoth system; PageRank; batch processing applications; data analysis; execution units; global memory management; hardware configurations; high performance computing system; interactive applications; iterative applications; large-scale data processing; memory allocation; memory-constrained systems; memory-intensive MapReduce applications; multi-threaded execution engine; rule-based heuristic; Data processing; Data structures; Educational institutions; Engines; Memory management; Receivers; Runtime; HPC; MapReduce; data processing
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2014.2345068
  • Filename
    6869021
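  • Note
    The abstract describes a rule-based heuristic that prioritizes memory allocation and revocation among execution units (mapper, shuffler, reducer) under a global memory budget. The Java code below is a minimal, hypothetical sketch of such a priority-driven global memory manager, added here purely for illustration: the class, method names, and priority rules are assumptions and do not reflect Mammoth's actual implementation or API.

    // Illustrative sketch only: a toy global memory manager that grants memory to
    // execution units by priority and revokes memory from lower-priority units
    // when a higher-priority request cannot be satisfied. All names are hypothetical.
    import java.util.EnumMap;
    import java.util.Map;

    public class GlobalMemoryManagerSketch {

        // Execution-unit kinds, listed in an assumed priority order (highest first).
        enum Unit { REDUCER, SHUFFLER, MAPPER }

        private final long capacityBytes;                     // total memory budget on the node
        private final Map<Unit, Long> allocated = new EnumMap<>(Unit.class);

        GlobalMemoryManagerSketch(long capacityBytes) {
            this.capacityBytes = capacityBytes;
            for (Unit u : Unit.values()) allocated.put(u, 0L);
        }

        private long used() {
            return allocated.values().stream().mapToLong(Long::longValue).sum();
        }

        // Try to grant 'bytes' to 'unit'; if memory is short, revoke from
        // lower-priority units (a real system would spill their buffers to disk).
        synchronized boolean allocate(Unit unit, long bytes) {
            long shortfall = used() + bytes - capacityBytes;
            if (shortfall > 0) {
                // Walk units from lowest to highest priority and reclaim memory.
                for (int i = Unit.values().length - 1; i >= 0 && shortfall > 0; i--) {
                    Unit victim = Unit.values()[i];
                    if (victim.ordinal() <= unit.ordinal()) break;  // never revoke from equal/higher priority
                    long reclaimed = Math.min(allocated.get(victim), shortfall);
                    allocated.put(victim, allocated.get(victim) - reclaimed);
                    shortfall -= reclaimed;
                    System.out.printf("revoked %d bytes from %s%n", reclaimed, victim);
                }
            }
            if (used() + bytes > capacityBytes) return false;     // still short: caller must spill
            allocated.put(unit, allocated.get(unit) + bytes);
            return true;
        }

        public static void main(String[] args) {
            GlobalMemoryManagerSketch mm = new GlobalMemoryManagerSketch(1_000_000);
            mm.allocate(Unit.MAPPER, 700_000);   // map-side buffer fills most of the budget
            mm.allocate(Unit.REDUCER, 500_000);  // reducer request triggers revocation from the mapper
            System.out.println("allocated: " + mm.allocated);
        }
    }

    In this toy version, priority is fixed by enum order; the paper's heuristic instead applies scheduling rules aimed at maximizing the holistic benefit of the whole Map/Reduce job when deciding which unit yields memory.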