• DocumentCode
    3461240
  • Title
    m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data
  • Author
    Kalavri, Vasiliki ; Hui Shang ; Vlassov, Vladimir
  • Author_Institution
    School of Information and Communication Technology, KTH Royal Institute of Technology, Stockholm, Sweden
  • fYear
    2013
  • fDate
    3-5 Dec. 2013
  • Firstpage
    894
  • Lastpage
    901
  • Abstract
    High-level parallel dataflow systems, such as Pig and Hive, have lately gained great popularity in the area of big data processing. These systems often consist of a declarative query language and a set of compilers, which transform queries into execution plans and submit them to a distributed engine for execution. Apart from the useful abstraction and support for common analysis operations, high-level processing systems also offer great opportunities for automatic optimizations. Existing studies on execution traces from big data centers and industrial clusters show that there is significant computation redundancy in analysis programs, i.e., there exist similar or even identical queries on the same datasets in different jobs. Furthermore, workload characterization of MapReduce traces from large organizations suggests that there is a strong need for caching job results, which would enable their reuse and improve execution time. In this paper, we propose m2r2, an extensible and language-independent framework for results materialization and reuse in high-level dataflow systems for big data analytics. Our prototype implementation is built on top of the Pig dataflow system and handles automatic results caching, common sub-query matching and rewriting, as well as garbage collection. We have evaluated m2r2 using the TPC-H benchmark for Pig and report that it reduces query execution time by 65% on average.
  • Keywords
    Big Data; data flow computing; query languages; query processing; rewriting systems; High-level parallel dataflow system; Hive; MapReduce; Pig dataflow system; TPC-H benchmark; big data processing; declarative query language; m2r2; materialize-match-rewrite-reuse; Benchmark testing; Data handling; Data storage systems; Engines; Information management; Optimization; Prototypes; computation redundancies; materialization; results reuse;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on
  • Conference_Location
    Sydney, NSW
  • Type
    conf
  • DOI
    10.1109/CSE.2013.134
  • Filename
    6755314