• DocumentCode
    660557
  • Title
    SEDGE: Symbolic example data generation for dataflow programs
  • Author
    Li, Kaituo ; Reichenbach, Christoph ; Smaragdakis, Yannis ; Diao, Yixin ; Csallner, Christoph
  • Author_Institution
    Comput. Sci. Dept., Univ. of Massachusetts, Amherst, MA, USA
  • fYear
    2013
  • fDate
    11-15 Nov. 2013
  • Firstpage
    235
  • Lastpage
    245
  • Abstract
    Exhaustive, automatic testing of dataflow (esp. mapreduce) programs has emerged as an important challenge. Past work demonstrated effective ways to generate small example data sets that exercise operators in the Pig platform, used to generate Hadoop map-reduce programs. Although such prior techniques attempt to cover all cases of operator use, in practice they often fail. Our SEDGE system addresses these completeness problems: for every dataflow operator, we produce data aiming to cover all cases that arise in the dataflow program (e.g., both passing and failing a filter). SEDGE relies on transforming the program into symbolic constraints, and solving the constraints using a symbolic reasoning engine (a powerful SMT solver), while using input data as concrete aids in the solution process. The approach resembles dynamic-symbolic (a.k.a. “concolic”) execution in a conventional programming language, adapted to the unique features of the dataflow domain. In third-party benchmarks, SEDGE achieves higher coverage than past techniques for 5 out of 20 PigMix benchmarks and 7 out of 11 SDSS benchmarks (with equal coverage for the rest of the benchmarks). We also show that our targeting of the high-level dataflow language pays off: for complex programs, state-of-the-art dynamic-symbolic execution at the level of the generated map-reduce code (instead of the original dataflow program) requires many more test cases or achieves much lower coverage than our approach.
  • Keywords
    data flow analysis; program testing; programming languages; reasoning about programs; specification languages; Hadoop map-reduce programs; SDSS benchmarks; SEDGE system; SMT solver; automatic testing; complex programs; concolic execution; conventional programming language; dataflow domain; dataflow operator; dataflow programs; high-level dataflow language; map-reduce code; mapreduce programs; operator use; pig platform; state-of-the-art dynamic-symbolic execution; symbolic constraints; symbolic example data generation; symbolic reasoning engine; test cases; Benchmark testing; Cognition; Concrete; Data processing; Educational institutions; Extraterrestrial measurements; Programming;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on
  • Conference_Location
    Silicon Valley, CA
  • Type
    conf
  • DOI
    10.1109/ASE.2013.6693083
  • Filename
    6693083