• DocumentCode
    633064
  • Title

    Milieu: Lightweight and Configurable Big Data Provenance for Science

  • Author

    Cheah, You-Wei ; Canon, Richard ; Plale, Beth ; Ramakrishnan, Lavanya

  • Author_Institution
    Sch. of Inf. & Comput., Indiana Univ., Bloomington, IN, USA
  • fYear
    2013
  • fDate
    June 27 2013-July 2 2013
  • Firstpage
    46
  • Lastpage
    53
  • Abstract
    The volume and complexity of data produced and analyzed in scientific collaborations are growing exponentially. It is important to track data-intensive scientific analysis workflows to provide context and reproducibility as data is transformed in these collaborations. Provenance addresses this need and aids scientists by providing the lineage, or history, of how data is generated, used, and modified. Provenance has traditionally been collected at the workflow level, which often makes it hard to capture relevant information about resource characteristics and difficult for users to incorporate into existing workflows. In this paper, we describe Milieu, a framework focused on the collection of provenance for scientific experiments in High Performance Computing systems. Our approach collects provenance in a minimally intrusive way, without significantly impacting the performance of scientific workflow execution. We also let users control the fidelity of provenance collection by choosing among three collection levels. We evaluate our framework on systems at the National Energy Research Scientific Computing Center (NERSC) and show that the overhead is less than the variation already experienced by these applications in these shared environments.
  • Keywords
    data handling; natural sciences computing; Milieu framework; NERSC; National Energy Research Scientific Computing Center; big data provenance; data generation; data modification; data use; data-intensive analysis workflow; scientific collaboration; Collaboration; Computer architecture; Context; Data models; Database languages; Instruments; Lattices; Database; High Performance Computing; Provenance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (BigData Congress), 2013 IEEE International Congress on
  • Conference_Location
    Santa Clara, CA
  • Print_ISBN
    978-0-7695-5006-0
  • Type

    conf

  • DOI
    10.1109/BigData.Congress.2013.16
  • Filename
    6597118