Title :
Milieu: Lightweight and Configurable Big Data Provenance for Science
Author :
Cheah, You-Wei ; Canon, Richard ; Plale, Beth ; Ramakrishnan, Lavanya
Author_Institution :
Sch. of Inf. & Comput., Indiana Univ., Bloomington, IN, USA
Date :
June 27, 2013 - July 2, 2013
Abstract :
The volume and complexity of data produced and analyzed in scientific collaborations are growing exponentially. Tracking data-intensive scientific analysis workflows is important for providing context and reproducibility as data is transformed in these collaborations. Provenance addresses this need and aids scientists by providing the lineage, or history, of how data is generated, used, and modified. Provenance has traditionally been collected at the workflow level, which often makes it hard to capture relevant information about resource characteristics and difficult for users to incorporate into existing workflows. In this paper, we describe Milieu, a framework focused on collecting provenance for scientific experiments in High Performance Computing systems. Our approach collects provenance in a minimally intrusive way, without significantly impacting the execution performance of scientific workflows. We also let users control the fidelity of provenance collection by choosing among three levels of collection. We evaluate our framework on systems at the National Energy Research Scientific Computing Center (NERSC) and show that the overhead is less than the variation these applications already experience in these shared environments.
Keywords :
data handling; natural sciences computing; Milieu framework; NERSC; National Energy Research Scientific Computing Center; big data provenance; data generation; data modification; data use; data-intensive analysis workflow; scientific collaboration; Collaboration; Computer architecture; Context; Data models; Database languages; Instruments; Lattices; Database; High Performance Computing; Provenance;
Conference_Title :
Big Data (BigData Congress), 2013 IEEE International Congress on
Conference_Location :
Santa Clara, CA
Print_ISBN :
978-0-7695-5006-0
DOI :
10.1109/BigData.Congress.2013.16