Title :
Rethinking data management for big data scientific workflows
Author :
Vahi, Karan ; Rynge, Mats ; Juve, Gideon ; Mayani, Rajiv ; Deelman, Ewa
Author_Institution :
Inf. Sci. Inst., Univ. of Southern California, Marina Del Rey, CA, USA
Abstract :
Scientific workflows consist of tasks that operate on input data to generate new data products that are used by subsequent tasks. Workflow management systems typically stage data to computational sites before invoking the necessary computations. In some cases data may be accessed using remote I/O. There are limitations with these approaches, however. First, the storage at a computational site may be limited and not able to accommodate the necessary input and intermediate data. Second, even if there is enough storage, it is sometimes managed by a filesystem with limited scalability. In recent years, object stores have been shown to provide a scalable way to store and access large datasets; however, they provide a limited set of operations (retrieve, store and delete) that do not always match the requirements of the workflow tasks. In this paper, we show how scientific workflows can take advantage of the capabilities of object stores without requiring users to modify their workflow-based applications or scientific codes. We present two general approaches: one exclusively uses object stores to store all the files accessed and generated by a workflow, while the other relies on the shared filesystem for caching intermediate data sets. We have implemented both of these approaches in the Pegasus Workflow Management System and have used them to execute workflows in a variety of execution environments, ranging from traditional supercomputing environments that have a shared filesystem to dynamic environments like Amazon AWS and the Open Science Grid that only offer remote object stores. As a result, Pegasus users can easily migrate their applications from a shared filesystem deployment to one using object stores without changing their application codes.
Keywords :
Big Data; cache storage; natural sciences computing; workflow management software; Amazon AWS; Big data scientific workflows; Open Science Grid; Pegasus Workflow Management System; data management; data products; intermediate data set caching; object stores; remote I/O; scientific codes; shared filesystem deployment; supercomputing environments; workflow-based applications; Catalogs; Distributed databases; File systems; Information management; Runtime; Servers; Pegasus; Pegasus Lite; cloud; data staging site; workflows
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
DOI :
10.1109/BigData.2013.6691724