DocumentCode
3203073
Title
Approaches for Implementing Persistent Queues within Data-Intensive Scientific Workflows
Author
Agun, Michael ; Bowers, Shawn
fYear
2011
fDate
4-9 July 2011
Firstpage
200
Lastpage
207
Abstract
Many scientific workflow systems are built on dataflow-based models of computation in which data drives the execution of workflow components. An advantage of using dataflow models is their straightforward semantics (which includes support for branching, merging, and looping) and their ability to concurrently execute workflow steps. However, for many data-intensive workflows the dataflow model often requires data buffering. Current systems largely perform buffering through in-memory queues, which can lead to buffer overflow and performance degradation as queues reach capacity (e.g., because of paging). We describe an alternative framework that leverages external storage to implement buffers (which we refer to as persistent queues) within data-intensive scientific workflows. Our framework can easily be used with different underlying storage technologies, and we consider and evaluate three distinct approaches: a traditional relational database implementation, a non-relational implementation designed for fast reads and writes, and a specialized approach that can further reduce external buffering overhead. In addition, the use of persistent queues can provide detailed provenance information "for free" by capturing the input and output information of each workflow component during workflow execution. Although many systems provide such provenance information, we show how, through persistent queues, this information can be captured efficiently and used to improve overall workflow performance.
Keywords
data flow analysis; queueing theory; relational databases; storage management; buffer overflow; data buffering; data-intensive scientific workflow; dataflow-based model; in-memory queues; relational database; storage technology; Buffer storage; Computational modeling; Parallel processing; Pipelines; Relational databases; Schedules; Actor-Oriented Modeling; Dataflow; Pipeline Parallelism; Scientific Workflows;
fLanguage
English
Publisher
IEEE
Conference_Titel
Services (SERVICES), 2011 IEEE World Congress on
Conference_Location
Washington, DC
Print_ISBN
978-1-4577-0879-4
Electronic_ISBN
978-0-7695-4461-8
Type
conf
DOI
10.1109/SERVICES.2011.57
Filename
6012713