• DocumentCode
    3203073
  • Title

    Approaches for Implementing Persistent Queues within Data-Intensive Scientific Workflows

  • Author

    Agun, Michael ; Bowers, Shawn

  • fYear
    2011
  • fDate
    4-9 July 2011
  • Firstpage
    200
  • Lastpage
    207
  • Abstract
    Many scientific workflow systems are built on dataflow-based models of computation in which data drives the execution of workflow components. An advantage of using dataflow models is their straightforward semantics (which includes support for branching, merging, and looping) and their ability to concurrently execute workflow steps. However, for many data-intensive workflows the dataflow model often requires data buffering. Current systems largely perform buffering through in-memory queues which can lead to buffer overflow and performance degradation as queues reach capacity (e.g., because of paging). We describe an alternative framework that leverages external storage to implement buffers (which we refer to as persistent queues) within data-intensive scientific workflows. Our framework can easily be used with different underlying storage technologies, and we consider and evaluate three distinct approaches: a traditional relational database implementation, a non-relational implementation designed for fast reads and writes, and a specialized approach that can further reduce external buffering overhead. In addition, the use of persistent queues can provide detailed provenance information "for free" by capturing the input and output information of each workflow component during workflow execution. Although many systems provide such provenance information, we show how this information can be captured both efficiently and can be used to improve overall workflow performance through persistent queues.
  • Keywords
    data flow analysis; queueing theory; relational databases; storage management; buffer overflow; data buffering; data-intensive scientific workflow; dataflow-based model; in-memory queues; relational database; storage technology; Buffer storage; Computational modeling; Parallel processing; Pipelines; Relational databases; Schedules; Actor-Oriented Modeling; Dataflow; Pipeline Parallelism; Scientific Workflows;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Services (SERVICES), 2011 IEEE World Congress on
  • Conference_Location
    Washington, DC
  • Print_ISBN
    978-1-4577-0879-4
  • Electronic_ISBN
    978-0-7695-4461-8
  • Type

    conf

  • DOI
    10.1109/SERVICES.2011.57
  • Filename
    6012713