DocumentCode :
656215
Title :
Pipelining/Overlapping Data Transfer for Distributed Data-Intensive Job Execution
Author :
Eun-Sung Jung ; Maheshwari, Ketan ; Kettimuthu, Rajkumar
Author_Institution :
Math. & Comput. Sci. Div., Argonne Nat. Lab., Argonne, IL, USA
fYear :
2013
fDate :
1-4 Oct. 2013
Firstpage :
791
Lastpage :
797
Abstract :
Scientific workflows are attracting increasing attention as data and compute resources grow larger, more heterogeneous, and more distributed. Many scientific workflows are both compute intensive and data intensive and use distributed resources. This situation poses significant challenges for real-time remote analysis and for dissemination of massive datasets to scientists across the community, and these challenges will be exacerbated in the exascale era. Parallel jobs in scientific workflows are common, and such parallelism can be exploited by scheduling parallel jobs across multiple execution sites for enhanced performance. Previous scheduling algorithms such as heterogeneous earliest finish time (HEFT) did not focus on scheduling the thousands of jobs often seen in contemporary applications. Some techniques, such as task clustering, have been proposed to reduce the overhead of scheduling a large number of jobs. However, scheduling massively parallel jobs in distributed environments poses new challenges because data movement becomes a nontrivial factor. We propose efficient parallel execution models based on pipelined execution of data transfer, incorporating network bandwidth and the resources reserved at an execution site. We formally analyze these models and identify the best model along with its optimal degree of parallelism. We implement our model in the Swift parallel scripting paradigm using GridFTP. Experiments on real distributed computing resources show that our model with the optimal degree of parallelism outperforms the current parallel execution model, reducing total execution time by as much as 50%.
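The core idea in the abstract, overlapping the transfer of one data chunk with computation on the previous one, can be illustrated with a minimal producer/consumer sketch. This is not the paper's Swift/GridFTP implementation; the function names, the simulated timings, and the `depth` parameter (a stand-in for the degree of parallelism the paper optimizes) are all assumptions for illustration.

```python
import queue
import threading
import time

def transfer(chunk):
    # Simulated data transfer (stand-in for a GridFTP get); timing is arbitrary.
    time.sleep(0.01)
    return chunk

def compute(chunk):
    # Simulated per-chunk computation at the execution site.
    return chunk * 2

def pipelined_run(chunks, depth=4):
    """Overlap the transfer of chunk i+1 with computation on chunk i.

    `depth` bounds the number of transferred-but-unprocessed chunks,
    playing the role of the degree of parallelism in the paper's model.
    """
    q = queue.Queue(maxsize=depth)
    results = []

    def producer():
        # Transfers proceed concurrently with the consumer's compute loop.
        for c in chunks:
            q.put(transfer(c))
        q.put(None)  # sentinel: no more chunks

    t = threading.Thread(target=producer)
    t.start()
    while True:
        item = q.get()
        if item is None:
            break
        results.append(compute(item))
    t.join()
    return results
```

With sequential execution, total time is roughly the sum of transfer and compute times; with this overlap it approaches the maximum of the two, which is the source of the reported speedup.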
Keywords :
electronic data interchange; natural sciences computing; parallel processing; pipeline processing; scheduling; workflow management software; HEFT; Swift parallel scripting paradigm; data movement; data transfer pipelining; distributed computing resources; distributed data-intensive job execution; exascale era; heterogeneous earliest finish time; massive dataset dissemination; massive dataset real-time remote analysis; overlapping data transfer; parallel execution models; parallel job scheduling; scientific workflows; task clustering; Computational modeling; Data transfer; Equations; Mathematical model; Pipeline processing; Silicon;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel Processing (ICPP), 2013 42nd International Conference on
Conference_Location :
Lyon
ISSN :
0190-3918
Type :
conf
DOI :
10.1109/ICPP.2013.93
Filename :
6687418