DocumentCode :
1362278
Title :
Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability
Author :
Monti, Henry M. ; Butt, Ali R. ; Vazhkudai, Sudharshan S.
Author_Institution :
Dept. of Comput. Sci., Virginia Polytech. Inst. & State Univ., Blacksburg, VA, USA
Volume :
22
Issue :
8
fYear :
2011
Firstpage :
1307
Lastpage :
1322
Abstract :
Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting large data entails a significant commitment of the high-throughput center storage system, scratch space. However, the scratch space is typically managed using simple "purge policies," without sophisticated end-user data services to balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that are unable to reconcile the center's purge and users' delivery deadlines, unable to adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address the above issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job log-driven simulations, shows that the offloading times can be significantly reduced (90.4 percent for a 5 GB data transfer) and that the exposure window can be minimized while also meeting center-user service level agreements.
Keywords :
computer centres; electronic data interchange; fault tolerant computing; parallel machines; peer-to-peer computing; scheduling; storage management; BitTorrent; HPC center scratch provisioning; PBS; center-user service level agreements; data deluge; data transfer tool; decentralized fault-tolerant delivery; delivery deadlines; end-to-end data path; exposure window; high-performance computing; high-throughput center storage system; landmark sites; offloading times; point-to-point transfers; production job scheduler; purge policy; resource consumption; scratch space; sophisticated end-user data services; supercomputer job log-driven simulations; timely result-data offloading; user serviceability; user-specified intermediate sites; Bandwidth; Collaboration; Fault tolerance; Fault tolerant systems; Monitoring; Schedules; Supercomputers; HPC center serviceability; High-performance data management; end-user data delivery; offloading; peer-to-peer
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/TPDS.2010.190
Filename :
5611504