Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability

Author

Monti, Henry M. ; Butt, Ali R. ; Vazhkudai, Sudharshan S.

Author_Institution

Dept. of Comput. Sci., Virginia Polytech. Inst. & State Univ., Blacksburg, VA, USA

Volume

22

Issue

8

fYear

2011

Firstpage

1307

Lastpage

1322

Abstract

Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting large data entails a significant commitment of the center's high-throughput storage system, the scratch space. However, the scratch space is typically managed using simple "purge policies," without sophisticated end-user data services to balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address the above issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job log-driven simulations, shows that the offloading times can be significantly reduced (90.4 percent for a 5 GB data transfer) and that the exposure window can be minimized while also meeting center-user service level agreements.

Keywords

computer centres; electronic data interchange; fault tolerant computing; parallel machines; peer-to-peer computing; scheduling; storage management; BitTorrent; HPC center scratch provisioning; PBS; center-user service level agreements; data deluge; data transfer tool; decentralized fault-tolerant delivery; delivery deadlines; end-to-end data path; exposure window; high-performance computing; high-throughput center storage system; landmark sites; offloading times; point-to-point transfers; production job scheduler; purge policy; resource consumption; scratch space; sophisticated end-user data services; supercomputer job log-driven simulations; timely result-data offloading; user serviceability; user-specified intermediate sites; Bandwidth; Collaboration; Fault tolerance; Fault tolerant systems; Monitoring; Schedules; Supercomputers; HPC center serviceability; High-performance data management; end-user data delivery; offloading; peer-to-peer

fLanguage

English

Journal_Title

Parallel and Distributed Systems, IEEE Transactions on

Publisher

ieee

ISSN

1045-9219

Type

jour

DOI

10.1109/TPDS.2010.190

Filename

5611504