DocumentCode :
3673296
Title :
ParaMASK: A Multi-Agent System for the efficient and dynamic adaptation of HPC workloads
Author :
Mateusz Guzek;Xavier Besseron;Sébastien Varrette;Grégoire Danoy;Pascal Bouvry
Author_Institution :
Interdisciplinary Centre for Security Reliability and Trust, 6, rue Richard Coudenhove-Kalergi, L-1359 Luxembourg, Luxembourg
fYear :
2014
Firstpage :
275
Lastpage :
281
Abstract :
The growing parallelism and heterogeneity of modern computing infrastructures, such as High Performance Computing (HPC) platforms, raise new challenges for their programmers and users. Additional requirements have also emerged, such as minimizing energy consumption, reducing the system resources used, or providing built-in reliability mechanisms. HPC applications therefore require adaptation mechanisms and must move away from traditional monolithic, centralized approaches in favor of novel autonomous, flexible and decentralized decision systems. In this context, we describe a dynamic and flexible adaptation scheme based on a Multi-Agent System (MAS) to handle parallel or distributed executions in an HPC environment. More precisely, we model and extend the existing HPC middleware Kaapi to offer the capabilities of the ParaMoise multi-agent organizational framework. Our proposed solution, named ParaMASK, relies on the similarities between ParaMoise workflow-based functional specifications and the Directed Acyclic Graph (DAG) representation of the distributed execution within Kaapi. As a result, ParaMASK makes it possible to analyze and reorganize the scheduling of the tasks that compose a program in an autonomous and decentralized way, while also handling dynamic adaptations (for example, using task migration to fulfill energy consumption goals). The proposed solution was implemented on top of the existing Kaapi middleware and includes an optimized algorithm for agent coordination. ParaMASK has been validated with a series of experiments on a real computational grid. Experimental results show good scalability and an exceptionally low overhead induced by the approach: less than a 1.5% increase in execution time with periodic coordination every 15 seconds on 2662 cores.
Keywords :
"Organizations","Monitoring","Computational modeling","Scheduling","System-on-chip"
Publisher :
IEEE
Conference_Title :
Signal Processing and Information Technology (ISSPIT), 2014 IEEE International Symposium on
ISSN :
2162-7843
Type :
conf
DOI :
10.1109/ISSPIT.2014.7300600
Filename :
7300600