Regional Information Center for Science and Technology - ParaMASK: A Multi-Agent System for the efficient and dynamic adaptation of HPC workloads

Abstract:

The growing parallelism and heterogeneity of modern computing infrastructures such as High Performance Computing (HPC) platforms raise new challenges for their programmers and users. Additional requirements have emerged, such as minimizing the energy consumed, reducing the system resources used, or providing built-in reliability mechanisms. HPC applications therefore require adaptation mechanisms and must abandon traditional monolithic, centralized approaches in favor of novel autonomous, flexible and decentralized decision systems. In this context, we describe a dynamic and flexible adaptation scheme based on a Multi-Agent System (MAS) to handle parallel or distributed executions in an HPC environment. More precisely, we model and extend the existing HPC middleware Kaapi to offer the power of the ParaMoise multi-agent organizational framework. Our proposed solution, named ParaMASK, relies on the similarities between ParaMoise workflow-based functional specifications and the Directed Acyclic Graph (DAG) representation of the distributed execution within Kaapi. As a result, ParaMASK makes it possible to analyze and reorganize the scheduling of the tasks that compose a program in an autonomous and decentralized way, while additionally handling dynamic adaptations (for example, using task migration to meet energy consumption goals). The proposed solution was implemented on top of the existing Kaapi middleware and includes an optimized algorithm for agent coordination. ParaMASK has been validated through a series of experiments on a real computational grid. Experimental results show good scalability and an exceptionally low overhead induced by the approach: less than a 1.5% increase in execution time with periodic coordination every 15 seconds on 2662 cores.
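
To make the idea of a DAG-based execution with energy-driven task migration more concrete, here is a minimal Python sketch. It is not the Kaapi or ParaMoise API, and it is not the coordination algorithm of ParaMASK, which the abstract does not detail. The names `Task`, `topological_order`, `migrate_for_energy`, the node labels and the power figures are all hypothetical, chosen only for illustration. The sketch assumes a single coordination round in which tasks that are not yet done are moved off nodes whose power draw exceeds a budget.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record: a name, its predecessors, and its host node."""
    name: str
    deps: list = field(default_factory=list)  # names of predecessor tasks
    node: str = "n0"                          # node currently hosting the task


def topological_order(tasks):
    """Return one valid execution order of the DAG (Kahn's algorithm).

    `tasks` maps task names to Task objects. Raises ValueError on a cycle.
    """
    indegree = {name: len(t.deps) for name, t in tasks.items()}
    successors = {name: [] for name in tasks}
    for name, t in tasks.items():
        for dep in t.deps:
            successors[dep].append(name)

    ready = deque(sorted(n for n, k in indegree.items() if k == 0))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for succ in sorted(successors[current]):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(order) != len(tasks):
        raise ValueError("task graph contains a cycle")
    return order


def migrate_for_energy(tasks, done, power_watts, budget_watts):
    """One illustrative coordination round (an assumption, not ParaMASK's algorithm).

    Pending tasks hosted on a node whose power draw exceeds `budget_watts`
    are reassigned to the node with the lowest power draw.
    Returns the names of the migrated tasks.
    """
    target = min(power_watts, key=power_watts.get)
    moved = []
    for name in sorted(tasks):
        task = tasks[name]
        if name in done:
            continue  # finished tasks are never moved
        if power_watts[task.node] > budget_watts and task.node != target:
            task.node = target
            moved.append(name)
    return moved
```

In a real system each agent would run such a round periodically (the abstract reports coordination every 15 seconds) and in a decentralized way; this sketch collapses that into one centralized function purely to keep the example short.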