DocumentCode
179871
Title
Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism
Author
Cabello, Uriel ; Rodriguez, Jose ; Meneses, Amilcar ; Mendoza, Sergio ; Decouchant, Dominique
Author_Institution
Dept. of Comput. Sci., Center of Res. & Adv. Studies, Mexico City, Mexico
fYear
2014
fDate
Sept. 29 2014-Oct. 3 2014
Firstpage
1
Lastpage
7
Abstract
The grid computing paradigm consists of multiple heterogeneous distributed clusters connected by heterogeneous network interfaces. One advantage of this paradigm is the ability to analyze massive amounts of data using computing resources located at different geographic sites and running on different platforms. However, many problems must be solved in order to harness the power of those resources. In this work we address the problem of fault tolerance in heterogeneous computer systems. Our proposal aims to ease the recovery process when system failures are detected at runtime, avoiding the need to restart the application. It works through a set of services that perform transparent task migration across the computing nodes, hiding the complexity of error handling when a hybrid programming model based on Open MPI and OpenCL is employed.
Keywords
fault tolerant computing; grid computing; parallel programming; Open MPI programming; OpenCL programming; data analysis; error handling; fault tolerance; grid computing paradigm; heterogeneous computer systems; heterogeneous distributed clusters; heterogeneous multi-cluster systems; heterogeneous network interfaces; hybrid programming model; task migration mechanism; Computational modeling; Fault tolerance; Fault tolerant systems; Kernel; Programming; Proposals;
fLanguage
English
Publisher
IEEE
Conference_Titel
Electrical Engineering, Computing Science and Automatic Control (CCE), 2014 11th International Conference on
Conference_Location
Campeche
Print_ISBN
978-1-4799-6228-0
Type
conf
DOI
10.1109/ICEEE.2014.6978266
Filename
6978266