Regional Information Center for Science and Technology - Cyclic Storage for Fault-Tolerant Distributed Executions

DocumentCode :

1080789

Title :

Cyclic Storage for Fault-Tolerant Distributed Executions

Author :

Marcelín-Jiménez, Ricardo ; Rajsbaum, Sergio ; Stevens, Brett

Author_Institution :

Departamento de Ingenieria Electrica, UAM-Iztapalapa

Volume :

Issue :

fYear :

2006

Firstpage :

1028

Lastpage :

1036

Abstract :

Given a set V of active components in charge of a distributed execution, a storage scheme is a sequence B₀, B₁, ..., B_{b-1} of subsets of V, where successive global states are recorded. The subsets, also called blocks, have the same size and are scheduled according to some fixed and cyclic calendar of b steps. During the ith step, block B_i is selected. Each component takes a copy of its local state and sends it to one of the components in B_i, in such a way that each component stores (approximately) the same number of local states. Afterward, if a component of B_i crashes, all of its stored data is lost and the computation cannot continue. If there exists a block with no failed components in it, then a recent global state can be retrieved and the computation does not need to start over from the very beginning. The goal is to design storage schemes that tolerate as many crashes as possible, while trying to have each component participating in as few blocks as possible and, at the same time, working with large blocks (so that a component in a block stores a small number of local states). In this paper, several such schemes are described and compared in terms of these measures.
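The mechanics described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "consecutive window" block construction, the round-robin assignment, and all function names (`cyclic_blocks`, `assign_states`, `recoverable_blocks`) are hypothetical and are not taken from the paper, which describes and compares its own schemes.

```python
def cyclic_blocks(components, block_size, num_blocks):
    """Build a cyclic calendar of blocks.

    Assumption for illustration only: block i is a window of
    `block_size` consecutive components starting at offset i (mod n).
    """
    n = len(components)
    return [
        [components[(i + j) % n] for j in range(block_size)]
        for i in range(num_blocks)
    ]


def assign_states(components, block):
    """Send each component's local state to one member of the block.

    Round-robin assignment, so every block member stores
    (approximately) the same number of local states.
    """
    storage = {holder: [] for holder in block}
    for idx, comp in enumerate(components):
        storage[block[idx % len(block)]].append(comp)
    return storage


def recoverable_blocks(blocks, crashed):
    """Return the indices of blocks that contain no crashed component.

    A global state stored in any such block can still be retrieved.
    """
    crashed = set(crashed)
    return [i for i, blk in enumerate(blocks) if crashed.isdisjoint(blk)]


components = list(range(6))
blocks = cyclic_blocks(components, block_size=2, num_blocks=6)
step0 = assign_states(components, blocks[0])
survivors = recoverable_blocks(blocks, crashed={1, 4})
```

With six components and blocks of size 2, the first step stores three local states on each of components 0 and 1. If components 1 and 4 crash, blocks 2 and 5 remain intact in this toy calendar, so a recorded global state can still be recovered. The trade-off named in the abstract is visible here: larger blocks reduce the storage load per member but make it more likely that a given block contains a crashed component.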

Keywords :

checkpointing; distributed processing; fault tolerant computing; resource allocation; scheduling; storage management; crash-tolerant storage; cyclic storage scheme design; data backup; data mining; distributed system storage; fault-tolerant distributed execution; load balancing; network repository; task assignment; Bismuth; Calendars; Centralized control; Computer crashes; Computer networks; Data mining; Distributed control; Fault tolerance; Fault tolerant systems; Resumes; Load balancing and task assignment; checkpoint/restart; distributed applications; distributed systems; fault-tolerance; network repositories/data mining/backup; storage/repositories;

fLanguage :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

IEEE

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/TPDS.2006.120

Filename :

1668066

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1080789