Regional Information Center for Science and Technology - Problem Diagnosis in Large-Scale Computing Environments

DocumentCode :

3425636

Title :

Problem Diagnosis in Large-Scale Computing Environments

Author :

Mirgorodskiy, Alexander V. ; Maruyama, Naoya ; Miller, Barton P.

fYear :

2006

fDate :

11-17 Nov. 2006

Firstpage :

Lastpage :

Abstract :

We describe a new approach for locating the causes of anomalies in distributed systems. Our target environment is a distributed application that contains multiple identical processes performing similar activities. We use a new, lightweight form of dynamic instrumentation to collect function-level traces from each process. If the application fails, the traces are automatically compared to each other. We find anomalies by identifying processes that stopped earlier than the rest (a sign of a fail-stop problem) or processes that behaved differently from the rest (a sign of a non-fail-stop problem). Our algorithm does not require reference data to distinguish anomalies from normal behaviors. However, it can make use of such data when available to reduce the number of false positives. Ultimately, we identify a function that is likely to explain the anomalous behavior. We demonstrated the efficacy of our approach by finding two problems in a large distributed cluster environment called SCore.
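
The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the trace representation (a last-record timestamp and per-function time totals for each process), the `gap_fraction` threshold, the Manhattan distance, and all function names below are assumptions chosen only to show the idea of flagging processes that stopped early or whose behavior differs from that of their peers, without any reference data.

```python
from statistics import median


def earliest_stoppers(last_timestamps, gap_fraction=0.5):
    """Return ids of processes whose final trace record is much earlier than the median.

    last_timestamps: dict of process id -> time of the last trace record.
    gap_fraction: hypothetical threshold, as a fraction of the median end time.
    """
    typical = median(last_timestamps.values())
    return sorted(
        pid for pid, t in last_timestamps.items()
        if typical - t > gap_fraction * typical
    )


def normalize(profile):
    """Turn per-function time totals into fractions of the process's total time."""
    total = sum(profile.values())
    return {fn: t / total for fn, t in profile.items()} if total else {}


def distance(p, q):
    """Manhattan distance between two normalized profiles."""
    return sum(abs(p.get(fn, 0.0) - q.get(fn, 0.0)) for fn in set(p) | set(q))


def most_unusual(profiles):
    """Return (pid, function) for the process farthest on average from its peers
    and the function whose time share deviates most from the peers' mean.

    Requires at least two processes.
    """
    norm = {pid: normalize(p) for pid, p in profiles.items()}

    def score(pid):
        others = [q for other, q in norm.items() if other != pid]
        return sum(distance(norm[pid], q) for q in others) / len(others)

    suspect = max(norm, key=score)
    peers = [q for pid, q in norm.items() if pid != suspect]
    functions = set(norm[suspect]).union(*peers)

    def deviation(fn):
        peer_mean = sum(q.get(fn, 0.0) for q in peers) / len(peers)
        return abs(norm[suspect].get(fn, 0.0) - peer_mean)

    return suspect, max(functions, key=deviation)


# Made-up example data: process 2 stops early; process 3 spends most of its time in "retry".
last = {0: 100.0, 1: 98.0, 2: 30.0, 3: 101.0}
profiles = {
    0: {"compute": 8.0, "send": 2.0},
    1: {"compute": 7.5, "send": 2.5},
    2: {"compute": 8.0, "send": 2.0},
    3: {"compute": 2.0, "send": 1.0, "retry": 7.0},
}
print(earliest_stoppers(last))
print(most_unusual(profiles))
```

In this sketch the other processes serve as the baseline, which mirrors the abstract's point that no reference data is required; a real tool would need a more careful distance measure and threshold.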

Keywords :

parallel programming; program diagnostics; distributed cluster; distributed system; fail-stop problem; function-level traces; large-scale computing environment; problem diagnosis; Application software; Computer bugs; Data analysis; Distributed computing; High performance computing; Instruments; Large-scale systems; Performance analysis; Permission; Runtime;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

SC 2006 Conference, Proceedings of the ACM/IEEE

Conference_Location :

Tampa, FL

Print_ISBN :

0-7695-2700-0

Electronic_ISBN :

0-7695-2700-0

Type :

conf

DOI :

10.1109/SC.2006.50

Filename :

4090185

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3425636