• DocumentCode
    2655231
  • Title
    ACID: Adaptive, convergent, and intelligent fault monitoring for distributed systems
  • Author
    Hussain, Shujaat; Qadir, Muhammad Abdul
  • Author_Institution
    Center for Distrib. & Semantic Comput., Mohammad Ali Jinnah Univ., Islamabad
  • fYear
    2008
  • fDate
    18-19 Oct. 2008
  • Firstpage
    126
  • Lastpage
    131
  • Abstract
    Fault monitoring is an important issue for fault-tolerant distributed systems. An efficient fault monitoring scheme makes it easy to detect a crash and quickly take recovery steps. A fault monitor typically detects faults by sending messages to, and receiving messages from, remote objects. One of the monitor's major responsibilities is to adapt timeouts to the dynamic network and system conditions and to set them very close to the real delays in the system. The timeouts must not fluctuate with large amplitudes around the actual time delays, and they should not adapt to sudden transient behavior; otherwise the number of false alarms increases, which may trigger heavy fault recovery mechanisms. The relationship between timeouts and monitoring intervals also needs to be managed intelligently. Our technique adapts the timeout based on previous history, which gives a fair idea of the workload. When we tested the existing schemes against the three points just mentioned, we found, to our surprise, that none of them complies with these points. We compared our technique experimentally with several other proposed techniques, and our scheme, ACID, gave very good results.
  • Keywords
    delays; fault diagnosis; fault tolerant computing; system recovery; failure detection; false fault recovery mechanisms; fault monitoring; fault tolerant distributed system; time delays; timeouts; Computer crashes; Condition monitoring; Delay effects; Delay systems; Fault detection; Fault tolerant systems; History; Object detection; Remote monitoring; Testing; Fault Tolerance; monitoring interval; timeout
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Emerging Technologies, 2008. ICET 2008. 4th International Conference on
  • Conference_Location
    Rawalpindi
  • Print_ISBN
    978-1-4244-2210-4
  • Electronic_ISBN
    978-1-4244-2211-1
  • Type
    conf
  • DOI
    10.1109/ICET.2008.4777487
  • Filename
    4777487