  • DocumentCode
    3435171
  • Title
    A Hybrid Approach to High Availability in Stream Processing Systems
  • Author
    Zhang, Zhe ; Gu, Yu ; Ye, Fan ; Yang, Hao ; Kim, Minkyong ; Lei, Hui ; Liu, Zhen
  • Author_Institution
    Nat. Center for Comput. Sci., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
  • fYear
    2010
  • fDate
    21-25 June 2010
  • Firstpage
    138
  • Lastpage
    148
  • Abstract
    Stream processing is widely used by today's applications such as financial data analysis and disaster response. In distributed stream processing systems, machine fail-stop events are handled by either active standby or passive standby. However, existing high availability (HA) schemes have not sufficiently addressed the situation in which a machine becomes temporarily unavailable due to data rate spikes, intensive analysis, or job sharing, which happens frequently but lasts only a short time. It is not clear how well active and passive standby fare against such transient unavailability. In this paper, we first critically examine the suitability of active and passive standby against transient unavailability in a real testbed environment. We find that both approaches have advantages and drawbacks, but neither is ideal for providing the fast recovery at low overhead that handling transient unavailability requires. Based on the insights gained, we propose a novel hybrid HA method that switches between active and passive standby modes depending on the occurrence of failure events. It presents a desirable tradeoff that differs from existing HA approaches: low overhead during normal conditions and fast recovery upon transient or permanent failure events. We have implemented our hybrid method and compared it with existing HA designs in a comprehensive evaluation. The results show that our hybrid method can reduce recovery time by two-thirds compared to passive standby and message overhead by 80% compared to active standby, allowing applications to enjoy uninterrupted processing without paying a high premium.
  • Keywords
    failure analysis; reliability; system recovery; disaster response; distributed stream processing systems; financial data analysis; high availability schemes; machine fail-stop events; permanent failure events; transient failure events; Availability; Computer science; Data analysis; Delay; Distributed computing; Laboratories; Streaming media; Switches; Telecommunication traffic; Testing; high availability; hybrid method; stream processing systems;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on
  • Conference_Location
    Genova
  • ISSN
    1063-6927
  • Print_ISBN
    978-1-4244-7261-1
  • Type
    conf
  • DOI
    10.1109/ICDCS.2010.81
  • Filename
    5541698