Title :
Meteor Shower: A Reliable Stream Processing System for Commodity Data Centers
Author :
Wang, Huayong ; Peh, Li-Shiuan ; Koukoumidis, Emmanouil ; Tao, Shao ; Chan, Mun Choon
Author_Institution :
Comput. Sci. & Artificial Intell. Lab., MIT, Cambridge, MA, USA
Abstract :
Large-scale failures are commonplace in commodity data centers, the major platforms for Distributed Stream Processing Systems (DSPSs). Yet most DSPSs can only handle single-node failures. Here, we propose Meteor Shower, a new fault-tolerant DSPS that overcomes large-scale burst failures while improving overall performance. Meteor Shower is based on checkpoints. Unlike previous schemes, Meteor Shower orchestrates operators' checkpointing activities through tokens. The tokens originate from source operators and trickle down the stream graph, triggering each operator that receives them to checkpoint its own state. Meteor Shower is a suite of three new techniques: 1) source preservation, 2) parallel, asynchronous checkpointing, and 3) application-aware checkpointing. Source preservation allows Meteor Shower to avoid the overhead of redundant tuple saving in prior schemes; parallel, asynchronous checkpointing enables Meteor Shower operators to continue processing streams during a checkpoint; and application-aware checkpointing lets Meteor Shower learn the changing pattern of operators' state size and initiate checkpoints only when the state size is minimal. Together, the three techniques enable Meteor Shower to improve throughput by 226% and lower latency by 57% compared with the prior state of the art. Our results were measured on a prototype implementation running three real-world applications in the Amazon EC2 Cloud.
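The token mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Operator` class, the `run_checkpoint` function, and the rule that an operator checkpoints only after receiving a token from every upstream neighbor are assumptions made for illustration, since the abstract does not specify how operators with multiple inputs are handled.

```python
from collections import deque


class Operator:
    """Hypothetical stream operator holding a small piece of state."""

    def __init__(self, name, state=0):
        self.name = name
        self.state = state
        self.upstream = []
        self.downstream = []
        self.tokens_seen = set()   # names of upstream operators whose token arrived
        self.checkpoints = []      # list of (checkpoint_id, saved_state)

    def connect(self, other):
        """Add a stream edge self -> other."""
        self.downstream.append(other)
        other.upstream.append(self)

    def on_token(self, checkpoint_id, sender):
        """Record a token; checkpoint once all inputs have delivered one.

        Assumption: waiting for every input is this sketch's choice,
        not a rule stated in the abstract.
        """
        self.tokens_seen.add(sender)
        expected = {u.name for u in self.upstream}
        if self.tokens_seen >= expected:
            self.checkpoints.append((checkpoint_id, self.state))
            self.tokens_seen = set()
            return True
        return False


def run_checkpoint(sources, checkpoint_id):
    """Emit tokens at the sources and let them trickle down the graph.

    Returns the operator names in the order they checkpointed.
    """
    order = []
    queue = deque()
    for src in sources:
        # Source operators originate the tokens and checkpoint immediately.
        src.checkpoints.append((checkpoint_id, src.state))
        order.append(src.name)
        for d in src.downstream:
            queue.append((src.name, d))
    while queue:
        sender, op = queue.popleft()
        if op.on_token(checkpoint_id, sender):
            order.append(op.name)
            for d in op.downstream:
                queue.append((op.name, d))
    return order


# Example graph: two sources feed a join, which feeds a sink.
src1, src2 = Operator("src1", 10), Operator("src2", 20)
join, sink = Operator("join", 30), Operator("sink", 40)
src1.connect(join)
src2.connect(join)
join.connect(sink)

order = run_checkpoint([src1, src2], checkpoint_id=1)
```

In this example the join checkpoints only after tokens from both sources have arrived, and the sink checkpoints last, so each operator saves its state exactly once per checkpoint round.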
Keywords :
checkpointing; cloud computing; computer centres; fault tolerant computing; graph theory; parallel processing; Amazon EC2 Cloud; Meteor shower; application-aware check pointing; commodity data centers; distributed stream processing systems; fault-tolerant DSPS; large-scale burst failures; latency; operator check pointing activities; operator state size changing pattern learning; parallel asynchronous check pointing; reliable stream processing system; single-node failures; source preservation; stream graph; throughput; tokens; Cameras; Checkpointing; Digital signal processing; Fault tolerance; Fault tolerant systems; Throughput; fault tolerance; reliability; stream computing;
Conference_Title :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0975-2
DOI :
10.1109/IPDPS.2012.108