Title :
Performance Evaluation of Yahoo! S4: A First Look
Author :
Chauhan, Jagmohan ; Chowdhury, Shaiful Alam ; Makaroff, Dwight
Author_Institution :
Dept. of Comput. Sci., Univ. of Saskatchewan, Saskatoon, SK, Canada
Abstract :
Processing of large data sets has recently been dominated by the map/reduce programming model [1], originally proposed by Google and widely adopted through the Apache Hadoop implementation. Over the years, developers have identified weaknesses in processing data sets in batches, as MapReduce does, and have proposed alternatives. One such alternative is continuous processing of data streams. This is particularly suitable for applications in online analytics, monitoring, financial data processing and fraud detection, which require timely processing of data and for which the delay introduced by batch processing is highly undesirable. This processing paradigm has led to the development of systems such as Yahoo! S4 [2] and Twitter Storm. Yahoo! S4 is a general-purpose, distributed and scalable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. As these frameworks are quite new, there is a need to understand their performance for real-time applications and to identify existing issues in terms of scalability, execution time and fault tolerance. We performed an empirical evaluation of one application on Yahoo! S4, focusing on performance in terms of scalability, lost events and fault tolerance. The findings of our analysis can help in understanding the challenges of developing stream-based data-intensive computing tools and thus provide guidance for future development.
Keywords :
data handling; distributed programming; public domain software; software fault tolerance; software performance evaluation; Apache Hadoop implementation; Google; MapReduce programming model; Twitter Storm; Yahoo! S4 performance evaluation; batch processing; continuous unbounded data stream processing; general-purpose distributed scalable platform; online analytics; online financial data processing; online fraud detection; online monitoring; real time applications; stream-based data intensive computing tools; Computational modeling; Data models; Data processing; Fault tolerance; Fault tolerant systems; Real-time systems; Scalability; Performance; Stream-based Data Intensive computing; Yahoo! S4;
Conference_Title :
P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2012 Seventh International Conference on
Conference_Location :
Victoria, BC
Print_ISBN :
978-1-4673-2991-0
DOI :
10.1109/3PGCIC.2012.55