DocumentCode
610406
Title
SASH: Enabling continuous incremental analytic workflows on Hadoop
Author
Sethi, M. ; Sachindran, N. ; Raghavan, Srinath
Author_Institution
IBM India Res. Lab., Bangalore, India
fYear
2013
fDate
8-12 April 2013
Firstpage
1219
Lastpage
1230
Abstract
There is an emerging class of enterprise applications in areas such as log data analysis, information discovery, and social media marketing that involve analytics over large volumes of unstructured and semi-structured data. These applications are leveraging new analytics platforms based on the MapReduce framework and its open source Hadoop implementation. While this trend has engendered work on high-level data analysis languages, NoSQL data stores, workflow engines, etc., there has been very little attention to the challenges of deploying analytic workflows into production for continuous operation. In this paper, we argue that an essential platform component for enabling continuous production analytic workflows is an analytics store. We highlight five key requirements that impact the design of such a store: (i) efficient incremental operations, (ii) flexible storage model for hierarchical data, (iii) snapshot support, (iv) object-level incremental updates, and (v) support for handling change sets. We describe the design of SASH, a scalable analytics store that we have developed on top of HBase to address these requirements. Using the workload from a production workflow that powers search within IBM's intranet and extranet, we demonstrate orders of magnitude improvement in IO performance using SASH.
Keywords
SQL; data analysis; data mining; data structures; high level languages; intranets; object-oriented programming; public domain software; social networking (online); storage management; workflow management software; HBase; IBM intranet and extranet; IO performance; MapReduce framework; NoSQL data stores; SASH; continuous incremental analytic workflows; continuous operation; continuous production analytic workflows; enterprise applications; flexible storage model; handling change sets support; hierarchical data; high-level data analysis languages; incremental operations; information discovery; log data analysis; object-level incremental updates; open source Hadoop implementation; platform component; production workflow; scalable analytics store; semistructured data; snapshot support; social media marketing; unstructured data; workflow engines; Computer architecture; Data models; Indexing; Libraries; Media; Production;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Engineering (ICDE), 2013 IEEE 29th International Conference on
Conference_Location
Brisbane, QLD
ISSN
1063-6382
Print_ISBN
978-1-4673-4909-3
Electronic_ISBN
1063-6382
Type
conf
DOI
10.1109/ICDE.2013.6544911
Filename
6544911