DocumentCode :
3144621
Title :
Semi-Streamed Index Join for near-real time execution of ETL transformations
Author :
Bornea, Mihaela A. ; Deligiannakis, Antonios ; Kotidis, Yannis ; Vassalos, Vasilis
Author_Institution :
Athens U. of Econ & Bus., Athens, Greece
fYear :
2011
fDate :
11-16 April 2011
Firstpage :
159
Lastpage :
170
Abstract :
Active data warehouses have emerged as a new business intelligence paradigm where data in the integrated repository is refreshed in near real-time. This shift of practices achieves higher consistency between the stored information and the latest updates, which in turn crucially influences the output of decision-making processes. In this paper we focus on the changes required in the implementation of Extract-Transform-Load (ETL) operations, which now need to be executed in an online fashion. In particular, ETL transformations frequently include a join between an incoming stream of updates and a disk-resident table of historical data or metadata. In this context we propose a novel Semi-Streaming Index Join (SSIJ) algorithm that maximizes the throughput of the join by buffering stream tuples and then judiciously selecting how best to amortize expensive disk seeks for blocks of the stored relation among a large number of stream tuples. The relation blocks required for joining with the stream are loaded from disk based on an optimal plan. In order to maximize the utilization of the available memory space for performing the join, our technique incorporates a simple but effective cache replacement policy for managing the retrieved blocks of the relation. Moreover, SSIJ is able to adapt to changing characteristics of the stream (i.e. arrival rate, data distribution) by dynamically adjusting the memory allocated between the cached relation blocks and the stream. Our experiments with a variety of synthetic and real data sets demonstrate that SSIJ consistently outperforms the state-of-the-art algorithm in terms of the maximum sustainable throughput of the join, while also being able to accommodate deadlines on stream tuple processing.
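The buffering idea in the abstract can be illustrated with a small sketch. This is not the SSIJ algorithm from the paper: it swaps the paper's optimal loading plan for a simple greedy rule (load the block with the most waiting stream tuples) and uses plain LRU as the cache policy. The class name `BufferedIndexJoin`, its parameters, and the in-memory dictionaries standing in for the disk-resident relation and its index are all hypothetical.

```python
from collections import OrderedDict, defaultdict


class BufferedIndexJoin:
    """Toy semi-stream join (illustrative only, not the paper's SSIJ).

    Stream tuples are buffered per relation block, so that one block
    read can serve many waiting tuples.
    """

    def __init__(self, blocks, key_to_block, buffer_size, cache_size):
        self.blocks = blocks              # block_id -> {join_key: row}; stands in for disk
        self.key_to_block = key_to_block  # join_key -> block_id; stands in for the index
        self.buffer_size = buffer_size    # max buffered stream tuples (assumed >= 1)
        self.cache_size = cache_size      # max cached relation blocks
        self.pending = defaultdict(list)  # block_id -> buffered stream tuples
        self.pending_count = 0
        self.cache = OrderedDict()        # cached blocks in LRU order
        self.disk_reads = 0

    def _get_block(self, block_id):
        """Return a block, counting a simulated disk read on a cache miss."""
        if block_id in self.cache:
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        self.disk_reads += 1
        block = self.blocks[block_id]
        self.cache[block_id] = block
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used block
        return block

    def _join_fullest_block(self):
        """Greedy step: serve the block with the most waiting tuples."""
        block_id = max(self.pending, key=lambda b: len(self.pending[b]))
        waiting = self.pending.pop(block_id)
        self.pending_count -= len(waiting)
        block = self._get_block(block_id)
        return [(t, block[t[0]]) for t in waiting]

    def insert(self, stream_tuple):
        """Accept one stream tuple (join key first); return any join results."""
        key = stream_tuple[0]
        block_id = self.key_to_block.get(key)
        if block_id is None:
            return []  # no matching relation row
        if block_id in self.cache:
            # Cached block: join immediately, no disk access needed.
            self.cache.move_to_end(block_id)
            return [(stream_tuple, self.cache[block_id][key])]
        self.pending[block_id].append(stream_tuple)
        self.pending_count += 1
        results = []
        while self.pending_count >= self.buffer_size:
            results += self._join_fullest_block()
        return results

    def flush_all(self):
        """Join every buffered tuple, e.g. at end of stream."""
        results = []
        while self.pending:
            results += self._join_fullest_block()
        return results
```

In this sketch a single simulated disk read answers every tuple waiting on the same block, which is the amortization the abstract refers to. The fixed `buffer_size` and `cache_size` are a simplification: the abstract states that SSIJ adjusts this memory split dynamically, which the sketch does not attempt.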
Keywords :
competitive intelligence; data warehouses; decision making; information storage; meta data; storage management; ETL transformation; SSIJ; active data warehouse; business intelligence paradigm; cache replacement policy; decision making process; disk resident table; extract transform load operation; integrated data repository; memory allocation; memory space; metadata; near real time execution; semistreamed index join algorithm; stream tuple buffering; Buffer storage; Data warehouses; Heuristic algorithms; Indexes; Memory management; Radiation detectors; Throughput;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering (ICDE), 2011 IEEE 27th International Conference on
Conference_Location :
Hannover
ISSN :
1063-6382
Print_ISBN :
978-1-4244-8959-6
Electronic_ISBN :
1063-6382
Type :
conf
DOI :
10.1109/ICDE.2011.5767906
Filename :
5767906