Tajo: A distributed data warehouse system on large clusters

Author

Hyunsik Choi ; Jihoon Son ; Haemi Yang ; Hyoseok Ryu ; Byungnam Lim ; SooHyung Kim ; Yon Dohn Chung

Author_Institution

Dept. of Comput. Sci. & Eng., Korea Univ., Seoul, South Korea

fYear

2013

fDate

8-12 April 2013

Firstpage

1320

Lastpage

1323

Abstract

The increasing volumes of relational data let us find an alternative to cope with them. Recently, several hybrid approaches (e.g., HadoopDB and Hive) between parallel databases and Hadoop have been introduced to the database community. Although these hybrid approaches have gained wide popularity, they cannot avoid the choice of suboptimal execution strategies. We believe that this problem is caused by the inherent limits of their architectures. In this demo, we present Tajo, a relational, distributed data warehouse system on shared-nothing clusters. It uses Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine that we have developed instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and the coordinator for workers. The master divides a query into small tasks and disseminates them to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators. A DAG of operators can take two or more input sources and be pipelined within the local query engine. In addition, Tajo can control distributed data flow more flexible than that of MapReduce and supports indexing techniques. By combining these features, Tajo can employ more optimized and efficient query processing, including the existing methods that have been studied in the traditional database research areas. To give a deep understanding of the Tajo architecture and behavior during query processing, the demonstration will allow users to submit TPC-H queries to 32 Tajo cluster nodes. The web-based user interface will show (1) how the submitted queries are planned, (2) how the query are distributed across nodes, (3) the cluster and node status, and (4) the detail of relations and their physical information. Also, we provide the performance evaluation of Tajo compared with Hive.

Keywords

Internet; data warehouses; directed graphs; distributed control; information dissemination; parallel databases; pattern clustering; pipeline processing; query processing; relational databases; storage management; user interfaces; DAG; HDFS; Hadoop distributed file system; Hive; MapReduce framework; TPC-H queries; Tajo architecture; Tajo cluster; Tajo cluster nodes; Web-based user interface; control distributed data flow; database community; directed acyclic graph; large clusters; local query engine; parallel databases; physical operators; query execution engine; query planning; query processing; relational data volumes; relational distributed data warehouse system; shared-nothing clusters; storage layer; suboptimal execution strategies; worker coordinator; Catalogs; Engines; Fault tolerance; Fault tolerant systems; Planning; Query processing;

fLanguage

English

Publisher

ieee

Conference_Titel

Data Engineering (ICDE), 2013 IEEE 29th International Conference on

Conference_Location

Brisbane, QLD

ISSN

1063-6382

Print_ISBN

978-1-4673-4909-3

Electronic_ISBN

1063-6382

Type

conf

DOI

10.1109/ICDE.2013.6544934

Filename

6544934