• DocumentCode
    610429
  • Title

    Tajo: A distributed data warehouse system on large clusters

  • Author

    Hyunsik Choi ; Jihoon Son ; Haemi Yang ; Hyoseok Ryu ; Byungnam Lim ; SooHyung Kim ; Yon Dohn Chung

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Korea Univ., Seoul, South Korea
  • fYear
    2013
  • fDate
    8-12 April 2013
  • Firstpage
    1320
  • Lastpage
    1323
  • Abstract
    The increasing volumes of relational data let us find an alternative to cope with them. Recently, several hybrid approaches (e.g., HadoopDB and Hive) between parallel databases and Hadoop have been introduced to the database community. Although these hybrid approaches have gained wide popularity, they cannot avoid the choice of suboptimal execution strategies. We believe that this problem is caused by the inherent limits of their architectures. In this demo, we present Tajo, a relational, distributed data warehouse system on shared-nothing clusters. It uses Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine that we have developed instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and the coordinator for workers. The master divides a query into small tasks and disseminates them to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators. A DAG of operators can take two or more input sources and be pipelined within the local query engine. In addition, Tajo can control distributed data flow more flexible than that of MapReduce and supports indexing techniques. By combining these features, Tajo can employ more optimized and efficient query processing, including the existing methods that have been studied in the traditional database research areas. To give a deep understanding of the Tajo architecture and behavior during query processing, the demonstration will allow users to submit TPC-H queries to 32 Tajo cluster nodes. The web-based user interface will show (1) how the submitted queries are planned, (2) how the query are distributed across nodes, (3) the cluster and node status, and (4) the detail of relations and their physical information. Also, we provide the performance evaluation of Tajo compared with Hive.
  • Keywords
    Internet; data warehouses; directed graphs; distributed control; information dissemination; parallel databases; pattern clustering; pipeline processing; query processing; relational databases; storage management; user interfaces; DAG; HDFS; Hadoop distributed file system; Hive; MapReduce framework; TPC-H queries; Tajo architecture; Tajo cluster; Tajo cluster nodes; Web-based user interface; control distributed data flow; database community; directed acyclic graph; large clusters; local query engine; parallel databases; physical operators; query execution engine; query planning; query processing; relational data volumes; relational distributed data warehouse system; shared-nothing clusters; storage layer; suboptimal execution strategies; worker coordinator; Catalogs; Engines; Fault tolerance; Fault tolerant systems; Planning; Query processing;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Engineering (ICDE), 2013 IEEE 29th International Conference on
  • Conference_Location
    Brisbane, QLD
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4673-4909-3
  • Electronic_ISBN
    1063-6382
  • Type

    conf

  • DOI
    10.1109/ICDE.2013.6544934
  • Filename
    6544934