Regional Information Center for Science and Technology (RICeST) - Simulating Hive Cluster for Deployment Planning, Evaluation and Optimization

DocumentCode :

267081

Title :

Simulating Hive Cluster for Deployment Planning, Evaluation and Optimization

Author :

Kebing Wang ; Zhaojuan Bian ; Qian Chen ; Ren Wang ; Gen Xu

Author_Institution :

Software & Service Group, Intel Corp., Shanghai, China

fYear :

2014

fDate :

15-18 Dec. 2014

Firstpage :

475

Lastpage :

482

Abstract :

In the era of big data, Hive has quickly gained popularity for its superior capability to manage and analyze very large datasets, both structured and unstructured, residing in distributed storage systems. However, great opportunity comes with great challenges: Hive query performance is affected by many factors, which makes capacity planning and tuning for a Hive cluster extremely difficult. These factors include the system software stack (Hive, MapReduce framework, JVM and OS), the cluster hardware configuration (processor, memory, storage, and network), and Hive data models and distributions. Current planning methods are mostly based on trial and error or on very high-level estimation. These approaches are neither efficient nor accurate, especially given the increasing software stack complexity, hardware diversity, and unavoidable data skew in distributed database systems. In this paper, we propose a Hive simulation framework based on CSMethod, which simulates the whole Hive query execution life cycle, including query plan generation and MapReduce task execution. The framework is validated using typical query operations with varying hardware, software and workload parameters, showing high accuracy and fast simulation speed. We also demonstrate the application of this framework with two real-world use cases: helping customers to perform capacity planning and to estimate business query response time before system provisioning.

Keywords :

Big Data; Java; data analysis; digital simulation; distributed databases; operating systems (computers); optimisation; planning; query processing; storage management; virtual machines; Big Data; CSMethod; HIVE data distributions; HIVE data models; Hive cluster simulation; Hive query execution life cycle; Hive query performance; JVM; Java virtual machine; MapReduce framework; MapReduce task execution; OS; business query response time estimation; capacity planning; cluster hardware configurations; dataset analysis; dataset management; deployment planning; distributed database system; distributed storage systems; evaluation; hardware diversity; optimization; query plan generation; software stack complexity; unavoidable data skew; Analytical models; Computational modeling; Data models; Engines; Hardware; Software; Time factors; Hive query simulation; big data; cluster simulation; data center capacity planning; performance modeling;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on

Conference_Location :

Singapore

Type :

conf

DOI :

10.1109/CloudCom.2014.119

Filename :

7037705

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=267081