DocumentCode :
267081
Title :
Simulating Hive Cluster for Deployment Planning, Evaluation and Optimization
Author :
Kebing Wang ; Zhaojuan Bian ; Qian Chen ; Ren Wang ; Gen Xu
Author_Institution :
Software & Service Group, Intel Corp., Shanghai, China
fYear :
2014
fDate :
15-18 Dec. 2014
Firstpage :
475
Lastpage :
482
Abstract :
In the era of big data, Hive has quickly gained popularity for its superior capability to manage and analyze very large datasets, both structured and unstructured, residing in distributed storage systems. However, great opportunity comes with great challenges: Hive query performance is affected by many factors, which makes capacity planning and tuning for Hive clusters extremely difficult. These factors include system software stacks (Hive, MapReduce framework, JVM and OS), cluster hardware configurations (processor, memory, storage, and network), and Hive data models and distributions. Current planning methods are mostly trial-and-error or based on very high-level estimation. These approaches are far from efficient or accurate, especially given increasing software stack complexity, hardware diversity, and unavoidable data skew in distributed database systems. In this paper, we propose a Hive simulation framework based on CSMethod, which simulates the whole Hive query execution life cycle, including query plan generation and MapReduce task execution. The framework is validated using typical query operations with varying hardware, software and workload parameters, showing high accuracy and fast simulation speed. We also demonstrate the application of this framework with two real-world use cases: helping customers perform capacity planning and estimate business query response time before system provisioning.
Keywords :
Big Data; Java; data analysis; digital simulation; distributed databases; operating systems (computers); optimisation; planning; query processing; storage management; virtual machines; Big Data; CSMethod; HIVE data distributions; HIVE data models; Hive cluster simulation; Hive query execution life cycle; Hive query performance; JVM; Java virtual machine; MapReduce framework; MapReduce task execution; OS; business query response time estimation; capacity planning; cluster hardware configurations; dataset analysis; dataset management; deployment planning; distributed database system; distributed storage systems; evaluation; hardware diversity; optimization; query plan generation; software stack complexity; unavoidable data skew; Analytical models; Computational modeling; Data models; Engines; Hardware; Software; Time factors; Hive query simulation; big data; cluster simulation; data center capacity planning; performance modeling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on
Conference_Location :
Singapore
Type :
conf
DOI :
10.1109/CloudCom.2014.119
Filename :
7037705