DocumentCode :
3437021
Title :
CHAC: An Effective Attribute Clustering Algorithm for Large-Scale Data Processing
Author :
Gu, Xiaoyan ; Yang, Xiufeng ; Wang, Weiping ; Jin, Yan ; Meng, Dan
Author_Institution :
Inst. of Comput. Technol., Beijing, China
fYear :
2012
fDate :
28-30 June 2012
Firstpage :
94
Lastpage :
98
Abstract :
Hadoop has become a leading architecture for large-scale data processing. One efficient way to accelerate data processing is the column-oriented storage technique, which has recently been integrated into the Hadoop family. However, designing an appropriate attribute clustering algorithm that achieves optimal data processing performance on a column-oriented Hadoop system remains an open problem. In this paper, we propose a novel algorithm called CHAC to solve this problem. CHAC considers both overlapping and non-overlapping attribute clusters. In addition, an adjustable parameter limits space overhead and thereby prevents excessive attribute redundancy. Experimental results on the TPC-H benchmark demonstrate the efficiency and effectiveness of the proposed algorithm.
Keywords :
data handling; pattern clustering; query processing; storage management; CHAC; TPC-H Benchmark; attribute clustering algorithm; attribute redundancy; column-oriented Hadoop system; column-oriented storage technique; data processing acceleration; large-scale data processing; limiting space overhead; nonoverlapping attribute cluster; optimal data processing performance; query execution time; Algorithm design and analysis; Clustering algorithms; Data models; Database systems; Itemsets; Partitioning algorithms; Attribute Clustering; CHAC; Hadoop; Overlapping;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Networking, Architecture and Storage (NAS), 2012 IEEE 7th International Conference on
Conference_Location :
Xiamen, Fujian
Print_ISBN :
978-1-4673-1889-1
Type :
conf
DOI :
10.1109/NAS.2012.16
Filename :
6310881