Regional Information Center for Science and Technology - An Efficient Hierarchical Clustering Method for Large Datasets with Map-Reduce

DocumentCode :

2962285

Title :

An Efficient Hierarchical Clustering Method for Large Datasets with Map-Reduce

Author :

Sun, Tianyang ; Shu, Chengchun ; Li, Feng ; Yu, Haiyan ; Ma, Lili ; Fang, Yitong

Author_Institution :

Chinese Acad. of Sci., Grad. Univ., Beijing, China

fYear :

2009

fDate :

8-11 Dec. 2009

Firstpage :

494

Lastpage :

499

Abstract :

Large datasets have become common in applications such as Internet services, genomic sequence analysis, and astronomical telescopes. The demanding memory and computation requirements force data mining algorithms to be parallelized in order to handle such datasets efficiently. This paper describes our experience of grouping Internet users by mining a huge volume of Web access logs of up to 100 gigabytes. The application is realized using hierarchical clustering algorithms with Map-Reduce, a parallel processing framework over clusters. However, a straightforward implementation of the algorithms suffers from efficiency problems, namely inadequate memory and long execution time. This paper presents an efficient hierarchical clustering method for mining large datasets with Map-Reduce. The method includes two optimization techniques: "Batch Updating", to reduce the computational time and communication costs among cluster nodes, and "Co-occurrence based feature selection", to decrease the dimension of feature vectors and eliminate noise features. The empirical study shows that the first technique significantly reduces the I/O and distributed communication overhead, reducing the total execution time to nearly 1/15. Experimentally, the second technique efficiently simplifies the features while also improving the accuracy of hierarchical clustering.

Keywords :

Internet; data mining; optimisation; parallel processing; pattern clustering; Internet services; Map-Reduce; Web access log; astronomical telescope; batch updating optimization technique; computation power force data mining algorithms; cooccurrence based feature selection; distributed communication overhead; genomic sequence analysis; hierarchical clustering algorithm; internet users grouping; parallel processing framework; Bioinformatics; Clustering algorithms; Clustering methods; Concurrent computing; Data mining; Genomics; Optimization methods; Parallel processing; Telescopes; Web and internet services; Batch Updating; Hierarchical clustering; feature selection;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel and Distributed Computing, Applications and Technologies, 2009 International Conference on

Conference_Location :

Higashi Hiroshima

Print_ISBN :

978-0-7695-3914-0

Type :

conf

DOI :

10.1109/PDCAT.2009.46

Filename :

5372757

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2962285