DocumentCode :
58018
Title :
Multi-Threaded Hierarchical Clustering by Parallel Nearest-Neighbor Chaining
Author :
Yongkweon Jeon ; Sungroh Yoon
Author_Institution :
Dept. of Electr. & Comput. Eng., Seoul Nat. Univ., Seoul, South Korea
Volume :
26
Issue :
9
fYear :
2015
fDate :
Sept. 1 2015
Firstpage :
2534
Lastpage :
2548
Abstract :
Hierarchical agglomerative clustering (HAC) is a clustering method widely used in various disciplines from astronomy to zoology. HAC is useful for discovering hierarchical structure embedded in input data. The cost of executing HAC on large data is typically high, due to the need for maintaining global inter-cluster distance information throughout the execution. To address this issue, we propose a new parallelization scheme for multi-threaded shared-memory machines based on the concept of nearest-neighbor (NN) chains. The proposed multi-threaded algorithm allocates available threads into two groups, one for managing NN chains and the other for updating distance information. In-depth analysis of our approach gives insight into the ideal configuration of threads and theoretical performance bounds. We evaluate our proposed method by testing it with multiple public datasets and comparing its performance with that of several alternatives. In our test, the proposed method completes hierarchical clustering 3.09-51.79 times faster than the alternatives. Our test results also reveal the effects of performance-limiting factors such as starvation in chain growing, overhead incurred from using synchronization locks, and hardware aspects including memory-bandwidth saturation. According to our evaluation, the proposed scheme is effective in improving the HAC algorithm, achieving significant gains over the alternatives in terms of runtime and scalability.
Keywords :
multi-threading; pattern clustering; resource allocation; shared memory systems; HAC algorithm; hierarchical agglomerative clustering; hierarchical structure discovery; in-depth analysis; memory-bandwidth saturation; multithreaded hierarchical clustering; multithreaded shared memory machine; parallel nearest neighbor chaining; parallelization scheme; performance bound; performance limiting factors; runtime; scalability; synchronization locks; thread allocation; Algorithm design and analysis; Clustering algorithms; Clustering methods; Couplings; Instruction sets; Merging; Hierarchical clustering; multi-core CPU; multi-threading; parallelization; unsupervised learning;
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
ieee
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/TPDS.2014.2355205
Filename :
6893001
Link To Document :
بازگشت