Title :
Scalable parallel OPTICS data clustering using graph algorithmic techniques
Author :
Ali Patwary, Md Mostofa ; Palsetia, Diana ; Agrawal, Ankit ; Wei-keng Liao ; Manne, Fredrik ; Choudhary, Alok
Author_Institution :
Northwestern Univ., Evanston, IL, USA
Abstract :
OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and PRIM´s Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve a high parallelism for distributed cluster extraction. Using high dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.
Keywords :
data mining; data structures; parallel algorithms; pattern clustering; reachability analysis; trees (mathematics); MPI; OpenMP; POPTICS; PRIM minimum spanning tree algorithm; adjustable reachability distance thresholds; arbitrary-shaped clusters; data mining technique; disjoint-set data structure; distributed cluster extraction; distributed-memory machine; floating point numbers; graph algorithmic techniques; hierarchical density-based data clustering algorithm; high dimensional datasets; noise elimination; scalable parallel OPTICS algorithm; scalable parallel OPTICS data clustering; sequential data access order; shared-memory machine; Algorithm design and analysis; Clustering algorithms; Complexity theory; Data structures; Noise; Optics; Vegetation; Density-based clustering; Disjoint-set data structure; Minimum spanning tree; Union-Find algorithm;
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for
Conference_Location :
Denver, CO
Print_ISBN :
978-1-4503-2378-9
DOI :
10.1145/2503210.2503255