DBSCAN on Resilient Distributed Datasets

Author

Cordova, Irving; Moh, Teng-Sheng

Author_Institution

Dept. of Comput. Sci., San Jose State Univ., San José, CA, USA

fYear

2015

fDate

20-24 July 2015

Firstpage

531

Lastpage

540

Abstract

DBSCAN is a well-known density-based data clustering algorithm that is widely used because it can find arbitrarily shaped clusters in noisy data. However, DBSCAN is hard to scale, which limits its utility when working with large data sets. Resilient Distributed Datasets (RDDs), on the other hand, are a fast data-processing abstraction created explicitly for in-memory computation on large data sets. This paper presents RDD-DBSCAN, a new algorithm based on DBSCAN that uses the Resilient Distributed Datasets approach. RDD-DBSCAN overcomes the scalability limitations of the traditional DBSCAN algorithm by operating in a fully distributed fashion. The paper also evaluates an implementation of RDD-DBSCAN using Apache Spark, the official RDD implementation.
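
For readers unfamiliar with the base algorithm, the following is a minimal single-machine sketch of textbook DBSCAN in Python. It is not the RDD-DBSCAN algorithm presented in the paper and uses no Spark API. The names `dbscan`, `region_query`, `eps`, `min_pts`, and `NOISE` are illustrative choices, not taken from the paper. The brute-force neighbour search compares every point with every other point, which is one way to see why the traditional algorithm becomes costly on large data sets.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

NOISE = -1  # label for points that belong to no cluster (assumed convention)


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; it may still be claimed later as a border point.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

As a usage example, calling `dbscan` with `eps=1.5` and `min_pts=3` on two well-separated 2x2 squares of points plus one distant point assigns each square its own cluster label and marks the distant point as `NOISE`.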

Keywords

data handling; distributed processing; pattern clustering; Apache Spark; RDD-DBSCAN algorithm; arbitrarily shaped clusters; data-processing abstraction; density-based data clustering algorithm; in-memory computation; official RDD implementation; resilient distributed datasets approach; Clustering algorithms; Distributed databases; Machine learning algorithms; Noise; Partitioning algorithms; Prediction algorithms; Sparks; DBSCAN; MapReduce; Resilient Distributed Datasets; data clustering; data partition; parallel systems

fLanguage

English

Publisher

IEEE

Conference_Title

2015 International Conference on High Performance Computing & Simulation (HPCS)

Conference_Location

Amsterdam

Print_ISBN

978-1-4673-7812-3

Type

conf

DOI

10.1109/HPCSim.2015.7237086

Filename

7237086