• DocumentCode
    2028349
  • Title
    DBSCAN on Resilient Distributed Datasets
  • Author
    Cordova, Irving; Moh, Teng-Sheng
  • Author_Institution
    Dept. of Comput. Sci., San Jose State Univ., San José, CA, USA
  • fYear
    2015
  • fDate
    20-24 July 2015
  • Firstpage
    531
  • Lastpage
    540
  • Abstract
    DBSCAN is a well-known density-based data clustering algorithm that is widely used due to its ability to find arbitrarily shaped clusters in noisy data. However, DBSCAN is hard to scale, which limits its utility when working with large data sets. Resilient Distributed Datasets (RDDs), on the other hand, are a fast data-processing abstraction created explicitly for in-memory computation of large data sets. This paper presents a new algorithm based on DBSCAN using the Resilient Distributed Datasets approach: RDD-DBSCAN. RDD-DBSCAN overcomes the scalability limitations of the traditional DBSCAN algorithm by operating in a fully distributed fashion. The paper also evaluates an implementation of RDD-DBSCAN using Apache Spark, the official RDD implementation.
  • Keywords
    data handling; distributed processing; pattern clustering; Apache Spark; RDD-DBSCAN algorithm; arbitrarily shaped clusters; data-processing abstraction; density-based data clustering algorithm; in-memory computation; official RDD implementation; resilient distributed datasets approach; Clustering algorithms; Distributed databases; Machine learning algorithms; Noise; Partitioning algorithms; Prediction algorithms; Sparks; Apache Spark; DBSCAN; MapReduce; Resilient Distributed Datasets; data clustering; data partition; parallel systems;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing & Simulation (HPCS), 2015 International Conference on
  • Conference_Location
    Amsterdam
  • Print_ISBN
    978-1-4673-7812-3
  • Type
    conf
  • DOI
    10.1109/HPCSim.2015.7237086
  • Filename
    7237086