• DocumentCode
    2907359
  • Title

    MR-DBSCAN: An Efficient Parallel Density-Based Clustering Algorithm Using MapReduce

  • Author

    He, Yaobin ; Tan, Haoyu ; Luo, Wuman ; Mao, Huajian ; Ma, Di ; Feng, Shengzhong ; Fan, Jianping

  • Author_Institution
    Shenzhen Inst. of Adv. Technol., Shenzhen, China
  • fYear
    2011
  • fDate
    7-9 Dec. 2011
  • Firstpage
    473
  • Lastpage
    480
  • Abstract
    Data clustering is an important data mining technology that plays a crucial role in numerous scientific applications. However, it is challenging because real-world datasets have been growing rapidly to extra-large scale. Meanwhile, MapReduce is a desirable parallel programming platform that is widely applied in many kinds of data processing fields. In this paper, we propose an efficient parallel density-based clustering algorithm and implement it with a 4-stage MapReduce paradigm. Furthermore, we adopt a quick partitioning strategy for large-scale non-indexed data. We study the metric of merging among bordering partitions and optimize it. Finally, we evaluate our work on real large-scale datasets using the Hadoop platform. Results show that our approach achieves good speedup and scale-up.
  • Keywords
    data mining; parallel programming; pattern clustering; Hadoop platform; MR-DBSCAN; data clustering; large scale nonindexed data; parallel density-based clustering algorithm MapReduce; partitioning strategy; Algorithm design and analysis; Clustering algorithms; Indexes; Merging; Partitioning algorithms; Silicon; Spatial databases; DBSCAN; MapReduce; parallel system
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on
  • Conference_Location
    Tainan
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4577-1875-5
  • Type

    conf

  • DOI
    10.1109/ICPADS.2011.83
  • Filename
    6121313