• DocumentCode
    606243
  • Title
    Locality Sensitive Hashing based incremental clustering for creating affinity groups in Hadoop — HDFS - An infrastructure extension
  • Author
    Kala, K.A.; Chitharanjan, K.
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Sree Chithra Thirunal Coll. of Eng., Thiruvananthapuram, India
  • fYear
    2013
  • fDate
    20-21 March 2013
  • Firstpage
    1243
  • Lastpage
    1249
  • Abstract
    Apache's Hadoop is an open source framework for large-scale data analysis and storage. It is an open source implementation of Google's Map/Reduce framework. It enables distributed, data-intensive and parallel applications by decomposing a massive job into smaller tasks and a massive data set into smaller partitions, such that each task processes a different partition in parallel. Hadoop uses the Hadoop Distributed File System (HDFS), an open source implementation of the Google File System (GFS), for storing data; Map/Reduce applications mainly use HDFS for this purpose. HDFS is a very large distributed file system that assumes commodity hardware and provides high throughput and fault tolerance. HDFS stores files as a series of blocks, which are replicated for fault tolerance. The default block placement strategy does not consider the data characteristics and places the data blocks randomly. Customized strategies can improve the performance of HDFS to a great extent. Applications using HDFS require streaming access to files, and if related files are placed on the same set of data nodes, performance can be increased. This paper discusses a method for clustering streaming data onto the same set of data nodes using the technique of Locality Sensitive Hashing. The method uses compact bitwise representations of document vectors, called fingerprints, created using the concept of Locality Sensitive Hashing, to increase data processing speed and performance. The process is carried out without affecting the default fault tolerance properties of Hadoop and requires only minimal changes to the Hadoop framework. (A minimal illustrative sketch of the fingerprinting idea appears after this record.)
  • Keywords
    data analysis; data structures; distributed databases; document handling; network operating systems; pattern clustering; public domain software; random processes; software fault tolerance; GFS; Google MapReduce framework; Google file system; HDFS performance improvement; Hadoop distributed file system; affinity group creation; bitwise document vector representation; customized strategies; data intensive applications; data nodes; data processing performance; data processing speed; default block placement strategy; distributed applications; fault tolerance; fingerprints; incremental clustering; large distributed file system; large scale data analysis; large scale data storage; locality sensitive hashing; massive data set; open source framework; parallel applications; random data block placement; streaming data clustering; streaming file access; Fingerprint; HDFS; Hadoop; Locality Sensitive Hashing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Circuits, Power and Computing Technologies (ICCPCT), 2013 International Conference on
  • Conference_Location
    Nagercoil
  • Print_ISBN
    978-1-4673-4921-5
  • Type
    conf
  • DOI
    10.1109/ICCPCT.2013.6528999
  • Filename
    6528999
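
The abstract refers to compact bitwise fingerprints of document vectors produced with Locality Sensitive Hashing, used so that related data can be routed onto the same set of data nodes. The sketch below is a minimal, illustrative SimHash-style (random-hyperplane LSH) fingerprint in Java; it is not the paper's implementation, and the class name `SimHashSketch`, the 64-bit fingerprint width, the FNV-1a term hash, and the toy documents are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal SimHash-style fingerprint sketch (random-hyperplane LSH).
 * Illustrative only: the paper's exact fingerprint construction is not
 * specified in the abstract; the 64-bit width and term hash are assumptions.
 */
public class SimHashSketch {

    private static final int BITS = 64;

    /** Build a 64-bit fingerprint from a term-frequency vector. */
    public static long fingerprint(Map<String, Integer> termFreqs) {
        int[] acc = new int[BITS];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            long h = hash64(e.getKey());          // pseudo-random hyperplane per term
            int w = e.getValue();                 // term weight (frequency)
            for (int i = 0; i < BITS; i++) {
                // add the weight if the term's hash bit is set, subtract otherwise
                acc[i] += ((h >>> i) & 1L) == 1L ? w : -w;
            }
        }
        long fp = 0L;
        for (int i = 0; i < BITS; i++) {
            if (acc[i] > 0) fp |= (1L << i);      // sign of each accumulator gives one bit
        }
        return fp;
    }

    /** Hamming distance between fingerprints: similar documents agree on most bits. */
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Simple 64-bit string hash (FNV-1a); any stable hash would do. */
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("hadoop", 3); doc1.put("hdfs", 2); doc1.put("block", 1);
        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("hadoop", 2); doc2.put("hdfs", 2); doc2.put("replication", 1);

        long f1 = fingerprint(doc1);
        long f2 = fingerprint(doc2);
        // Documents sharing many weighted terms get fingerprints that are close
        // in Hamming distance and can therefore be grouped together.
        System.out.println("Hamming distance: " + hammingDistance(f1, f2));
    }
}
```

Under a scheme of this kind, incoming documents whose fingerprints are close in Hamming distance could be assigned to the same affinity group, i.e. the same set of HDFS data nodes, which is the placement idea the abstract outlines.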