• DocumentCode
    606243
  • Title
    Locality Sensitive Hashing based incremental clustering for creating affinity groups in Hadoop — HDFS - An infrastructure extension
  • Author
    Kala, K.A.; Chitharanjan, K.
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Sree Chithra Thirunal Coll. of Eng., Thiruvananthapuram, India
  • fYear
    2013
  • fDate
    20-21 March 2013
  • Firstpage
    1243
  • Lastpage
    1249
  • Abstract
    Apache's Hadoop is an open source framework for large-scale data analysis and storage. It is an open source implementation of Google's Map/Reduce framework. It enables distributed, data-intensive and parallel applications by decomposing a massive job into smaller tasks and a massive data set into smaller partitions, such that each task processes a different partition in parallel. Hadoop uses the Hadoop Distributed File System (HDFS), an open source implementation of the Google File System (GFS), for storing data; Map/Reduce applications mainly use HDFS for this purpose. HDFS is a very large distributed file system that assumes commodity hardware and provides high throughput and fault tolerance. HDFS stores files as a series of blocks, which are replicated for fault tolerance. The default block placement strategy does not consider the data characteristics and places the data blocks randomly. Customized strategies can improve the performance of HDFS to a great extent. Applications using HDFS require streaming access to files, and if related files are placed on the same set of data nodes, performance can be increased. This paper discusses a method for clustering streaming data onto the same set of data nodes using the technique of Locality Sensitive Hashing. The method uses compact bitwise representations of document vectors, called fingerprints, created using the concept of Locality Sensitive Hashing, to increase data processing speed and performance. The process is carried out without affecting the default fault tolerance properties of Hadoop and requires only minimal changes to the Hadoop framework. (A minimal illustrative sketch of the fingerprinting idea appears after this record.)
  • Keywords
    data analysis; data structures; distributed databases; document handling; network operating systems; pattern clustering; public domain software; random processes; software fault tolerance; GFS; Google MapReduce framework; Google file system; HDFS performance improvement; Hadoop distributed file system; affinity group creation; bitwise document vector representation; customized strategies; data intensive applications; data nodes; data processing performance; data processing speed; default block placement strategy; distributed applications; fault tolerance; fingerprints; incremental clustering; large distributed file system; large scale data analysis; large scale data storage; locality sensitive hashing; massive data set; open source framework; parallel applications; random data block placement; streaming data clustering; streaming file access; Fingerprint; HDFS; Hadoop; Locality Sensitive Hashing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Circuits, Power and Computing Technologies (ICCPCT), 2013 International Conference on
  • Conference_Location
    Nagercoil
  • Print_ISBN
    978-1-4673-4921-5
  • Type
    conf
  • DOI
    10.1109/ICCPCT.2013.6528999
  • Filename
    6528999
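
The abstract refers to compact bitwise fingerprints of document vectors produced with Locality Sensitive Hashing, used so that related data can be routed onto the same set of data nodes. The sketch below is a minimal, illustrative SimHash-style (random-hyperplane LSH) fingerprint in Java; it is not the paper's implementation, and the class name `SimHashSketch`, the 64-bit fingerprint width, the FNV-1a term hash, and the toy documents are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal SimHash-style fingerprint sketch (random-hyperplane LSH).
 * Illustrative only: the paper's exact fingerprint construction is not
 * specified in the abstract; the 64-bit width and term hash are assumptions.
 */
public class SimHashSketch {

    private static final int BITS = 64;

    /** Build a 64-bit fingerprint from a term-frequency vector. */
    public static long fingerprint(Map<String, Integer> termFreqs) {
        int[] acc = new int[BITS];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            long h = hash64(e.getKey());          // pseudo-random hyperplane per term
            int w = e.getValue();                 // term weight (frequency)
            for (int i = 0; i < BITS; i++) {
                // add the weight if the term's hash bit is set, subtract otherwise
                acc[i] += ((h >>> i) & 1L) == 1L ? w : -w;
            }
        }
        long fp = 0L;
        for (int i = 0; i < BITS; i++) {
            if (acc[i] > 0) fp |= (1L << i);      // sign of each accumulator gives one bit
        }
        return fp;
    }

    /** Hamming distance between fingerprints: similar documents agree on most bits. */
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Simple 64-bit string hash (FNV-1a); any stable hash would do. */
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("hadoop", 3); doc1.put("hdfs", 2); doc1.put("block", 1);
        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("hadoop", 2); doc2.put("hdfs", 2); doc2.put("replication", 1);

        long f1 = fingerprint(doc1);
        long f2 = fingerprint(doc2);
        // Documents sharing many weighted terms get fingerprints that are close
        // in Hamming distance and can therefore be grouped together.
        System.out.println("Hamming distance: " + hammingDistance(f1, f2));
    }
}
```

Under a scheme of this kind, incoming documents whose fingerprints are close in Hamming distance could be assigned to the same affinity group, i.e. the same set of HDFS data nodes, which is the placement idea the abstract outlines.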