DocumentCode :
3776791
Title :
An extended HDFS with an AVATAR NODE to handle both small files and to eliminate single point of failure
Author :
Tanvi Gupta; S S Handa
Author_Institution :
CSE Dept., Manav Rachna International University, Faridabad, India
fYear :
2015
Firstpage :
67
Lastpage :
71
Abstract :
Hadoop is an open-source software framework that supports big data (large datasets) in a distributed environment. Hadoop creates a cluster of machines and coordinates the work among them. It has two major components: HDFS and MapReduce. HDFS is designed to store large amounts of data reliably and to provide high availability of that data to user applications running at the client side. HDFS splits files into data blocks and, to enable reliable and extremely rapid computation, stores replicas of each block across a pool of servers. All files in HDFS are managed by a single server, the 'Name Node', which keeps metadata for every file stored in HDFS in its main memory. As a consequence, HDFS suffers a performance penalty as the number of small files grows: storing and managing a large number of small files imposes a heavy burden on the Name Node, and the number of files that can be stored in HDFS is constrained by the size of the Name Node's main memory. Further, HDFS does not take the correlation among files into account, and it provides no prefetching mechanism to improve I/O performance. To improve the efficiency of storing and accessing small files on HDFS, a solution based on the work of Dong et al., namely the Extended Hadoop Distributed File System (EHDFS), is used. In this approach, a set of correlated files, as identified by the client, is combined into a single large file to reduce the file count. An indexing mechanism, built by Chandrasekar S. et al., is used to access the individual files within the corresponding combined file, and index prefetching is provided to improve I/O performance and minimize the load on the Name Node. Further, to remove the single point of failure and increase the storage capacity of the architecture, an 'Avatar Node' is used, following the solution given by S. Chandra Mouliswaran et al. This paper therefore focuses on increasing the efficiency of the indexing mechanism for handling small files in HDFS by adding an 'Avatar Node' that handles failover of the primary Name Node.
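Note: the abstract gives no code; the following is a minimal illustrative sketch, using the standard Hadoop FileSystem Java API, of the general merge-and-index idea behind EHDFS: several small files are appended into one combined HDFS file while an in-memory map records each file's (offset, length), and an individual file is later recovered by seeking into the combined file. The class and method names (SmallFileMerger, mergeAndIndex, readSmallFile) and paths are hypothetical, not the authors' implementation, and the real EHDFS index and prefetching layer is more elaborate.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: combine small local files into one HDFS file and
// keep a per-file (offset, length) index, so the Name Node tracks a single
// large file instead of holding metadata for many small ones.
public class SmallFileMerger {

    // Index entry: where a small file starts inside the combined file and
    // how many bytes it occupies.
    public static class IndexEntry {
        final long offset;
        final int length;
        IndexEntry(long offset, int length) { this.offset = offset; this.length = length; }
    }

    // Append each small local file to a single combined HDFS file and record
    // its offset and length in an index map keyed by file name.
    public static Map<String, IndexEntry> mergeAndIndex(
            FileSystem fs, List<String> localFiles, Path combined) throws IOException {
        Map<String, IndexEntry> index = new LinkedHashMap<>();
        long offset = 0;
        try (FSDataOutputStream out = fs.create(combined)) {
            for (String local : localFiles) {
                byte[] data = Files.readAllBytes(Paths.get(local));
                out.write(data);
                index.put(local, new IndexEntry(offset, data.length));
                offset += data.length;
            }
        }
        return index;
    }

    // Read one small file back by seeking to its offset in the combined file.
    public static byte[] readSmallFile(
            FileSystem fs, Path combined, IndexEntry entry) throws IOException {
        byte[] buf = new byte[entry.length];
        try (FSDataInputStream in = fs.open(combined)) {
            in.seek(entry.offset);
            in.readFully(buf);
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path combined = new Path("/user/demo/combined.bin"); // hypothetical path
        Map<String, IndexEntry> index =
                mergeAndIndex(fs, List.of("a.txt", "b.txt"), combined);
        byte[] a = readSmallFile(fs, combined, index.get("a.txt"));
        System.out.println("a.txt has " + a.length + " bytes");
    }
}

In a full EHDFS-style system the index itself would be persisted and prefetched by the client library, and an Avatar (standby) Name Node would take over the namespace on failover; this sketch only shows the client-side merge-and-lookup concept.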
Keywords :
"Metadata","Indexes","Big data","Merging","Avatars","Distributed databases","Servers"
Publisher :
ieee
Conference_Titel :
Soft Computing Techniques and Implementations (ICSCTI), 2015 International Conference on
Type :
conf
DOI :
10.1109/ICSCTI.2015.7489606
Filename :
7489606