DocumentCode :
597715
Title :
A novel indexing scheme for efficient handling of small files in Hadoop Distributed File System
Author :
Chandrasekar, S. ; Dakshinamurthy, R. ; Seshakumar, P.G. ; Prabavathy, B. ; Babu, Chitra
Author_Institution :
Dept. of Comput. Sci. & Eng., SSN Coll. of Eng., Kalavakkam, India
fYear :
2013
fDate :
4-6 Jan. 2013
Firstpage :
1
Lastpage :
8
Abstract :
Hadoop Distributed File System (HDFS) is designed for reliable storage and management of very large files. All files in HDFS are managed by a single server, the NameNode, which keeps metadata in its main memory for every file stored in HDFS. As a consequence, HDFS suffers a performance penalty as the number of small files increases: storing and managing a large number of small files imposes a heavy burden on the NameNode, and the number of files that can be stored in HDFS is constrained by the size of the NameNode's main memory. Further, HDFS does not take the correlation among files into account, and it does not provide any prefetching mechanism to improve I/O performance. To improve the efficiency of storing and accessing small files on HDFS, we propose a solution based on the work of Dong et al., namely the Extended Hadoop Distributed File System (EHDFS). In this approach, a set of correlated files, as identified by the client, is combined into a single large file to reduce the file count. An indexing mechanism has been built to access the individual files within the corresponding combined file. Further, index prefetching is provided to improve I/O performance and minimize the load on the NameNode. The experimental results indicate that EHDFS reduces the metadata footprint in the NameNode's main memory by 16% and also improves the efficiency of storing and accessing a large number of small files.
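The combine-and-index idea described in the abstract can be illustrated with a minimal in-memory sketch. This is not the EHDFS implementation and does not use any HDFS API; the function names `combine_files` and `read_file`, the sample file names, and the (offset, length) index layout are all assumptions made for illustration only.

```python
import io


def combine_files(files):
    """Concatenate small files into one blob and build an offset index.

    `files` maps a file name to its bytes. Returns (blob, index), where
    index[name] = (offset, length) inside the blob. The layout is a
    hypothetical one chosen for this sketch, not the paper's format.
    """
    buffer = io.BytesIO()
    index = {}
    for name, data in files.items():
        # Record where this file starts and how long it is, then append it.
        index[name] = (buffer.tell(), len(data))
        buffer.write(data)
    return buffer.getvalue(), index


def read_file(blob, index, name):
    """Return one original file's bytes by slicing the combined blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


# Hypothetical sample data: three small "files" become one combined object.
small_files = {"a.txt": b"alpha", "b.txt": b"bravo!", "c.txt": b"c"}
blob, index = combine_files(small_files)
```

In this sketch a single combined object stands in for many small ones, so a metadata server would track one entry instead of three, while the index still allows each original file to be read back individually.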
Keywords :
correlation methods; distributed databases; indexing; meta data; network operating systems; public domain software; storage management; EHDFS; HDFS; I/O performance improvement; NameNode main memory; extended Hadoop distributed file system; file correlation; file count reduction; file management; file storage; index prefetching; indexing mechanism; load minimization; metadata footprint reduction; Computer architecture; Computers; File systems; Indexes; Informatics; Merging; Prefetching; extended hdfs; file correlation; hadoop distributed file system; indexing; prefetching; small file;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Communication and Informatics (ICCCI), 2013 International Conference on
Conference_Location :
Coimbatore
Print_ISBN :
978-1-4673-2906-4
Type :
conf
DOI :
10.1109/ICCCI.2013.6466147
Filename :
6466147