• DocumentCode
    3434237
  • Title
    AptStore: Dynamic Storage Management for Hadoop
  • Author
    Krish, K.R. ; Khasymski, Aleksandr ; Butt, Ali R. ; Tiwari, Sunita ; Bhandarkar, Milind
  • Volume
    1
  • fYear
    2013
  • fDate
    2-5 Dec. 2013
  • Firstpage
    33
  • Lastpage
    41
  • Abstract
    Typical Hadoop setups employ Direct Attached Storage (DAS) with compute nodes and uniform replication of data to sustain high I/O throughput and fault tolerance. However, not all data is accessed at the same time or rate. Thus, if a large replication factor is used to support higher throughput for popular data, it wastes storage by unnecessarily replicating unpopular data as well. Conversely, if a smaller replication factor is used to conserve storage for unpopular data, even popular data gets fewer replicas and thus lower I/O throughput. We present AptStore, a dynamic data management system for Hadoop, which aims to improve overall I/O throughput while reducing storage cost. We design a tiered storage system that uses standard DAS for popular data to sustain high I/O throughput, and network-attached enterprise filers for cost-effective, fault-tolerant, but lower-throughput storage of unpopular data. We design a file Popularity Prediction Algorithm (PPA) that analyzes file system audit logs, predicts the appropriate storage policy for each file, and uses this information for transparent data movement between tiers. Our evaluation of AptStore on a real cluster shows a 21.3% improvement in application execution time over standard Hadoop, while trace-driven simulations show a 23.7% increase in read throughput and a 43.4% reduction in the storage capacity requirement of the system.
  • Keywords
    distributed processing; public domain software; software fault tolerance; storage management; AptStore; DAS; Hadoop; Hadoop distributed file system; I/O throughput; PPA; application execution time; direct attached storage; dynamic data management system; dynamic storage management; fault tolerance; file popularity prediction algorithm; file system audit log analysis; network-attached enterprise filers; storage cost reduction; tiered storage design; trace driven simulations; transparent data movement; Algorithm design and analysis; Bandwidth; Fault tolerance; Fault tolerant systems; Prediction algorithms; Standards; Throughput;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on
  • Conference_Location
    Bristol
  • Type
    conf
  • DOI
    10.1109/CloudCom.2013.12
  • Filename
    6753775