• DocumentCode
    625669
  • Title
    Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications
  • Author
    Bicer, Tekin ; Jian Yin ; Chiu, Dereck ; Agrawal, Gagan ; Schuchardt, Karen
  • Author_Institution
    Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    1205
  • Lastpage
    1216
  • Abstract
    Compute cycles in high performance systems are increasing at a much faster pace than both storage and wide-area bandwidths. To continue improving the performance of large-scale data analytics applications, compression has therefore become a promising approach. In this context, this paper makes the following contributions. First, we develop a new compression methodology, which exploits the similarities between spatial and/or temporal neighbors in a popular climate simulation dataset and enables high compression ratios and low decompression costs. Second, we develop a framework that can be used to incorporate a variety of compression and decompression algorithms. This framework also supports a simple API to allow integration with an existing application or data processing middleware. Once a compression algorithm is implemented, this framework automatically provides multi-threaded retrieval, multi-threaded data decompression, and the use of informed prefetching and caching. By integrating this framework with a data-intensive middleware, we have applied our compression methodology and framework to three applications over two datasets, including the Global Cloud-Resolving Model (GCRM) climate dataset. We obtained an average compression ratio of 51.68%, and up to 53.27% improvement in the execution time of data analysis applications, achieved by reducing I/O time through moving compressed data.
  • Keywords
    application program interfaces; cache storage; climatology; data analysis; data compression; geophysics computing; information retrieval; middleware; multi-threading; API; GCRM; caching; climate simulation dataset; compression ratios; data processing middleware; data-intensive middleware; decompression costs; global cloud-resolving model climate dataset; high performance systems; informed prefetching; large-scale data analytics applications; multithreaded data decompression; multithreaded retrieval; online compression; wide-area bandwidths; Compression algorithms; Computational modeling; Data models; Meteorology; Middleware; Prefetching; Compression; Data management; Data-intensive computing; Map-Reduce;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on
  • Conference_Location
    Boston, MA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4673-6066-1
  • Type
    conf
  • DOI
    10.1109/IPDPS.2013.81
  • Filename
    6569897