  • DocumentCode
    1398532
  • Title
    Data Replication in Data Intensive Scientific Applications with Performance Guarantee
  • Author
    Nukarapu, Dharma Teja ; Tang, Bin ; Wang, Liqiang ; Lu, Shiyong
  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., Wichita State Univ., Wichita, KS, USA
  • Volume
    22
  • Issue
    8
  • fYear
    2011
  • Firstpage
    1299
  • Lastpage
    1306
  • Abstract
    Data replication has been widely adopted in data-intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data-intensive applications, has been proven to be NP-hard and even non-approximable, which makes it difficult to solve. Meanwhile, most previous research in this field is either theoretical investigation without practical consideration, or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee, but can also be implemented in a distributed and practical manner. Specifically, we design a polynomial-time centralized replication algorithm that achieves at least half of the reduction in total data file access delay obtained by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and that it is more scalable and more adaptive to dynamic changes in file access patterns in Data Grids.
  • Keywords
    cache storage; computational complexity; data analysis; data reduction; distributed processing; grid computing; GridSim; NP-hard; bandwidth consumption; data file access delay; data file transfer time reduction; data grids; data intensive scientific applications; data replication algorithm; distributed caching algorithm; distributed caching technique; distributed environment; distributed grid simulator; file access patterns; intuitive heuristics; optimal replication solution; polynomial time centralized replication algorithm; popular file caching technique; theoretical performance guarantee; Algorithm design and analysis; Bandwidth; Computational modeling; Data models; Distributed databases; Greedy algorithms; Heuristic algorithms; Data intensive applications; data replication; simulations
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2010.207
  • Filename
    5661771