• DocumentCode
    260427
  • Title

    Assessing the Impact of Concurrent Replication with Canceling in Parallel Jobs

  • Author

    Qiu, Zhan ; Perez, Juan F.

  • Author_Institution
    Department of Computing, Imperial College London, London, UK
  • fYear
    2014
  • fDate
    9-11 Sept. 2014
  • Firstpage
    31
  • Lastpage
    40
  • Abstract
    Parallel job processing has become a key feature of many software applications, e.g., in scientific computing. Parallelization allows these applications to exploit large resource pools, such as cloud or grid data centers. However, a job composed of a large number of parallel tasks fails if any of its tasks fails, requiring reprocessing and causing additional delays. In this paper, we explore the effect that the replication of parallel jobs has on job reliability and response time, as well as on resource utilization. The replication mechanism consists of concurrently processing replicas, at either the job or the task level, retrieving the results of the replica that finishes first, if any, and canceling any remaining replicas still in process. We propose a stochastic model that explicitly considers parallel job processing, supports replication at both the job and the task level, and handles general arrival processes. We develop a numerically efficient algorithm to solve large-scale instances of the model and compute key performance metrics. We observe that the task cancellation mechanism offers an effective way of limiting the increase in resource utilization, allowing the use of replicas that not only increase job reliability but also have the potential to reduce response times.
  • Keywords
    concurrency control; iterative methods; parallel processing; resource allocation; cloud computing; concurrent replica processing; concurrent replication; explicit analysis; general arrival process handling; grid data centers; job level; job reliability; large-scale instances; numerically-efficient algorithm; parallel job processing; parallel job replication; parallel task failure; parallelization; performance metrics; replication mechanism; resource utilization; response time; response time reduction; scientific computing; software applications; stochastic model; task cancellation mechanism; task level; Computational modeling; Equations; Generators; Numerical models; Reliability; Time factors; Vectors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2014 IEEE 22nd International Symposium on
  • Conference_Location
    Paris
  • ISSN
    1526-7539
  • Type

    conf

  • DOI
    10.1109/MASCOTS.2014.13
  • Filename
    7033635
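
The abstract states that a parallel job fails if any of its tasks fails, and that replicas can be run at either the job or the task level. The sketch below illustrates only that reliability argument. It is not the paper's stochastic model: it assumes that every task replica fails independently with the same probability, and it ignores arrivals, response time, and canceling. The function name and its parameters are hypothetical and chosen for illustration.

```python
def job_success_probability(n_tasks, p_task_fail, replicas=1, level="task"):
    """Probability that a parallel job completes without reprocessing.

    Illustrative assumption (not from the paper): each task replica fails
    independently with probability p_task_fail.
    """
    if n_tasks < 1 or replicas < 1:
        raise ValueError("n_tasks and replicas must be at least 1")
    if not 0.0 <= p_task_fail <= 1.0:
        raise ValueError("p_task_fail must lie in [0, 1]")

    if level == "task":
        # Task-level replication: a task is lost only if all of its
        # replicas fail, and the job needs every task to survive.
        return (1.0 - p_task_fail ** replicas) ** n_tasks
    if level == "job":
        # Job-level replication: a whole-job replica succeeds only if all
        # of its tasks succeed, and the job needs one surviving replica.
        single_replica = (1.0 - p_task_fail) ** n_tasks
        return 1.0 - (1.0 - single_replica) ** replicas
    raise ValueError("level must be 'task' or 'job'")


if __name__ == "__main__":
    for level in ("task", "job"):
        for r in (1, 2, 3):
            p = job_success_probability(10, 0.1, replicas=r, level=level)
            print(f"level={level} replicas={r} success={p:.4f}")
```

Under these assumptions, with 10 tasks and a per-task failure probability of 0.1, two replicas per task give a noticeably higher success probability than two replicas of the whole job, because a single failed task no longer invalidates an entire replica.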