Article

Fault Tolerant Task Scheduling on Computational Grid Using Checkpointing Under Transient Faults

Journal

ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING
Volume 39, Issue 12, Pages 8775-8791

Publisher

SPRINGER HEIDELBERG
DOI: 10.1007/s13369-014-1455-2

Keywords

Grid computing; Task scheduling; Fault tolerance; Checkpointing; Weibull failure distribution; Genetic algorithm

Application scheduling is crucial in grid computing environments, and the failure of grid resources poses a great challenge to it. Most existing application scheduling algorithms deal with resource failures through reliability-aware scheduling that does not consider performance, and they do not provide adequate fault tolerance. In this paper, we propose a fault tolerant task scheduling algorithm for independent and dependent (workflow) tasks that considers both the reliability and the performance of grid resources. We focus on Weibull distributed failures of grid resources rather than the commonly adopted assumption of a Poisson failure distribution. To handle such failures, rollback recovery via checkpoint/restart is used to improve system dependability and reliability. The checkpointing frequency is chosen optimally to minimize the fault tolerance overhead (expected wasted time). Based on the minimal wasted time, a new factor, the capacity decreasing factor, is derived; it reflects both the performance and the failure characteristics of the resources. The scheduling decision is then made by a genetic algorithm that uses the capacity decreasing factor to compute the new computing capacity of the resources in the presence of failures. The resulting schedule offers both optimal performance (makespan) and reliability (i.e., the lowest tendency to fail). Precedence constraints among sub-tasks are also considered: tasks are ordered according to their precedence relationships and fault tolerance overhead. Simulation results show that the proposed fault tolerant scheduling algorithm achieves better performance and execution reliability than previous algorithms in the presence of failures.
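
To make the checkpointing idea in the abstract concrete, the following is a minimal sketch and not the paper's model; the abstract gives no formulas, so everything here is an assumption. It estimates the expected wasted time by Monte Carlo simulation (swapped in for whatever analytic optimum the paper derives), assumes that failures are detected instantly and that none occur during restart, and picks the best checkpoint interval from a candidate list. The function `effective_capacity` is a hypothetical stand-in for the capacity decreasing factor: it simply scales a resource's nominal speed by the fraction of time spent on useful work. The genetic algorithm stage is omitted.

```python
import random


def simulate_run(work, interval, ckpt_cost, restart_cost, shape, scale, rng):
    """Wall-clock time to finish `work` time units of computation with
    periodic checkpoints and Weibull-distributed times between failures.
    Assumption: no failure strikes during a restart."""
    done = 0.0   # work already protected by a checkpoint
    clock = 0.0  # elapsed wall-clock time
    # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape
    next_failure = rng.weibullvariate(scale, shape)
    while done < work:
        segment = min(interval, work - done)
        # the final segment is not followed by a checkpoint
        cost = segment + (ckpt_cost if done + segment < work else 0.0)
        if clock + cost <= next_failure:
            clock += cost
            done += segment
        else:
            # failure inside the segment: roll back to the last checkpoint
            clock = next_failure + restart_cost
            next_failure = clock + rng.weibullvariate(scale, shape)
    return clock


def expected_waste(work, interval, ckpt_cost, restart_cost, shape, scale,
                   runs=500, seed=0):
    """Monte Carlo estimate of wasted time (checkpoints, rework, restarts)."""
    rng = random.Random(seed)
    total = sum(
        simulate_run(work, interval, ckpt_cost, restart_cost, shape, scale, rng)
        for _ in range(runs)
    )
    return total / runs - work


def best_interval(work, candidates, **params):
    """Candidate interval with the smallest estimated waste."""
    return min(candidates, key=lambda tau: expected_waste(work, tau, **params))


def effective_capacity(nominal_speed, work, waste):
    """Hypothetical capacity reduction: speed scaled by useful-time fraction."""
    return nominal_speed * work / (work + waste)


if __name__ == "__main__":
    # Illustrative parameters only; none of them come from the paper.
    params = dict(ckpt_cost=0.5, restart_cost=1.0, shape=0.7, scale=50.0)
    tau = best_interval(100.0, [2.0, 5.0, 10.0, 20.0, 50.0], **params)
    waste = expected_waste(100.0, tau, **params)
    print(tau, waste, effective_capacity(10.0, 100.0, waste))
```

A scheduler in the spirit of the abstract could then rank resources by such a reduced capacity instead of their nominal speed, so that a fast but failure-prone resource is not automatically preferred over a slower, more reliable one.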
