Article

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

JOURNAL OF SUPERCOMPUTING (2013)

Journal

JOURNAL OF SUPERCOMPUTING

Volume 65, Issue 3, Pages 1302-1326

Publisher

SPRINGER

DOI: 10.1007/s11227-013-0884-0

Keywords

High Performance Computing (HPC); Checkpoint/restart; Fault tolerance; Clusters; Reliability; Performance

Categories

Computer Science, Hardware & Architecture; Computer Science, Theory & Methods; Engineering, Electrical & Electronic

Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for these systems, along with the issues those approaches raise. Rollback-recovery techniques are discussed because they are the approach most often used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain and to facilitate the development of new checkpointing solutions.
