Title
Predictive Reliability and Fault Management in Exascale Systems
Journal
ACM COMPUTING SURVEYS
Volume 53, Issue 5, Pages 1-32
Publisher
Association for Computing Machinery (ACM)
Online
2020-09-28
DOI
10.1145/3403956
Related references
Note: Only part of the references are listed.
- Fault tolerance of MPI applications in exascale systems: The ULFM solution
- (2020) Nuria Losada et al. Future Generation Computer Systems-The International Journal of eScience
- The Real-Time Linux Kernel
- (2019) Federico Reghenzani et al. ACM COMPUTING SURVEYS
- Probabilistic Worst-Case Timing Analysis
- (2019) Francisco J. Cazorla et al. ACM COMPUTING SURVEYS
- Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers
- (2018) Mark A. Oxley et al. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
- Exploring the capabilities of support vector machines in detecting silent data corruptions
- (2018) Omer Subasi et al. Sustainable Computing-Informatics & Systems
- Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System
- (2018) IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
- Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters
- (2017) Saurabh Jha et al. IEEE Transactions on Dependable and Secure Computing
- Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model
- (2017) Sheng Di et al. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
- Toward General Software Level Silent Data Corruption Detection for Parallel Applications
- (2017) Eduardo Berrocal et al. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
- CP-FPGA: Energy-Efficient Nonvolatile FPGA With Offline/Online Checkpointing Optimization
- (2017) Zhe Yuan et al. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
- M2DC – Modular Microserver DataCentre with heterogeneous hardware
- (2017) Ariel Oleksiak et al. MICROPROCESSORS AND MICROSYSTEMS
- Measuring the Impact of Memory Errors on Application Performance
- (2017) Mark Gottscho et al. IEEE Computer Architecture Letters
- Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
- (2016) Sheng Di et al. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
- A Survey on Resource Scheduling in Cloud Computing: Issues and Challenges
- (2016) Sukhpal Singh et al. Journal of Grid Computing
- Interpolation-Restart Strategies for Resilient Eigensolvers
- (2016) E. Agullo et al. SIAM JOURNAL ON SCIENTIFIC COMPUTING
- Power and Thermal-Aware Workload Allocation in Heterogeneous Data Centers
- (2015) Abdulla M. Al-Qawasmeh et al. IEEE TRANSACTIONS ON COMPUTERS
- Workload and temperature dependent evaluation of BTI-induced lifetime degradation in digital circuits
- (2015) Behzad Eghbalkhah et al. MICROELECTRONICS RELIABILITY
- A Semi-Analytical Thermal Modeling Framework for Liquid-Cooled ICs
- (2014) Arvind Sridhar et al. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS
- A survey on resource allocation in high performance distributed computing systems
- (2013) Hameed Hussain et al. PARALLEL COMPUTING
- DCworms – A tool for simulation of energy efficiency in distributed computing infrastructures
- (2013) K. Kurowski et al. SIMULATION MODELLING PRACTICE AND THEORY
- A survey of hard real-time scheduling for multiprocessor systems
- (2011) Robert I. Davis et al. ACM COMPUTING SURVEYS
- VL2
- (2011) Albert Greenberg et al. COMMUNICATIONS OF THE ACM
- vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines
- (2011) Lin Shi et al. IEEE TRANSACTIONS ON COMPUTERS
- A survey of online failure prediction methods
- (2010) Felix Salfner et al. ACM COMPUTING SURVEYS
- OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems
- (2010) John E. Stone et al. COMPUTING IN SCIENCE & ENGINEERING
- Predictable High-Performance Computing Using Feedback Control and Admission Control
- (2010) Sang-Min Park et al. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
- Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs
- (2010) Thidapat Chantem et al. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
- Thermal Balancing Policy for Multiprocessor Stream Computing Platforms
- (2009) Fabrizio Mulas et al. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS
- The worst-case execution-time problem—overview of methods and survey of tools
- (2008) Reinhard Wilhelm et al. ACM Transactions on Embedded Computing Systems
- Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
- (2007) J. Langou et al. SIAM JOURNAL ON SCIENTIFIC COMPUTING